Emonhub fails to start

borpin · 26 April 2019 07:39

As both of these are python scripts I think that might be the culprit. You could try and reinstall it.

However, I suggest getting a brand new image on a new SD card emonSD pre built SD card Download & Change Log · openenergymonitor/emonpi Wiki · GitHub and try that.

rdavies6 · 2 May 2019 08:33

My EmonPi stopped reporting readings a couple of days ago (the 27th or 28th April)…but I only noticed yesterday. It had been running perfectly for some weeks.I restarted my EmonPi, tried updating all the software…but to no avail, - I just can’t get readings from the EmonPi. The error that’s reported is the same as @mickeymouse690 - EmonHub fails to start

-- Logs begin at Thu 2016-11-03 17:16:43 UTC, end at Thu 2019-05-02 08:28:05 UTC. -- May 01 21:41:32 emonpi systemd[1]: emonhub.service: Service hold-off time over, scheduling restart.
May 01 21:41:32 emonpi systemd[1]: Stopped emonHub service description.
May 01 21:41:32 emonpi systemd[1]: Started emonHub service description.
May 01 21:41:32 emonpi systemd[1]: emonhub.service: Main process exited, code=killed, status=11/SEGV May 01 21:41:32 emonpi systemd[1]: emonhub.service: Unit entered failed state.
May 01 21:41:32 emonpi systemd[1]: emonhub.service: Failed with result 'signal'.
May 01 21:41:32 emonpi systemd[1]: emonhub.service: Service hold-off time over, scheduling restart.
May 01 21:41:33 emonpi systemd[1]: Stopped emonHub service description.
May 01 21:41:33 emonpi systemd[1]: Started emonHub service description.
May 01 21:41:33 emonpi systemd[1]: emonhub.service: Main process exited, code=killed, status=11/SEGV May 01 21:41:33 emonpi systemd[1]: emonhub.service: Unit entered failed state.
May 01 21:41:33 emonpi systemd[1]: emonhub.service: Failed with result 'signal'.
May 01 21:41:33 emonpi systemd[1]: emonhub.service: Service hold-off time over, scheduling restart.
May 01 21:41:33 emonpi systemd[1]: Stopped emonHub service description.
May 01 21:41:33 emonpi systemd[1]: Started emonHub service description.
May 01 21:41:33 emonpi systemd[1]: emonhub.service: Main process exited, code=killed, status=11/SEGV May 01 21:41:33 emonpi systemd[1]: emonhub.service: Unit entered failed state.
May 01 21:41:33 emonpi systemd[1]: emonhub.service: Failed with result 'signal'.
May 01 21:41:33 emonpi systemd[1]: emonhub.service: Service hold-off time over, scheduling restart.
May 01 21:41:33 emonpi systemd[1]: Stopped emonHub service description.
May 01 21:41:33 emonpi systemd[1]: Started emonHub service description.
May 01 21:41:33 emonpi systemd[1]: emonhub.service: Main process exited, code=killed, status=11/SEGV May 01 21:41:33 emonpi systemd[1]: emonhub.service: Unit entered failed state.
May 01 21:41:33 emonpi systemd[1]: emonhub.service: Failed with result 'signal'.
May 01 21:41:33 emonpi systemd[1]: emonhub.service: Service hold-off time over, scheduling restart.
May 01 21:41:33 emonpi systemd[1]: Stopped emonHub service description.
May 01 21:41:33 emonpi systemd[1]: emonhub.service: Start request repeated too quickly.
May 01 21:41:33 emonpi systemd[1]: Failed to start emonHub service description.
May 01 21:41:33 emonpi systemd[1]: emonhub.service: Unit entered failed state.
May 01 21:41:33 emonpi systemd[1]: emonhub.service: Failed with result 'signal'.

(interestingly, the log viewer has the error messages, but when I try and download the log it tells me that /var/log/emonhub/emonhub.log does not exist!)

In case this is a more systemic issue, rather than just a coincidence…is there any information that i can post that might help someone debug the issue?

Thanks

borpin · 2 May 2019 08:53

Did the emonhub not restart at this point? Did you wait and see?

Did you just update emoncms or Raspbian as well?

rdavies6 · 2 May 2019 09:18

I waited…and waited…but no restart. I then tried to restart EmonHub separately through the admin interface as well, but again to no avail!

I only updated the EmonCMS software via the admin interface buttons…I’ve done nothing to update Raspbian (which I assume i would need to ssh into the RaspPi and run ‘apt-get update’. I could look at doing that if recommended.

I have just tried ssh’ing into the EmonPi (from my phone over a VPN…isn’t technology great!) - and I have access to try some commands now.

Thanks

borpin · 2 May 2019 11:33

Hi Robin, certainly do not update via apt. It is breaking the database setup.

This error is really strange.

How old is the setup?

rdavies6 · 2 May 2019 12:14

I saw the warning re: updating the base OS … so I will leave well alone!

The hardware was installed just before Christmas, and it was commissioned in January. Been working well too! I checked the precise timing of when the data stopped being sent…and it was Apr 30th at 6.15am

borpin · 2 May 2019 13:54

All I can find is a suggestion that the segmentation error is due to a bad SDCard.

I think that using fsck might tell you if there are any errors

 sudo fsck -vn /dev/mmcblk0p3
 sudo fsck -vn /dev/mmcblk0p1

rdavies6 · 2 May 2019 15:00

Brian - thanks for keeping the investigation going here! Here’s the output from my EmonPi

pi@emonpi:~ $ sudo fsck -vn /dev/mmcblk0p3 fsck from util-linux 2.29.2 e2fsck 1.43.4 (31-Jan-2017) Warning!  /dev/mmcblk0p3 is mounted.
/dev/mmcblk0p3 was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes Inodes that were part of a corrupted orphan linked list found.  Fix? no

Inode 466760 was part of the orphaned inode list.  IGNORED.
Deleted inode 504038 has zero dtime.  Fix? no

Deleted inode 504040 has zero dtime.  Fix? no

Inode 614928 was part of the orphaned inode list.  IGNORED.
Inode 614963 is in use, but has dtime set.  Fix? no

Deleted inode 615023 has zero dtime.  Fix? no

Deleted inode 615037 has zero dtime.  Fix? no

Inode 615089 was part of the orphaned inode list.  IGNORED.
Deleted inode 615091 has zero dtime.  Fix? no

Inode 614963, i_size is 18014398509483008, should be 0.  Fix? no

Inode 615092 has a bad extended attribute block 524288.  Clear? no

Extended attribute block 524288 has h_blocks > 1.  Clear? no

Extended attribute block 524288 is corrupt (invalid value).  Clear? no

Extended attribute block 524288 is corrupt (invalid value).  Clear? no

Extended attribute block 524288 is corrupt (allocation collision).  Clear? no

Inode 615092, i_blocks is 0, should be 2.  Fix? no

Deleted inode 615489 has zero dtime.  Fix? no

Inode 615490 was part of the orphaned inode list.  IGNORED.
Inode 615491 was part of the orphaned inode list.  IGNORED.
Deleted inode 615492 has zero dtime.  Fix? no

Deleted inode 615525 has zero dtime.  Fix? no

Inode 615537, i_blocks is 532480, should be 0.  Fix? no

Inode 615538 was part of the orphaned inode list.  IGNORED.
Inode 615677 was part of the orphaned inode list.  IGNORED.
Inode 615677 is in use, but has dtime set.  Fix? no

Deleted inode 615680 has zero dtime.  Fix? no

Deleted inode 615690 has zero dtime.  Fix? no

Inode 615741 was part of the orphaned inode list.  IGNORED.
Deleted inode 615742 has zero dtime.  Fix? no

Deleted inode 615777 has zero dtime.  Fix? no

Deleted inode 615859 has zero dtime.  Fix? no

Deleted inode 615880 has zero dtime.  Fix? no

Inode 615888 was part of the orphaned inode list.  IGNORED.
Inode 615912 was part of the orphaned inode list.  IGNORED.
Inode 615917 was part of the orphaned inode list.  IGNORED.
Inode 615918 was part of the orphaned inode list.  IGNORED.
Deleted inode 615920 has zero dtime.  Fix? no

Inode 615886, i_size is 2252074691854336, should be 0.  Fix? no

Inode 615886, i_blocks is 268435456, should be 0.  Fix? no

Error reading extended attribute block 524288 (Extended attribute block has a bad header).
/dev/mmcblk0p3: ********** WARNING: Filesystem still has errors **********

e2fsck: aborted

/dev/mmcblk0p3: ********** WARNING: Filesystem still has errors **********

pi@emonpi:~ $ sudo fsck -vn /dev/mmcblk0p1 fsck from util-linux 2.29.2 fsck.fat 4.1 (2017-01-24) Checking we can access the last sector of the filesystem
0x41: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
 Automatically removing dirty bit.
Boot sector contents:
System ID "mkfs.fat"
Media byte 0xf8 (hard disk)
       512 bytes per logical sector
       512 bytes per cluster
        32 reserved sectors
First FAT starts at byte 16384 (sector 32)
         2 FATs, 32 bit entries
    348672 bytes per FAT (= 681 sectors) Root directory start at cluster 2 (arbitrary size) Data area starts at byte 713728 (sector 1394)
     87078 data clusters (44583936 bytes)
32 sectors/track, 64 heads
         0 hidden sectors
     88472 sectors total
Checking for unused clusters.
Checking free cluster summary.
Leaving filesystem unchanged.
/dev/mmcblk0p1: 179 files, 45008/87078 clusters pi@emonpi:~ $

borpin · 2 May 2019 15:09

Ok so there are errors on the main file system. The data is on the other partition (which seems to be ok) so should be safe but at your own risk.

What you did there was a dry run.

sudo fsck -vp /dev/mmcblk0p3

should actually fix all the errors it finds.

No guarantees, but if it is bad sectors causing the issue this might fix it.

Bill.Thomson · 2 May 2019 15:15

One caveat here. DON’T do any “repairs” while the volume is mounted.
Doing so will mangle your filesystem.

To repair your FS safely, the partition you want to repair needs to be unmounted.

To unmount it, issue the command sudo umount /dev/mmcblk0p3

Note that’s umount vice unmount

Another method is to put the card in another computer that’s running Linux and run fsck with the card not mounted.

rdavies6 · 2 May 2019 15:18

Thanks for the guidance…I’ll have to wait until I’m back at home to take a look then. Also, finding another Linux box with a SD card reader is going to be a challenge…all my other Rasp Pi’s are in storage after a home move…so the only other option is spinning up a VM on a Macbook.

Bill.Thomson · 2 May 2019 15:25

Looks like we were typing at the same time.
You should be able to unmount the 3rd partion to do the repair.

I edited my post above to include the command to unmount the partition.

borpin · 2 May 2019 15:40

I think we might be barking up the wrong tree. I think it is something to do with changes to systemd to harden it but as yet have not found out what.

pb66 · 2 May 2019 15:54

How are you interpreting the results Brian? The main FS (or rootfs) is on partition 2, it seems you are testing partitions 1 and 3.

Also -n is not dry run, -N is dry-run, -n is just say “no” to certain actions (from the man page)

-n
For some filesystem-specific checkers, the -n option will cause the fs-specific fsck to avoid attempting to repair any problems, but simply report such problems to stdout. This is however not true for all filesystem-specific checkers. In particular, fsck.reiserfs (8) will not report any corruption if given this option. fsck.minix (8) does not support the -n option at all.

-N
Don’t execute, just show what would be done.

Also -v is not a recognised option, I think you were aiming for -V verbose output

-V
Produce verbose output, including all file system-specific commands that are executed.

I think the commands should have perhaps been

sudo fsck -VN /dev/mmcblk0p2
sudo fsck -VN /dev/mmcblk0p3
sudo fsck -VN /dev/mmcblk0p1

Ordinarily though, segmentation faults are not sdcard failures, they are usually from accessing the wrong parts of RAM. PHP is renowned for it and we have seen Redis do it too. It would be good to rule out sdcard errors by fixing with fsck but I’m not entirely convinced that is what the error is.

I think it would be wise to note that we have never seen this issue with emonhub in 5+ years, and in the couple of weeks since changing to systemd (rather than init.d) we have seen 2 identical reports in as many weeks. I am not a big fan of coincidences, not saying they don’t happen, just saying what is often thought to be just coincidental often isn’t.

borpin · 2 May 2019 18:41

Which man? I discovered the man is different for the different tools that are called by fsck, so the options for fsck.ext4 are different to fsck.fat. You can see in the first line which tool is actually called. fsck itself is just a wrapper.

From man of e2fsck, fsck.fat, fsck.ext4, & fsck.ext2

 -v     Verbose mode.

not all have N as an option
fsck.fat

-n  No-operation mode: non-interactively check for errors, but don't write anything to the filesystem.

e2fsck

-n     Open  the  filesystem read-only, and assume an answer of `no' to all questions.  Allows e2fsck to be used non-interactively.  This
              option may not be specified at the same time as the -p or -y options.

-vn are valid options.

-V     Print version information and exit.

I think those services running via systemd have been in place a while longer than that. In both cases they had been running for a while and suddenly did it with apparently no interaction or changes IIRC.

There are some comments that there might be core dumps that could shed some light, but that is outside what I know about.

pb66 · 2 May 2019 20:31

The fsck man as that is the command you ran. I couldn’t find any ref to the lower case -v and for -n it stated it “just output to stdout” any changes that were needed which fitted what we were seeing. Reading closer, I see it also states

Options which are not understood by fsck are passed to the filesystem-specific checker.

which I missed first time round.

None the less, When I was looking at the man previously it was to try and determine how you knew what condition the rootfs (mmcblk0p2) was in (if that’s what you mean by main fs) by running checks on the other 2 mounted partitions (mmcblk0p1 and mmcblk0p3) wondering if there was perhaps some cascading check that automatically checked the fs the target was mounted to, but I didn’t. So was it intentional to only check the boot and data partitions? I’m struggling to understand how we know the “main” fs is ok.

Ok, just checked and the changes to update emonhub to systemd were added to the update process in Feb, you are right that is more than a couple of weeks. But in the grand scale of things, it is still extremely recent when emonhub has not failed to start with these errors in the 5 years prior. We do not know when the buttons were pressed to invoke an update nor how long it might take to fault. All I am saying is it is far more likely to be the (fairly) recent changes to the emonhub service than 2 independent coincidental sdcard failures, especially given the fault description. I’m just saying don’t draw conclusions too quickly nor overlook the common factors of 2 reported errors that appear identical. ie they have both pulled in the recent changes before failing.

borpin · 3 May 2019 08:08

I made that mistake until I dug a bit deeper.

I am reasonably sure that when I checked on my system before offering the advice, I did it on an EmonSD imaged system. I have been swapping between that and an install script image so could have made a mistake in the partition numbers.

I do not think the users did anything initially either to the OS or EmonCMS.

rdavies6 · 3 May 2019 08:10

Many thanks for continuing the discussion with @borpin Unfortunately i didn’t get much of chance to look at repairing the filesystems last night. I did try and quickly unmount the filesystem, but the filesystem was busy, and i didn’t have enough time to then sort out an alternative process of running fsck on a separate device (life - in fact my wife- had other plans involving Ikea and some flatpack furniture last night). However I should be able to take a look at repairing the filesystems over the bank holiday weekend.

I know I have updated the software sometime since February - although I can’t pin down exactly when…however I hadn’t touched the EmonPi over the last few weeks…

rdavies6 · 3 May 2019 11:23

I temporarily enabled coredumps on the EmonPi, and captured a dump that tallies with the Emonhub startup pid listed in the logfile. Would his be helpful to someone (if so, what’s the best way to upload it?)?

emonhub.service - emonHub service description
   Loaded: loaded (/home/pi/emonhub/service/emonhub.service; enabled; vendor preset: enabled)
   Active: failed (Result: core-dump) since Fri 2019-05-03 10:27:00 UTC; 48min ago
  Process: 949 ExecStart=/usr/share/emonhub/emonhub.py --config-file=/home/pi/data/emonhub.conf (code Main PID: 949 (code=dumped, signal=SEGV)

May 03 10:27:00 emonpi systemd[1]: emonhub.service: Failed with result 'core-dump'.
May 03 10:27:00 emonpi systemd[1]: emonhub.service: Service hold-off time over, scheduling restart.
May 03 10:27:00 emonpi systemd[1]: Stopped emonHub service description.
May 03 10:27:00 emonpi systemd[1]: emonhub.service: Start request repeated too quickly.
May 03 10:27:00 emonpi systemd[1]: Failed to start emonHub service description.
May 03 10:27:00 emonpi systemd[1]: emonhub.service: Unit entered failed state.

There were quite a few python related corefiles being generated…so i disable coredumps again

borpin · 3 May 2019 11:28

Intrigued to know how - not something I have ever done

You should be able to attach the file to a post (if it is a reasonable size). If you can’t I’ll bump your membership level - I’m not sure what level it needs to be.