RFC - systemd integration with watchdog support and restart on thread failure

tim · 14 July 2017 15:11

Hello,

Here’s an emonhub tree which includes systemd integration. It could do with some testing, since I’ve only tested it on a machine running Debian 9.0 (stretch) and systemd.

Code review welcome.

Features:

. Systemd unit file
. Logging to journald by default (configurable)
. Updated install script
. Upgrade from sysvinit with install scripts
. Systemd unit file setup to automatically restart emonhub on failure by default
. Supports systemd sd_notify interface to report successful startup
. Supports systemd sd_notify interface’s application watchdog functionality
. Supports automatic restart of emonhub on thread failure via watchdog/heartbeat
. Basic docs

You’ll need to make some config file changes in order to enable the watchdog functions - see conf/emonpi.default.emonhub.conf for details.

tim · 15 July 2017 06:56

You can use the github compare facility to do a patch inspection e.g. Comparing openenergymonitor:emon-pi...tim-seoss:emon-pi-systemd · openenergymonitor/emonhub · GitHub

A big chunk of the work was trying to keep the current behaviour largely consistent with the current install script. Similarly if you call things like /etc/init.d/emonhub restart then the changes to the SysV init.d file mean that this now calls the systemd unit file (but only on systems which are using systemd).

There may well be a case or two I haven’t caught, so I’d appreciate some testing before submitting a pull request.

TrystanLea · 15 July 2017 08:27

Thanks @tim, I will have a good look at this and get a test running over the next couple of days.

tim · 17 July 2017 09:16

Commenting on @pb66 comments from a different thread (but which are specific to this RFC)…

Use of /etc/default/emonhub is retained (but now the install script backs this up if the user has changed it, rather than overwritting their changes). I’ve changed the default to use journald, but the install script does put a message at the end of the log file saying where the logs have been moved to, and also pointing at instructions to restore the old behaviour if you want.

Currently, the init script changes ownership on the log file directory every time emonhub is started - with my sysadmin hat on, this feels wrong to me (as well as being awkward to express in a systemd unit file). As a sysadmin if I do something like change the group ownership of the log file directory so that selected users can view it, it should stay that way, not get changed next time the daemon starts, so I think this should be restricted to the install script. Thoughts?

If a log file is in use, and the service crashes or fails to start, then:

systemctl status emonhub

will contain no useful debug messages (this could be worked around by emitting a “look in $LOGFILE for errors” type message).

How about:

. Default to keeping emonhub.log (but include journald instructions)
. Set file permissions for emonhub.log etc. at install time only
. If running under systemd, and using a logfile, then direct the user to the log file in a way that’s obvious in systemctl status.
. If easy to do, then emit WARNING messages to both mechanisms (haven’t looked at this yet)?

If you don’t use systemd, then the existing behaviour is maintained with my patches.

I think it’s probably best to do both:

Restart threads, but if the daemon locks up completely (rare, but not entirely unheard of), or otherwise the thread restart fails (perhaps some thread exhausts a resource over time, like filehandles, starving the others), then systemd will restart the service as a whole. If internal clean-up code can fix things, then no-problem, systemd doesn’t need to do anything. Ideally in either condition some sort of automatic or semi-automatic backtrace reporting would be great (like Chrome, Firefox, libreoffice, KDE etc. do).

I agree that it’s best not to just plonk a sticking plaster on top of bad code, so the watchdog restart and/or thread restart is only half the solution I think without proper back-trace reporting.

On the other hand especially for an un-monitored potentially remote device, then defaulting to having proper watchdogs are essential, and doing so on all three levels (system-wide, per-daemon, and per-thread) is good to minimise data loss in each different type of failure mode.

BTW - I don’t know if the emonpi does this by default, but a system-wide watchdog can easily be enabled on a Pi by setting #RuntimeWatchdogSec=60 or similar in /etc/systemd/system.conf.

TrystanLea · 17 July 2017 14:51

Hello @tim

I’ve merged your changes locally to test, I’ve added the emonhub.conf entries and tried restarting emonhub with systemctl. It all seems to run fine including the log etc.

Then I tried to trigger one the MQTT interfacer to crash by running “self.stop = True” after about 30 seconds in the MQTT interfacer. It results in thread is dead but emonhub does not restart?

Have I missed a step in installing/merging your changes?

I agree its important to have both watchdogs, automatic restarting etc as well as error tracking so that its not just a sticking plaster. I think this will be a good step forwards. Thankyou for your work on this.

tim · 19 July 2017 09:43

The only thing I can think of is…

I’ve added a couple of new settings to the [hub] section of emonhub.conf. These settings are off by default (i.e. unless otherwise defined in the config file), my changes also alter the default emonhub.conf config file to add these settings, but this file doesn’t get installed into existing installations of course so as not to clobber user’s changes elsewhere in the file…

[hub]
### If set, then enable the systemd service watchdog
systemd_watchdog_timeout_secs = 30

### If the timeout is enabled above, we're running under systemd's
### control, and this is set to 'yes', then if a thread is dead, we'll
### stop issuing watchdog heartbeats.  This will cause systemd to
### restart the emonhub service (enabled by default in the emonhub
### systemd unit file).
systemd_heartbeat_require_all_threads = yes

No problem. I’m keen to get Paul’s @pb66 further feedback, and rework the patches based on that. BTW, I don’t think they should be merged before then - in particular I’ll see if I can come up with something better for log file handling.

What are you feelings about having some automatic checks and tests on commit (using travis or otherwise), continuing the code cleanup, and also putting in some sort of automatic backtrace reporting?

TrystanLea · 19 July 2017 10:01

Thanks @tim I did add those settings in but no luck, if you run “self.stop = True” in one of the interfacers are you able to confirm that it crashes the thread and restarts?
Yes happy to wait for @pb66’s input.

Regarding automatic checks and tests on commit, this is something that @glyn.hudson setup for emoncms and our other repositories, Im all for it but its not something I know as much about, the github integration is nice and pain free.

Yes I think it would be great to include backtrace reporting , It will be vital if we are automatically restarting emonhub if it crashes. A specific local backtrace log file would be useful. Then I guess we need to as testers and developers periodically check this log file to see if there are regular crash events happening. Perhaps ultimately reporting to the emoncms admin interface the number of crash events would be useful but maybe that’s something for a later stage?

tim · 19 July 2017 10:28

Hi @TrystanLea - thanks for testing…

I put a self.stop = True into src/interfacers/EmonHubMqttInterfacer.py and got this in the logs (via journalctl -f -u emonhub):

python[9040]: 2017-07-19 11:16:04,834 DEBUG    MainThread Won't heartbeat - state bad
python[9040]: 2017-07-19 11:16:05,034 WARNING  MainThread MQTT thread is dead
python[9040]: 2017-07-19 11:16:05,035 DEBUG    MainThread Won't heartbeat - state bad
python[9040]: 2017-07-19 11:16:05,235 WARNING  MainThread MQTT thread is dead
python[9040]: 2017-07-19 11:16:05,236 DEBUG    MainThread Won't heartbeat - state bad
python[9040]: 2017-07-19 11:16:05,436 WARNING  MainThread MQTT thread is dead
python[9040]: 2017-07-19 11:16:05,437 DEBUG    MainThread Won't heartbeat - state bad
systemd[1]: emonhub.service: Watchdog timeout (limit 30s)!
systemd[1]: emonhub.service: Killing process 9040 (python) with signal SIGABRT.
systemd[1]: emonhub.service: Main process exited, code=killed, status=6/ABRT
systemd[1]: emonhub.service: Unit entered failed state.
systemd[1]: emonhub.service: Failed with result 'watchdog'.
systemd[1]: emonhub.service: Service hold-off time over, scheduling restart.
systemd[1]: Stopped Open Energy Monitor emonHub.
systemd[1]: Starting Open Energy Monitor emonHub...
python[9446]: 2017-07-19 11:16:21,020 INFO     MainThread EmonHub emonHub 'emon-pi' variant v1.2
python[9446]: 2017-07-19 11:16:21,020 INFO     MainThread Opening hub...
python[9446]: 2017-07-19 11:16:21,020 INFO     MainThread Logging level set to DEBUG
python[9446]: 2017-07-19 11:16:21,021 INFO     MainThread Creating EmonHubJeeInterfacer 'RFM2Pi'

Are you getting any related error messages via journalctl -u emonhub and/or in emonhub.log? Can you paste the output of systemctl status emonhub and systemctl cat emonhub please?

Cheers,

Tim.

TrystanLea · 19 July 2017 12:22

My logs seem to be missing the warning when the thread dies:

MainThread Won't heartbeat - state bad

e.g:

2017-07-19 12:14:03,347 WARNING  MainThread MQTT thread is dead
2017-07-19 12:14:03,548 WARNING  MainThread MQTT thread is dead
2017-07-19 12:14:03,749 WARNING  MainThread MQTT thread is dead
2017-07-19 12:14:03,950 WARNING  MainThread MQTT thread is dead
2017-07-19 12:14:04,235 WARNING  MainThread MQTT thread is dead
2017-07-19 12:14:04,436 WARNING  MainThread MQTT thread is dead
2017-07-19 12:14:04,637 WARNING  MainThread MQTT thread is dead

journalctl -u emonhub returns:

No journal files were found.

I must be missing a small step

In addition to setting emonhub.conf these are the steps I followed to test on my test emonpi here:

git checkout -b tim-emon-pi-systemd emon-pi
git pull https://github.com/tim-seoss/emonhub.git emon-pi-systemd
sudo systemctl daemon-reload
sudo service emonhub restart

tim · 19 July 2017 14:56

@TrystanLea the systemd unit file won’t be installed unless you run the install script - in that case the original init script will still be used as a fall back, and the systemd notify API isn’t enabled. emonhub then doesn’t detect that API being active ( if os.getenv('NOTIFY_SOCKET'): ) and doesn’t enable the watchdog code.

To verify that’s the case, run:

systemctl cat emonhub

before and after running the install script (if the correct unit file hasn’t been installed, then Debian will automatically generate one which refers to the init script, and you’ll see a line like ExecStart=/etc/init.d/emonhub start in the unit file - i.e. the backwards-compatible unit file just calls the init script instead of directly controlling the daemon).

The install script should probably stop emonhub via the init script automatically before installing the new unit file and starting it again (via the new unit file)… I’ll put that on the TODO list.

tim · 19 July 2017 15:11

TODO for these changes (I’ll come back and edit this as appropriate):

Definite:

When the install script is run, if emonhub is running under systemd via the automatic sysv-compatible unit file (systemd-sysv-generator) then restart cleanly.
Improve clarity on the unit file comments e.g. systemd_heartbeat_require_all_threads section is a bit awkwardly worded.

Maybe:

Default to using emonhub.log (but allow the use of journald instead?)
Warn if running under systemd, but the old init script is still being used?
Install script should alter permissions, not the init script (and unit file)?

Later:

pre-commit triggers? Travis?
Auto backtrace reporting?

TrystanLea · 20 July 2017 08:30

Hello @tim, I’ve run the install script, systemctl cat emonhub now returns the system unit file, but I cant seem to start emonhub, this is the example output when I run ‘sudo service emonhub status’

emonhub.service - Open Energy Monitor emonHub
   Loaded: loaded (/etc/systemd/system/emonhub.service; enabled)
   Active: activating (auto-restart) (Result: exit-code) since Thu 2017-07-20 08:26:48 UTC; 6s ago
  Process: 17929 ExecStartPre=/bin/sh -c [ -f ${EMONHUB_LOG} ] && ( tail ${EMONHUB_LOG} | grep -q journald ) || ( expr "x${SYSTEMD_EXTRA_ARGS}" :  x.*--logfile' > /dev/null || echo emonhub logs to journald by default with systemd see /etc/default/emonhub >> ${EMONHUB_LOG} )' (code=exited, status=2)
  Process: 17926 ExecStartPre=/bin/sh -c [ -f $EMONHUB_PATH$NAME.py ] || ( echo 'Fatal: cannot find $EMONHUB_PATH$NAME.py' && exit 1 ) >&2 (code=exited, status=0/SUCCESS)
  Process: 17922 ExecStartPre=/bin/sh -c [ -f $EMONHUB_CONFIG ] || ( /bin/echo Fatal: cannot find $EMONHUB_CONFIG && exit 1 ) >&2 (code=exited, status=0/SUCCESS)
 Main PID: 17840 (code=exited, status=1/FAILURE)

Jul 20 08:26:48 emonpi systemd[1]: Failed to start Open Energy Monitor emonHub.
Jul 20 08:26:48 emonpi systemd[1]: Unit emonhub.service entered failed state.

tim · 20 July 2017 13:34

Thanks @TrystanLea sorry about that. Can you let me know what’s in /etc/default/emonhub and /etc/default/emonhub.local please, and I’ll rework the branch…

TrystanLea · 20 July 2017 15:09

Nothing in /etc/default/emonhub.local. But /etc/default/emonhub looks the same as:

github.com

tim-seoss/emonhub/blob/emon-pi-systemd/conf/default/emonhub

## emonHub settings

# Edit this to configure the parameters used in
# the /etc/init.d/emonhub script.

# n.b. The emonhub install script currently overwrites
# this file, but creates a backup.

# This file should be deployed to /etc/default/emonhub
# unless you have edited the init.d file to give an
# alternate SYSCONF_PATH

# If started from the default systemd service file,
# the optional file /etc/default/emonhub.local
# is used after this file, and will override any
# settings defined here.

# Specify the directory in which emonhub.py is found:
EMONHUB_PATH=/usr/share/emonhub/

This file has been truncated. show original

tim · 21 July 2017 16:58

I’d previously been testing on my box which runs Debian 9.0 Stretch, unfortunately 8.0 (Jessie) has an older systemd which is missing a feature which I was using. I’ve worked around that, but I think it’s probably best to just drop the workaround and limit setting the watchdog timeout in the unit file only to simplify the code. I haven’t done that yet, but in the meantime, I’ve pushed the workaround and fixes to GitHub - tim-seoss/emonhub at emon-pi-systemd

Hopefully that’ll work out of the box.

Still more cleanup work, and I’ll also need to squash some of the commits together. @TrystanLea if you have a minute, please could you give it another spin?

git checkout -b tim-emon-pi-systemd emon-pi
git pull https://github.com/tim-seoss/emonhub.git emon-pi-systemd
./install

should do it…

TrystanLea · 5 August 2017 07:40

Hello @tim, sorry for the delay,

tried your latest change this morning and got the following:

pi@emonpi(rw):emonhub$ sudo service emonhub status
● emonhub.service - Open Energy Monitor emonHub
   Loaded: loaded (/etc/systemd/system/emonhub.service; enabled)
   Active: activating (auto-restart) (Result: exit-code) since Sat 2017-08-05 07:37:31 UTC; 8s ago
  Process: 5659 ExecStart=/usr/bin/python ${EMONHUB_PATH}${NAME}.py --config-file $EMONHUB_CONFIG $SYSTEMD_EXTRA_ARGS (code=exited, status=2)
  Process: 5654 ExecStartPre=/bin/sh -c [ -f ${EMONHUB_LOG} ] && ( tail ${EMONHUB_LOG} | grep -q journald ) || ( expr "x${SYSTEMD_EXTRA_ARGS}" : "x.*--logfile" > /dev/null || echo "emonhub logs to journald by default with systemd see /etc/default/emonhub" >> ${EMONHUB_LOG} ) (code=exited, status=0/SUCCESS)
  Process: 5651 ExecStartPre=/bin/sh -c [ -f $EMONHUB_PATH$NAME.py ] || ( echo 'Fatal: cannot find $EMONHUB_PATH$NAME.py' && exit 1 ) >&2 (code=exited, status=0/SUCCESS)
  Process: 5648 ExecStartPre=/bin/sh -c [ -f $EMONHUB_CONFIG ] || ( /bin/echo Fatal: cannot find $EMONHUB_CONFIG && exit 1 ) >&2 (code=exited, status=0/SUCCESS)
 Main PID: 5659 (code=exited, status=2)

Aug 05 07:37:31 emonpi systemd[1]: Failed to start Open Energy Monitor emonHub.
Aug 05 07:37:31 emonpi systemd[1]: Unit emonhub.service entered failed state.

TrystanLea · 5 August 2017 07:56

I tried resetting the installation with:

sudo systemctl stop emonhub
sudo systemctl disable emonhub
sudo rm /etc/systemd/system/emonhub.service
sudo rm /etc/init.d/emonhub
sudo rm /etc/default/emonhub
sudo systemctl daemon-reload
sudo systemctl reset-failed

then a re-install, but no luck