RFC - systemd integration with watchdog support and restart on thread failure

tim · 17 July 2017 09:16

Commenting on @pb66 comments from a different thread (but which are specific to this RFC)…

Use of /etc/default/emonhub is retained (but the install script now backs it up if the user has changed it, rather than overwriting their changes). I’ve changed the default to use journald, but the install script does put a message at the end of the log file saying where the logs have moved to, with a pointer to instructions for restoring the old behaviour if you want it.

Currently, the init script changes ownership of the log file directory every time emonhub is started. With my sysadmin hat on, this feels wrong to me (as well as being awkward to express in a systemd unit file). If, as a sysadmin, I change the group ownership of the log file directory so that selected users can view it, it should stay that way rather than being reset the next time the daemon starts. So I think this should be restricted to the install script. Thoughts?

If a log file is in use, and the service crashes or fails to start, then:

systemctl status emonhub

will contain no useful debug messages (this could be worked around by emitting a “look in $LOGFILE for errors” type message).

How about:

1. Default to keeping emonhub.log (but include journald instructions)
2. Set file permissions for emonhub.log etc. at install time only
3. If running under systemd, and using a logfile, then direct the user to the log file in a way that’s obvious in systemctl status.
4. If easy to do, then emit WARNING messages to both mechanisms (haven’t looked at this yet)?
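A rough sketch of how the last two points might look in Python, using only the standard logging module. It is not taken from emonhub: the logger name and the setup_logging helper are made up for illustration. It relies on systemd capturing a service's stderr into the journal (the default unless the unit overrides it), so anything written there shows up in systemctl status.

```python
import logging
import sys


def setup_logging(logfile=None, level=logging.DEBUG):
    """Log everything to `logfile` (if given) and mirror WARNING+ to stderr.

    Hypothetical helper, not emonhub's actual logging setup.
    """
    logger = logging.getLogger("emonhub_sketch")  # made-up logger name
    logger.setLevel(level)
    logger.handlers.clear()
    logger.propagate = False
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    # Under systemd, stderr normally ends up in the journal.
    stderr_handler = logging.StreamHandler(sys.stderr)
    stderr_handler.setFormatter(fmt)

    if logfile:
        file_handler = logging.FileHandler(logfile)
        file_handler.setFormatter(fmt)
        logger.addHandler(file_handler)
        # Full detail goes to the file; only warnings and worse are mirrored.
        stderr_handler.setLevel(logging.WARNING)
        # Pointer line so "systemctl status" shows where the real log lives.
        print("Logging to %s; look there for details" % logfile,
              file=sys.stderr, flush=True)

    logger.addHandler(stderr_handler)
    return logger
```

With no logfile argument everything goes to stderr, which would correspond to the pure-journald setup.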

If you don’t use systemd, then the existing behaviour is maintained with my patches.

I think it’s probably best to do both:

Restart threads, but if the daemon locks up completely (rare, but not entirely unheard of), or the thread restart itself fails (perhaps one thread exhausts a resource over time, such as file handles, starving the others), then systemd restarts the service as a whole. If internal clean-up code can fix things, then no problem: systemd doesn’t need to do anything. Ideally, in either case, some sort of automatic or semi-automatic backtrace reporting would be great (like Chrome, Firefox, LibreOffice, KDE etc. do).
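To make the two layers concrete, here is a minimal sketch (not the code from my patches) of a supervisor loop that restarts dead threads and pings the systemd watchdog. The Supervisor class and its factories argument are hypothetical names. The notification is a plain datagram sent to the socket named in $NOTIFY_SOCKET, which is the sd_notify protocol; it assumes the unit sets WatchdogSec= and something like Restart=on-failure.

```python
import os
import socket
import threading
import time


def sd_notify(message):
    """Send a state string to systemd; return False when not run under systemd."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), addr)
    return True


class Supervisor:
    """Hypothetical supervisor: restarts dead threads, then pings the watchdog."""

    def __init__(self, factories):
        # factories maps a name to a callable returning a new, unstarted Thread
        self.factories = factories
        self.threads = {}

    def check(self):
        """Start or restart every thread that is not alive; return their names."""
        restarted = []
        for name, factory in self.factories.items():
            thread = self.threads.get(name)
            if thread is None or not thread.is_alive():
                thread = factory()
                thread.daemon = True
                thread.start()  # an exception here escapes and ends run()
                self.threads[name] = thread
                restarted.append(name)
        return restarted

    def run(self, interval=5.0):
        sd_notify("READY=1")
        while True:
            self.check()
            # Only reached if the loop is still turning; if it stalls,
            # systemd sees no ping within WatchdogSec and restarts the service.
            sd_notify("WATCHDOG=1")
            time.sleep(interval)
```

If a restart attempt raises (for example because no more threads can be created), the exception ends run(), the process exits, and the Restart= setting brings the whole service back. The interval should be comfortably shorter than WatchdogSec.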

I agree that it’s best not to just plonk a sticking plaster on top of bad code, so I think the watchdog restart and/or thread restart is only half the solution without proper backtrace reporting.

On the other hand, especially for an unmonitored and potentially remote device, having proper watchdogs by default is essential, and doing so on all three levels (system-wide, per-daemon, and per-thread) helps minimise data loss in each type of failure mode.

BTW - I don’t know if the emonpi does this by default, but a system-wide watchdog can easily be enabled on a Pi by uncommenting the #RuntimeWatchdogSec= line in /etc/systemd/system.conf and setting it to 60 or similar (i.e. RuntimeWatchdogSec=60).