A big chunk of the work went into keeping the behaviour largely consistent with the existing install script. Similarly, if you call things like /etc/init.d/emonhub restart, the changes to the SysV init.d file mean that this now calls the systemd unit file (but only on systems which are using systemd).
There may well be a case or two I haven’t caught, so I’d appreciate some testing before submitting a pull request.
Commenting on @pb66’s comments from a different thread (but which are specific to this RFC)…
Use of /etc/default/emonhub is retained (but now the install script backs this up if the user has changed it, rather than overwriting their changes). I’ve changed the default to use journald, but the install script does put a message at the end of the log file saying where the logs have been moved to, and also pointing at instructions to restore the old behaviour if you want.
Currently, the init script changes ownership of the log file directory every time emonhub is started. With my sysadmin hat on, this feels wrong (as well as being awkward to express in a systemd unit file): if I change the group ownership of the log file directory so that selected users can view it, it should stay that way, not get reverted the next time the daemon starts. So I think this should be restricted to the install script. Thoughts?
If a log file is in use, and the service crashes or fails to start, then:
systemctl status emonhub
will contain no useful debug messages (this could be worked around by emitting a “look in $LOGFILE for errors” type message).
1. Default to keeping emonhub.log (but include journald instructions)
2. Set file permissions for emonhub.log etc. at install time only
3. If running under systemd, and using a logfile, then direct the user to the log file in a way that’s obvious in systemctl status.
4. If easy to do, then emit WARNING messages to both mechanisms (haven’t looked at this yet)?
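Directing the user to the log file in a way that shows up in systemctl status could be as simple as writing a one-line hint to stderr at startup, since systemd captures stderr into the journal and the status output shows the last few lines. A hypothetical sketch (the path and function name are assumptions, not emonhub’s actual code):

```python
import sys

# Assumed default path; the real value would come from /etc/default/emonhub
# or emonhub.conf.
LOGFILE = '/var/log/emonhub/emonhub.log'

def announce_logfile(logfile=LOGFILE):
    """Write a hint to stderr so that 'systemctl status emonhub'
    (which shows recent journal lines) points at the real log file
    when file logging rather than journald is in use."""
    sys.stderr.write("emonhub: detailed logs are written to %s\n" % logfile)
```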
If you don’t use systemd, then the existing behaviour is maintained with my patches.
I think it’s probably best to do both:
Restart threads, but if the daemon locks up completely (rare, but not entirely unheard of), or the thread restart otherwise fails (perhaps some thread exhausts a resource over time, like filehandles, starving the others), then systemd will restart the service as a whole. If internal clean-up code can fix things, then no problem: systemd doesn’t need to do anything. Ideally, in either case, some sort of automatic or semi-automatic backtrace reporting would be great (like Chrome, Firefox, LibreOffice, KDE etc. do).
I agree that it’s best not to just plonk a sticking plaster on top of bad code, so I think the watchdog restart and/or thread restart is only half the solution without proper backtrace reporting.
On the other hand, especially for an unmonitored, potentially remote device, defaulting to proper watchdogs is essential, and having them on all three levels (system-wide, per-daemon, and per-thread) minimises data loss in each different type of failure mode.
BTW - I don’t know if the emonpi does this by default, but a system-wide watchdog can easily be enabled on a Pi by uncommenting and setting RuntimeWatchdogSec=60 (or similar) in /etc/systemd/system.conf.
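For reference, that system-wide watchdog is a one-line config change (on a stock install the option ships commented out). With this set, PID 1 pats the hardware watchdog (the bcm2835_wdt driver on a Pi), so the board reboots if the whole system hangs:

```ini
# /etc/systemd/system.conf
[Manager]
RuntimeWatchdogSec=60
```

Run `systemctl daemon-reexec` (or reboot) afterwards for it to take effect.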
I’ve merged your changes locally to test: I’ve added the emonhub.conf entries and tried restarting emonhub with systemctl. It all seems to run fine, including the log etc.
Then I tried to trigger a crash in the MQTT interfacer by running “self.stop = True” after about 30 seconds in the MQTT interfacer. The result is that the thread is dead, but emonhub does not restart?
Have I missed a step in installing/merging your changes?
I agree it’s important to have both watchdogs and automatic restarting, as well as error tracking, so that it’s not just a sticking plaster. I think this will be a good step forwards. Thank you for your work on this.
I’ve added a couple of new settings to the [hub] section of emonhub.conf. These settings are off by default (i.e. unless otherwise defined in the config file). My changes also alter the default emonhub.conf config file to add these settings, but of course this file doesn’t get installed into existing installations, so as not to clobber users’ changes elsewhere in the file…
### If set, then enable the systemd service watchdog
systemd_watchdog_timeout_secs = 30
### If the timeout is enabled above, we're running under systemd's
### control, and this is set to 'yes', then if a thread is dead, we'll
### stop issuing watchdog heartbeats. This will cause systemd to
### restart the emonhub service (enabled by default in the emonhub
### systemd unit file).
systemd_heartbeat_require_all_threads = yes
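The mechanism behind these settings can be sketched roughly as follows — a minimal, hypothetical illustration of the systemd notify/watchdog protocol, not emonhub’s actual code (function names and structure are my own):

```python
import os
import socket
import threading

def sd_notify(message):
    """Send a notification datagram to the socket systemd names in
    NOTIFY_SOCKET. No-op (returns False) when not started by systemd
    with Type=notify."""
    addr = os.getenv('NOTIFY_SOCKET')
    if not addr:
        return False
    # A leading '@' means an abstract-namespace socket.
    if addr.startswith('@'):
        addr = '\0' + addr[1:]
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sock.sendto(message.encode('ascii'), addr)
    finally:
        sock.close()
    return True

def heartbeat(threads, require_all=True):
    """Emit WATCHDOG=1 only while the interfacer threads are healthy.
    With require_all (cf. systemd_heartbeat_require_all_threads), a
    single dead thread stops the heartbeats, so systemd's watchdog
    timeout fires and the whole service gets restarted."""
    alive = [t.is_alive() for t in threads]
    ok = all(alive) if require_all else any(alive)
    if ok:
        return sd_notify('WATCHDOG=1')
    return False
```

The main loop would call heartbeat() more often than half the configured timeout (systemd recommends notifying at WatchdogSec/2).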
No problem. I’m keen to get further feedback from Paul (@pb66) and rework the patches based on that. BTW, I don’t think they should be merged before then - in particular, I’ll see if I can come up with something better for log file handling.
What are your feelings about having some automatic checks and tests on commit (using Travis or otherwise), continuing the code cleanup, and also putting in some sort of automatic backtrace reporting?
Thanks @tim, I did add those settings in, but no luck. If you run “self.stop = True” in one of the interfacers, are you able to confirm that it crashes the thread and restarts?
Yes, happy to wait for @pb66’s input.
Regarding automatic checks and tests on commit: this is something that @glyn.hudson set up for emoncms and our other repositories. I’m all for it, but it’s not something I know as much about; the GitHub integration is nice and pain-free.
Yes, I think it would be great to include backtrace reporting. It will be vital if we are automatically restarting emonhub when it crashes. A specific local backtrace log file would be useful; then, as testers and developers, we’d need to periodically check this log file to see whether regular crash events are happening. Perhaps ultimately reporting the number of crash events to the emoncms admin interface would be useful, but maybe that’s something for a later stage?
I put a self.stop = True into src/interfacers/EmonHubMqttInterfacer.py and got this in the logs (via journalctl -f -u emonhub):
python: 2017-07-19 11:16:04,834 DEBUG MainThread Won't heartbeat - state bad
python: 2017-07-19 11:16:05,034 WARNING MainThread MQTT thread is dead
python: 2017-07-19 11:16:05,035 DEBUG MainThread Won't heartbeat - state bad
python: 2017-07-19 11:16:05,235 WARNING MainThread MQTT thread is dead
python: 2017-07-19 11:16:05,236 DEBUG MainThread Won't heartbeat - state bad
python: 2017-07-19 11:16:05,436 WARNING MainThread MQTT thread is dead
python: 2017-07-19 11:16:05,437 DEBUG MainThread Won't heartbeat - state bad
systemd: emonhub.service: Watchdog timeout (limit 30s)!
systemd: emonhub.service: Killing process 9040 (python) with signal SIGABRT.
systemd: emonhub.service: Main process exited, code=killed, status=6/ABRT
systemd: emonhub.service: Unit entered failed state.
systemd: emonhub.service: Failed with result 'watchdog'.
systemd: emonhub.service: Service hold-off time over, scheduling restart.
systemd: Stopped Open Energy Monitor emonHub.
systemd: Starting Open Energy Monitor emonHub...
python: 2017-07-19 11:16:21,020 INFO MainThread EmonHub emonHub 'emon-pi' variant v1.2
python: 2017-07-19 11:16:21,020 INFO MainThread Opening hub...
python: 2017-07-19 11:16:21,020 INFO MainThread Logging level set to DEBUG
python: 2017-07-19 11:16:21,021 INFO MainThread Creating EmonHubJeeInterfacer 'RFM2Pi'
Are you getting any related error messages via journalctl -u emonhub and/or in emonhub.log? Can you paste the output of systemctl status emonhub and systemctl cat emonhub please?
My logs seem to be missing the warning when the thread dies:
MainThread Won't heartbeat - state bad
2017-07-19 12:14:03,347 WARNING MainThread MQTT thread is dead
2017-07-19 12:14:03,548 WARNING MainThread MQTT thread is dead
2017-07-19 12:14:03,749 WARNING MainThread MQTT thread is dead
2017-07-19 12:14:03,950 WARNING MainThread MQTT thread is dead
2017-07-19 12:14:04,235 WARNING MainThread MQTT thread is dead
2017-07-19 12:14:04,436 WARNING MainThread MQTT thread is dead
2017-07-19 12:14:04,637 WARNING MainThread MQTT thread is dead
journalctl -u emonhub returns:
No journal files were found.
I must be missing a small step
In addition to setting emonhub.conf these are the steps I followed to test on my test emonpi here:
@TrystanLea: the systemd unit file won’t be installed unless you run the install script. Without it, the original init script is still used as a fallback, and the systemd notify API isn’t enabled; emonhub then doesn’t detect that API being active ( if os.getenv('NOTIFY_SOCKET'): ) and doesn’t enable the watchdog code.
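That gating could look something like this (a sketch with hypothetical names, not the actual patch):

```python
import logging
import os

def watchdog_enabled(conf_timeout_secs):
    """Enable the systemd watchdog code only when the config asks for
    it AND systemd actually started us with the notify protocol
    (Type=notify sets NOTIFY_SOCKET in our environment)."""
    if not conf_timeout_secs:
        return False
    if not os.getenv('NOTIFY_SOCKET'):
        # Started via the SysV init script (or by hand): the notify
        # socket isn't there, so heartbeats would go nowhere.
        logging.getLogger(__name__).info(
            "NOTIFY_SOCKET not set; systemd watchdog disabled")
        return False
    return True
```

So even with the emonhub.conf settings in place, the watchdog stays off until the native unit file is what actually starts the daemon.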
To verify that’s the case, run:
systemctl cat emonhub
before and after running the install script (if the correct unit file hasn’t been installed, then Debian will automatically generate one which refers to the init script, and you’ll see a line like ExecStart=/etc/init.d/emonhub start in the unit file - i.e. the backwards-compatible unit file just calls the init script instead of directly controlling the daemon).
The install script should probably stop emonhub via the init script automatically before installing the new unit file and starting it again (via the new unit file)… I’ll put that on the TODO list.
I’d previously been testing on my box, which runs Debian 9.0 Stretch; unfortunately 8.0 (Jessie) has an older systemd which is missing a feature I was using. I’ve worked around that, but I think it’s probably best to just drop the workaround and limit setting the watchdog timeout to the unit file only, to simplify the code. I haven’t done that yet, but in the meantime, I’ve pushed the workaround and fixes to GitHub - tim-seoss/emonhub at emon-pi-systemd
Hopefully that’ll work out of the box.
Still more cleanup work, and I’ll also need to squash some of the commits together. @TrystanLea if you have a minute, please could you give it another spin?