EmonHub still dying?

Since late last year, I’ve been experiencing the emonHub death that people have reported. A week ago I ran an update from the web GUI because I saw that the August 16th update was hopefully able to fix this problem. That update seemed to install cleanly and it had been running for almost a week.

Alas, emonHub died again tonight. But this time the symptoms are different. The log pane in the web GUI was not endlessly scrolling the error message that has often been reported here. Instead, no logs at all were listed in that pane; it was simply blank.

Restarting emonhub from the web GUI seems to have things going again. [sic, see below] I’m not very familiar with how the OEM stack works, but I decided to capture the process list on the machine before and after the “restart”. It seems the restart didn’t terminate any processes, but it did create a new emonhub process, along with two new kworker processes (which I assume are unrelated to the restart).

I’m just posting here to the community as an FYI.
This is the first time I’ve seen this blank-log crash, so I don’t know if it will keep happening. I’m posting this info in case someone is already looking into this and would like more data. I’d be happy to capture more data if it occurs again; just let me know what would be helpful.

Correction: Actually, restarting emonhub from the web interface did not fix everything. Yes, updates were being listed in the logs pane and the feed and input pages showed updates, but when I asked the web interface to graph recent data, it showed no data since emonhub died, even after restarting emonhub. Fortunately, rebooting the Raspberry Pi has fixed this; I’m again able to see graphs of new data.

Be well.

Unfortunately you are not the first to report a problem since the “thread is dead” issue was reported as fixed. @Dave reported an issue in the EmonPi Crash thread, but there don’t appear to have been any updates there.

If it happens again, can you attach your emonhub.log file to this thread? There should also be a log file at… well, I give in! I’ve spent some time looking for a reference to it but had to give up. I recall it being discussed, and I can see that @Dave has attached one to his post, but I failed to find where that might be.

To be fair, this was never a “fix”! The root cause has never been located or tackled. The “fix” was to stop one thread bringing down another thread when it crashed, to add a traceback to help debug crashed threads, and to add an automated thread restart (if you look at @Dave’s log it restarted many times in the duration of that one log). The intended result was that only the faulty thread would crash (first part), info on why it crashed would be reported so it could be fixed (second part), and then the thread would get restarted (third part). However, the second part doesn’t appear to work, judging by Dave’s log, so when the third part works it simply “papers over the cracks” until a major stoppage occurs, and then, without the second part, we are none the wiser.

It’s quite interesting that you say the emonhub service had stopped, as that is a new symptom; emonhub itself has never crashed out when the interfacer threads have died previously. That, combined with the fact you had to reboot (something @Dave probably also had to do, as his feedwriter service had stopped/crashed), suggests this issue has evolved to affect other services rather than having been fixed.

If you post the logs and as much info as you can, it will help debugging. (Perhaps @Dave can tell us where he found the log file.)

Hi Paul
I’ve attached my log file below which has been growing since June this year.
After the change to EmonCMS the crash is different with the emonHUB daemon being reported as down in the administration tab. (As reported here >> EmonPi Crash)
TBH the fix for me was to restart the emonHub service on crash using Node-RED and send myself a push notification.
My watchdog log file is located here > /home/pi/data/watchdog.log

watchdog.log.txt (556.9 KB)


@jasonnet thanks for posting that you have had an issue with emonhub. It’s interesting that you mention the data did not update even after restarting emonhub. I’ve had an issue with the mqtt_input script myself recently, but I’m not sure if it has any relation to your issue or @dave’s.

@dave sorry I missed your thread. How did you recover from the crash? Did you reboot the Pi completely, or just restart emonhub?

@pb66 I think it’s a little unfair to state that “this was never a ‘fix’! the root cause has never been located or tackled”. I completely rebuilt the message queue implementation so that the right operations were happening in the right threads, in the hope that this might fix the underlying issue, as you suggested I do! :slight_smile:


A restart is all that’s required to get things back up and running.


Thanks @dave, an emonhub restart?

No, a system-wide restart. I haven’t set up Node-RED to restart the emonHub service as yet.


Ok, thanks, that’s useful to know.


Looking at your logs in more detail @Dave, it looks like emonhub is still restarting for you periodically following a detected ‘thread is dead’ event marked by:

2017-06-13 11:49:43,685 INFO     MainThread Exit completed
Emonhub is down, restarting!
Last 100 lines of emonhub.log:

Which suggests that the new message queue implementation has not fixed the underlying issue. I was not aware of this; I had thought we had had some success with the new changes. Evidently there is more to uncover.

@TrystanLea - I’m not sure which bit you consider unfair; I have simply stated the facts. Yes, I did explain that the pubsub implementation was the reason emonhub was no longer thread-safe, and yes, I did describe how it was originally written and worked before you removed that in favor of the pubsub route. As far as I know your rework works, but that was only ever a fix for all the code running incorrectly in one thread. It wasn’t the reason for any thread initially crashing; it was the reason why one crashed thread resulted in multiple crashed threads.

I have said more than once that the crashing is due to unhandled exceptions, and that poor error logging is the reason we cannot trace the errors. In the early days of the “thread is dead” reports we found and remedied a couple of format issues, but it soon became clear that most if not all errors result in a “thread is dead”, which, due to the pubsub implementation, meant all the threads went down, so we didn’t even know which interfacer the fault was in.

At this point in time I have no reason to suspect your queue rework isn’t working, and if it is OK then that fixes the threading issue, but it doesn’t address or fix the root cause or trigger of the crashes. That’s why I wanted the tracebacks, but they do not appear to work (based purely on the fact I have not yet seen any new tracebacks in the logs). Dave’s first watchdog.log should have had over 20 tracebacks in it: there are 39 stoppages recorded, 11 of which are legitimate restarts (ideally they shouldn’t have triggered a watchdog log), but I couldn’t see any tracebacks or clues as to why the threads are continuously restarting behind the scenes.

I believe there is no single issue at the root of the threads dying; it’s generally just a lack of error handling, which means pretty much any fault can cause a crash. We must have proper error handling and reporting to eliminate these errors. The traceback is far more important than the thread restarting: with good information we can improve the code and move forward; with no reporting (and auto-restarting) there is no route forward.

For the record, I personally do not like Python decorators and I would not have implemented the tracebacks the way Ben suggested, but I did not speak up at the time, as my personal dislike of decorators wasn’t a firm reason for not using them and I was far too busy to rewrite it myself. I cannot comment on how or whether those decorators are supposed to work, but the lack of tracebacks suggests they are not doing what we needed, for one reason or another.

While we are at it, the way the thread restarts are implemented isn’t right either. It’s a small point, but it’s a bit brutal to reload all the settings to restart one interfacer when we have the interfacer name; just that one interfacer should be rebuilt.

I thought those were cases where the watchdog.log was being triggered by a legitimate restart, e.g. an emonPi update. But I have to say I have not yet looked closely at how the code works.

@pb66 fair point. Yes, the intention was to couple the auto-restart with tracebacks; I will try to look at this again.

@Dave @jasonnet next time you see this happen (where you have to go for a full system restart), could you try restarting the mqtt_input service first, just to rule out that option? The command is:

sudo service mqtt_input restart

I have had an issue with this script that has been causing inputs not to update, but I have not got to the bottom of it yet. It may be completely unrelated, and my system is running a mixture of development code as well.

I just did another test of the thread exception catching I put in before. I created a ‘bad’ interfacer class which randomly throws exceptions, then created four of these interfacers and set them going. You can see in the attached log that the exceptions from the different threads are being caught and written to the log. bad.log.txt (25.0 KB)
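For anyone curious what such a test harness could look like, here is a minimal sketch. The `BadInterfacer` name, the fault rate, and the structure are all illustrative assumptions on my part, not the actual emonhub test code:

```python
import logging
import random
import threading
import traceback

class BadInterfacer(threading.Thread):
    """A test interfacer whose run loop randomly raises, used to
    exercise per-thread exception logging. This is a hypothetical
    stand-in, not the actual emonhub interfacer class."""

    def __init__(self, name):
        super().__init__(name=name)
        self.caught = 0  # count of exceptions caught and logged

    def run(self):
        for _ in range(50):
            try:
                if random.random() < 0.2:
                    raise RuntimeError("simulated interfacer fault")
            except RuntimeError:
                # Log with a full traceback, as the patch is intended to do
                self.caught += 1
                logging.warning("Exception caught in %s thread. %s",
                                self.name, traceback.format_exc())
```

Running four of these and checking the log would show tracebacks attributed to each named thread, which is the behaviour Ben describes above.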

However, as Paul has stated many times, if the thread itself is not handling errors correctly then it could die without raising an exception. In that case there is nothing the main process can do except monitor for the thread dying, which is done already.

With regard to the two cases mentioned, to me (and I am no expert on this) it looks like these are from the older version of the scripts, as the “…thread is dead.” message is not accompanied by an “Attempting to restart thread…”. There are no instances of “thread is dead” after the patch that I can see there. This doesn’t mean that the problems they are seeing aren’t caused by threads, just that there are no unhandled thread exceptions (so to speak).

@pb66 I have some sympathy with your dislike of decorators. I don’t know why I used one; maybe I was just in a bad mood that day? They can definitely pose issues for thread safety if you do not reason about them correctly, but I think in this case it is OK. You could just put the try…except within EmonHubInterfacer.run… I have no strong feeling.

Re the thread restart procedure, I did acknowledge in the previous thread that what I had done was quite hackish. I was trying not to re-implement the functionality in EmonHubInterfacer._update_settings… and the easiest way to do that seemed to be just to call it again! Certainly an area for improvement.


I understood this to have already been implemented and tested

You must disable the auto-restart, at least in a separate “debugging” branch, so that faults are easily noticed and not swept under the carpet. Based on the traceback content we can (hopefully?) quickly add minor fixes to improve or implement sanitation checks, so that the released version is not dependent on the restarts for normal running.

Another option, in lieu of disabling the auto-restart for debugging, is to add an email trigger so that rather than (re)logging the last 100 log entries elsewhere, it emails the last 100 entries to the user. This means we can return to a single logfile, and there is also an alert to the user so that faults get reported and dealt with at the time.

It is not helpful to repeatedly send all this data silently to a second logfile and then try to debug it some months later when things eventually grind to a halt, especially as you will now undoubtedly focus on preventing a condition arising that requires a reboot. Plus there is no better time to tackle an issue than when it happens, as the user can give further info and context.

Sending an email in Python is relatively straightforward, and once implemented it can be used for other things. I would not want to depend on Node-RED or any other external services.
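A minimal sketch of what that could look like with the standard library. The addresses, subject line, and the assumption of a local MTA on port 25 are all placeholders, not a proposal for the actual implementation:

```python
import smtplib
from email.message import EmailMessage

def build_crash_alert(log_lines, sender="emonhub@localhost",
                      recipient="user@example.com"):
    """Assemble an email carrying the last 100 log lines from a
    crashed thread. Addresses here are placeholders."""
    msg = EmailMessage()
    msg["Subject"] = "emonhub: thread crash detected"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(log_lines[-100:]))
    return msg

def send_crash_alert(msg, host="localhost", port=25):
    """Hand the message to a local SMTP server (assumes an MTA
    or smarthost is configured on the Pi)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)
```

The build step is separable from the send step, so the same alert text could equally be pushed over MQTT or a webhook if email proved awkward.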

As for the tracebacks, it appears the @log_exceptions_from_class_method decorator wraps the run function with this code
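The decorator body itself isn’t quoted above, so as a rough sketch of the general pattern only (an assumption on my part, not the actual emonhub implementation), a logging decorator of that name would typically look something like:

```python
import functools
import logging
import traceback

def log_exceptions_from_class_method(func):
    """Wrap a method so any uncaught exception is logged with a
    full traceback instead of silently killing the thread.
    (A sketch of the pattern; the real decorator may differ.)"""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        try:
            return func(self, *args, **kwargs)
        except Exception:
            logging.getLogger(__name__).warning(
                "Exception caught in %s thread. %s",
                getattr(self, "name", "?"), traceback.format_exc())
    return wrapper
```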

Personally, I would have moved the content of run to another function, e.g. _run, and called that new function from within a try/except in run, e.g. something along the lines of

def run(self):
    try:
        self._run()
    except Exception:
        if self.stop:
            self._log.debug("Stopping " + self.name + " interfacer")
        else:
            self._log.warning("Exception caught in " + self.name + " thread. " + traceback.format_exc())

Using self.stop in the except block would hopefully filter the crashes from the controlled stops. I haven’t tested this; it is just a thought at this point. I am no longer familiar enough with the emonhub code to know whether this would definitely work.

Hi Ben, we posted at the same time so I only just saw your post after posting mine. I’ll comment again after reading it properly.

Sure no problem. I agree with your proposal.

You did indeed, and I’m sorry I didn’t have time back then to engage, but as I said above it was a fairly minor point, and as it wasn’t the only occurrence, it seemed quite petty to comment on this and not the bigger points.

I do not recall the particulars of the previous concept by @TrystanLea, so I don’t recognize the log entries; I had assumed this was all new, as @TrystanLea had walked @Dave through the transition. Have you tried a clean clone, or are you working from a work in progress? It is unlikely @jasonnet is using anything but a clean version, but we haven’t seen his logs yet.

Perhaps the exception logging is working? I based my suspicion that it wasn’t purely on the fact I haven’t seen one yet, but your logs do look exactly as I would hope.

[EDIT] Actually, having just done some searching on this, I think your decorator and my proposed try/except will possibly work pretty much the same way (although the try/except is arguably easier to follow). However, I don’t think this is how I had it set up when I tested previously. I cannot find that code at the moment but I will look further into it when I can (I must get back to some proper work right now).

Yes, my test was done on a clean clone of the emon-pi branch of GitHub - openenergymonitor/emonhub: Python service linking and decoding input to MQTT & Emoncms.

I think the way forward is:
(i) Get logs which are clearly post-‘fix’. In the above two cases it is not very clear to me what the issue is, as a lot of the log is from before the patch. It would be useful to get watchdog to report the version/commit id being used as well (where is this watchdog code?).
(ii) Implement your patch above to ensure we are not missing the case of an exception raised alongside EmonHubInterfacer.stop.
(iii) Implement some telemetry for crashes, as you suggest. I am not sure email is the best method, as we could create a lot of server load pain; it would also probably need to be filtered before it reaches the server.

FWIW, I’ve not seen the problem repeat since my earlier posting (but that was just a few days ago). I just looked for all files below ~/pi/ with “log” in their name, and it doesn’t look like a log file is left after the problem, the emonhub restart, and the reboot.

If this effort would benefit from some help from me, let me know. E.g. I have a spare, more recent emon-base available that I can use for testing, or we can arrange to back up my current (apparently failing) emon-base and try some experiments.

Thanks Jason. Could you confirm when you last ran the emonPi/emonSD update, and attach your current emonhub.log (found at /var/log/emonhub/emonhub.log)? Note that logs are held in RAM, so rebooting will wipe them; if it happens again, try to get the logfile before rebooting.