EmonHub still dying?

Ok, thanks, that’s useful to know.

Trystan

Looking at your logs in more detail @Dave, it looks like emonhub is still restarting for you periodically following a detected ‘thread is dead’ event marked by:

2017-06-13 11:49:43,685 INFO     MainThread Exit completed
Emonhub is down, restarting!
Last 100 lines of emonhub.log:

This suggests that the new message queue implementation has not fixed the underlying issue. I was not aware of this; I had thought we had had some success with the new changes, so there is clearly more to uncover.

@TrystanLea - I’m not sure which bit you would consider unfair, I have simply explained the facts. Yes, I did explain that the pubsub implementation was the reason emonhub was no longer thread safe, and yes, I did describe how it was originally written and worked before you removed that in favor of the pubsub route. As far as I know your rework works, but that was only ever a fix for all the code incorrectly running in one thread. It wasn’t the reason for any thread initially crashing, but it was the reason why one crashed thread resulted in multiple crashed threads.

I have said more than once that the crashing is due to unhandled exceptions, and that poor error logging is the reason we cannot trace the errors. In the early days of the ‘thread is dead’ reports we found and remedied a couple of format issues, but it soon became clear that most if not all errors result in a ‘thread is dead’, which due to the pubsub implementation meant all the threads went down, so we didn’t even know which interfacer the fault was in.

At this point in time I have no reason to suspect your queue rework isn’t working, and if it is ok then that fixes the threading issue, but it doesn’t address or fix the root cause or trigger of the crashes. That’s why I wanted the tracebacks, but they do not appear to be working (based purely on the fact that I have not yet seen any new tracebacks in the logs). Dave’s first watchdog.log should have had over 20 tracebacks in it: there are 39 stoppages recorded, 11 of which are legitimate restarts (ideally they shouldn’t have triggered a watchdog log), but I couldn’t see any tracebacks or clues as to why the threads are continuously restarting behind the scenes.

I believe there is no single issue at the root of the threads dying; it’s generally just a lack of error handling that means pretty much any fault can cause a crash. We must have proper error handling and reporting to eliminate these errors. The traceback is far more important than the thread restarting: with good information we can improve the code and move forward, with no reporting (and auto-restarting) there is no route forward.

For the record, I personally do not like Python decorators and I would not have implemented tracebacks the way Ben suggested, but I did not speak up at the time as my personal dislike of decorators wasn’t a firm reason for not using them and I was far too busy to rewrite it myself. I cannot comment on how or if those decorators are supposed to work, but the lack of tracebacks suggests they are not doing what we needed, for one reason or another.

While we are at it, the way that the thread restarts are implemented isn’t right either. It’s a small point, but it’s a bit brutal to reload all the settings to restart one interfacer when we already have the interfacer name; just that one interfacer should be rebuilt (something like the sketch below).
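
A generic sketch of that idea (illustrative names only, not emonhub’s actual internals): keep a registry of interfacer threads and rebuild just the one that has died, from settings already held in memory, rather than re-reading and re-applying the whole of emonhub.conf.

import threading
import time

# illustrative stand-ins for emonhub's real settings and interfacer registry
interfacer_settings = {"MQTT": {"interval": 1.0}, "emoncmsorg": {"interval": 1.0}}
interfacers = {}

class DummyInterfacer(threading.Thread):
    def __init__(self, name, settings):
        super(DummyInterfacer, self).__init__(name=name)
        self.settings = settings
        self.stop = False

    def run(self):
        while not self.stop:
            time.sleep(self.settings["interval"])

def restart_interfacer(name):
    old = interfacers.get(name)
    if old is not None and old.is_alive():
        old.stop = True
        old.join()
    # a finished Thread object cannot be restarted, so build a fresh one
    new = DummyInterfacer(name, interfacer_settings[name])
    interfacers[name] = new
    new.start()

for n in interfacer_settings:
    restart_interfacer(n)   # the same path serves both first start and restart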

I thought those were cases where the watchdog.log was being triggered by a legitimate restart, e.g. an emonpi update. But I have to say I have not looked closely at how the code works yet.

@pb66 fair point. Yes, the intention was to couple auto-restart with tracebacks; I will try to look at this again.

@Dave @jasonnet next time you see this happen (where you have to go for a full system restart), could you try restarting the mqtt_input service first, just to rule that out? The command is:

sudo service mqtt_input restart

I have had an issue with this script that has been causing inputs to not update, but I have not got to the bottom of it yet. It may be completely unrelated, and my system is running a mixture of development code as well.

I just did another test on the thread exception catching I put in before. I created a ‘bad’ interfacer class which randomly throws exceptions. I then created 4 of these interfacers and set them going. You can see in the attached log that it seems to be catching the exceptions from the different threads and writing them to the log. bad.log.txt (25.0 KB)
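
For anyone wanting to reproduce that kind of test outside emonhub, here is a self-contained approximation (not the actual test code behind bad.log.txt, and the names are illustrative): four threads whose pretend read loops fail at random, with a wrapper that logs each traceback instead of letting it vanish.

import logging
import random
import threading
import time
import traceback

logging.basicConfig(filename="bad.log", level=logging.DEBUG,
                    format="%(asctime)s %(levelname)-8s %(threadName)s %(message)s")
log = logging.getLogger("bad-test")

def bad_interfacer():
    # pretend read loop that fails at random, like the 'bad' interfacer class
    while True:
        time.sleep(random.uniform(0.1, 1.0))
        if random.random() < 0.2:
            raise random.choice([ValueError, KeyError, ZeroDivisionError])("deliberate failure")

def logged(target):
    # keep the thread alive and log every failure, so the test produces a
    # stream of tracebacks rather than one per thread
    while True:
        try:
            target()
        except Exception:
            log.warning("Exception caught in %s thread. %s",
                        threading.current_thread().name, traceback.format_exc())

threads = [threading.Thread(name="bad%d" % i, target=logged, args=(bad_interfacer,))
           for i in range(4)]
for t in threads:
    t.start()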

However, as Paul has stated many times, if the thread itself is not handling errors correctly then it could die without raising an exception. In that case there is nothing you can do about this in the main process except monitor for the thread dying, which is done already.
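
As a minimal illustration of that kind of liveness check (a generic sketch, not emonhub’s actual watchdog code):

import time

def monitor(threads, interval=10):
    # a dead thread can only be detected and reported here; it cannot be
    # revived, it has to be rebuilt and started again
    while True:
        for t in threads:
            if not t.is_alive():
                print("WARNING %s thread is dead" % t.name)
        time.sleep(interval)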

With regards to the two cases mentioned, to me (and I am no expert on this) it looks like these are the older versions of the scripts, as the “…thread is dead.” message is not accompanied by an “Attempting to restart thread …”. There are no instances of “thread is dead” after the patch that I can see there. This doesn’t mean that the problems they are seeing aren’t caused by threads, just that there are no thread exceptions which weren’t handled (so to speak).

@pb66 I have some sympathy with your dislike of decorators. I don’t know why I used one; maybe I was just in a bad mood that day? They can definitely pose issues for thread safety if you do not reason about them correctly, but I think in this case it is ok. You could just put the ‘try…except’ within EmonHubInterfacer.run … I have no strong feeling.

RE the thread restart procedure, I did acknowledge in the previous thread that what I had done was quite hack-ish. I was trying not to re-implement the functionality in EmonHubInterfacer._update_settings … and the easiest way to do that seemed to be just to call that again! Certainly an area for improvement.

I understood this to have already been implemented and tested

You must disable auto-restart, at the very least in a separate “debugging” branch, so that faults are easily noticed and not swept under the carpet. Based on the traceback content we can (hopefully?) quickly add minor fixes to improve/implement sanitation checks, so that the released version is not dependent on the restarts for normal running.

Another option, in lieu of disabling the auto-restart for debugging, is to add an email trigger so that rather than (re)logging the last 100 log entries elsewhere, it emails the last 100 entries to the user. That way we can return to a single logfile, and there is also an alert to the user so that faults get reported and dealt with at the time.

It is not helpful to repeatedly send all this data silently to a 2nd logfile and then try and debug it some months later when things eventually grind to a halt, especially as you will now undoubtedly focus on preventing a condition arising that requires a reboot. Plus, there is no better time to tackle an issue than when it happens, as the user can give further info and context.

Sending an email in Python is relatively straightforward and, once implemented, can be used for other things; I would not want to depend on nodered or any other external services.
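
A minimal sketch of that idea, using only the standard library and emailing the last 100 lines of emonhub.log to the user. The SMTP server, addresses and log path here are placeholders and would need to come from emonhub.conf in practice.

import smtplib
from email.mime.text import MIMEText

def email_last_lines(logfile="/var/log/emonhub/emonhub.log", lines=100,
                     sender="emonhub@localhost", recipient="user@example.com",
                     smtp_host="localhost"):
    # grab the tail of the log and send it as a plain-text email
    with open(logfile) as f:
        tail = "".join(f.readlines()[-lines:])

    msg = MIMEText(tail)
    msg["Subject"] = "emonhub fault: last %d log lines" % lines
    msg["From"] = sender
    msg["To"] = recipient

    s = smtplib.SMTP(smtp_host)
    s.sendmail(sender, [recipient], msg.as_string())
    s.quit()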

As for the tracebacks, it appears the @log_exceptions_from_class_method decorator wraps the run function with this code:
https://github.com/openenergymonitor/emonhub/blob/emon-pi/src/interfacers/emonhub_interfacer.py#L34-L42
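
For anyone not following the link, a decorator doing that job is typically of this shape (an illustrative sketch only, not necessarily the exact code in emonhub_interfacer.py):

import functools
import traceback

def log_exceptions_from_class_method(func):
    # wrap a method so any exception it raises is written to the instance's
    # logger rather than silently killing the thread it runs in
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        try:
            return func(self, *args, **kwargs)
        except Exception:
            self._log.warning("Exception caught in " + self.name + " thread. "
                              + traceback.format_exc())
    return wrapper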

Personally I would have moved the content of run to another function, e.g. _run, and called that new function from within a try/except in run, e.g. something along the lines of:

def run(self):
    try:
        self._run()
    except:
        if self.stop:
            # stop was requested, so treat this as a controlled shutdown
            self._log.debug("Stopping " + self.name + " interfacer")
        else:
            # a genuine crash - record the full traceback (needs "import traceback")
            self._log.warning("Exception caught in " + self.name + " thread. " + traceback.format_exc())

Using self.stop in the except block would hopefully filter the crashes from the controlled stops. I haven’t tested this; it is just a thought at this point, and I am no longer familiar enough with the emonhub code to know if it would definitely work.

Hi Ben, we posted at the same time, so I only just saw your post after posting mine; I’ll comment again after reading it properly.

Sure no problem. I agree with your proposal.

You did indeed, and I’m sorry I didn’t have time back then to engage, but as I said above it was a fairly minor point, and as it wasn’t the only occurrence it seemed quite petty to comment on this and not the bigger points.

I do not recall the particulars of the previous concept by @TrystanLea, so I don’t recognize the log entries; I had assumed this was all new, as @TrystanLea had walked @Dave through the transition. Have you tried a clean clone, or are you working from a work in progress? It is unlikely @jasonnet is using anything but a clean version, but we haven’t seen his logs yet.

Perhaps the exception logging is working? I have based my suspicion that it wasn’t working only on the fact that I haven’t seen one yet. But your logs do look exactly as I would hope.

[EDIT] Actually, having just done some searching on this, I think your decorator and my proposed try/except will possibly work in pretty much the same way (although the try/except is arguably easier to follow). However, I don’t think this is how I had it set up when I tested previously; I cannot find that code at the moment, but I will look into it further when I can (I must get back to some proper work right now).

Yes, my test was done on a clean clone of the emon-pi branch of https://github.com/openenergymonitor/emonhub.

I think the way forward is:
(i) Get logs which are clearly post-‘fix’. It is not clear to me from the above two cases what the issue is, as a lot of the log is from before the patch. It would also be useful to get the watchdog to report the version/commit id being used (where is this watchdog code? A sketch of how that reporting might be done follows below).
(ii) Implement your patch above to ensure we are not missing the case where an exception is raised alongside EmonHubInterfacer.stop.
(iii) Implement some telemetry for crashes as you suggest. I am not sure email is the best method, as we could create a lot of server load; it would also probably need to be filtered before it gets to the server.
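
As mentioned in (i), here is a rough idea of how a watchdog script could record the running commit (a hypothetical helper, not existing watchdog code; the repo path is an assumption):

import subprocess

def emonhub_commit(repo="/home/pi/emonhub"):
    # ask git for the short commit id of the checked-out emonhub
    try:
        out = subprocess.check_output(["git", "-C", repo, "rev-parse", "--short", "HEAD"])
        return out.decode().strip()
    except Exception:
        return "unknown"

print("emonhub commit: " + emonhub_commit())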

FWIW I’ve not seen the problem repeat since my earlier posting (but that was just a few days ago). I just looked for all files below ~/pi/ that have log in their name and it doesn’t look like there is a log file left after the problem, emon restart, and reboot.

If this effort would benefit from some help from me, let me know. E.g. I do have a spare, more recent emonBase available that I can use for testing. Or we can arrange to back up my current (apparently failing) emonBase and try some experiments.

Thanks Jason. Could you confirm when you last ran the emonPi/emonSD update and attach your current emonhub.log (found at /var/log/emonhub/emonhub.log)? Note that logs are held in RAM, so rebooting will wipe them; if it happens again, try to get the logfile before rebooting.

Hi @pb66.
I believe I ran the last update (from the web gui) at Oct 19 04:06 UTC.

That log directory looks like:

total 11592
-rw-r--r-- 1 emonhub emonhub 5119924 Oct 25 17:16 emonhub.log.2
-rw-r--r-- 1 emonhub emonhub 5119878 Oct 25 17:31 emonhub.log.1
-rw-r--r-- 1 emonhub emonhub 1628949 Oct 25 17:31 emonhub.log

As you can see, this doesn’t go very far back in time. And oddly, those last two files have the same date and time. If I wait a minute or two, they both seem to have their mod time updated. If I tail them, it looks like they are both receiving content. Perhaps that’s something you expect, but it seems unusual to me. Here is the tail of those last two files:

pi@emonpi(ro):emonhub$ tail -n 10 emonhub.log.1 emonhub.log    
==> emonhub.log.1 <==
2017-10-25 17:36:21,307 DEBUG    emoncmsorg 144282 adding frame to buffer => [1508860343.774333, 7, 0, 0, 25, 372, 122.99000000000001, 300, 300, 300, 300, 300, 300, 1, -59]
2017-10-25 17:36:21,307 DEBUG    emoncmsorg 144283 adding frame to buffer => [1508860345.086839, 9, 0, 3, 9, 42, 123.13000000000001, 300, 300, 300, 300, 300, 300, 1, 0, -58]
2017-10-25 17:36:21,308 DEBUG    emoncmsorg 144284 adding frame to buffer => [1508860346.106141, 8, 706, 698, 69, 378, 121.97, 300, 300, 300, 300, 300, 300, 1, -37]
2017-10-25 17:36:21,308 DEBUG    emoncmsorg 144285 adding frame to buffer => [1508860347.328397, 10, 69, 367, 19, 0, 120.17, 22.700000000000003, 300, 300, 300, 300, 300, 1, -46]
2017-10-25 17:36:21,309 DEBUG    emoncmsorg 144286 adding frame to buffer => [1508860347.949284, 9, 1, 3, 10, 43, 123.23, 300, 300, 300, 300, 300, 300, 1, 0, -59]
2017-10-25 17:36:21,309 DEBUG    emoncmsorg 144287 adding frame to buffer => [1508860348.677803, 7, 0, 0, 25, 372, 122.88, 300, 300, 300, 300, 300, 300, 1, -57]
2017-10-25 17:36:21,310 DEBUG    emoncmsorg 144288 adding frame to buffer => [1508860348.788598, 26, 20, 19.200000000000003, 76.9, 2.7, 1, -70]
2017-10-25 17:36:21,310 DEBUG    emoncmsorg 144289 adding frame to buffer => [1508860348.949267, 8, 683, 682, 69, 378, 121.92, 300, 300, 300, 300, 300, 300, 1, -37]
2017-10-25 17:36:21,311 DEBUG    emoncmsorg 144290 adding frame to buffer => [1508860350.476834, 22, 20, 0, 59.1, 2.6, -57]
2017-10-25 17:36:21,311 DEBUG    emoncmsorg 144291 adding frame to buffer => [1508860350.89303, 9, 0, 3, 9, 44, 123.3, 300, 300, 300, 300, 300, 300, 1, 0, -57]

==> emonhub.log <==
2017-10-25 17:36:30,887 DEBUG    emoncmsorg 169117 adding frame to buffer => [1508883376.274161, 9, 0, 2, 9, 0, 122.56, 300, 300, 300, 300, 300, 300, 1, 0, -57]
2017-10-25 17:36:30,888 DEBUG    emoncmsorg 169118 adding frame to buffer => [1508883377.106166, 26, 20, 19.3, 77.80000000000001, 2.7, 1, -66]
2017-10-25 17:36:30,888 DEBUG    MQTT       Publishing: emon/TX07SubP/temp4 300
2017-10-25 17:36:30,888 DEBUG    emoncmsorg 169119 adding frame to buffer => [1508883377.621924, 8, 7, 5, 25, 362, 121.81, 300, 300, 300, 300, 300, 300, 1, -37]
2017-10-25 17:36:30,889 DEBUG    emoncmsorg 169120 adding frame to buffer => [1508883379.24569, 9, 1, 3, 9, 0, 122.98, 300, 300, 300, 300, 300, 300, 1, 0, -60]
2017-10-25 17:36:30,890 DEBUG    MQTT       Publishing: emon/TX07SubP/temp5 300
2017-10-25 17:36:30,890 DEBUG    emoncmsorg 169121 adding frame to buffer => [1508883379.662594, 10, 26, 354, 20, 32, 120.67, 19.5, 300, 300, 300, 300, 300, 1, -45]
2017-10-25 17:36:30,891 DEBUG    MQTT       Publishing: emon/TX07SubP/temp6 300
2017-10-25 17:36:30,892 DEBUG    emoncmsorg 169122 adding frame to buffer => [1508883380.382204, 7, 0, 0, 26, 355, 122.99000000000001, 300, 300, 300, 300, 300, 300, 1, -58]
2017-10-25 17:36:30,893 DEBUG    emoncmsorg 169123 adding frame to buffer => [1508883380.496704, 8, 7, 4, 25, 362, 121.92, 300, 300, 300, 300, 300, 300, 1, -37]
pi@emonpi(ro):emonhub$ 

I hope this helps.

I’ll try to promptly capture these logs if this reoccurs.

I will have to take a closer look at a later date; I don’t know how the logrotate is configured and need to read up or do some tests. I would guess the same time and date is due to that being when the file was rotated out.

Of greater concern would be if more than one file was being written to. But looking at the file sizes, it might be that emonhub is rotating the files out every few minutes, which keeps the times recent, so it just looks like both files are being written to.

My experience with logrotate suggests that if one process has a file handle to the log file open when the rotate occurs, it ends up writing to the .1 file until it (eventually?) releases that open file handle and reconnects to the (new) log file. The newer log is updated because another thread or process opened the file after the rotate. I’m not saying that’s what happened, but in my past experience, that’s typically the (rare) cause.
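
A tiny Linux-only demonstration of that stale-handle behaviour (a standalone script, nothing to do with emonhub itself): a writer holding an open handle keeps writing to the renamed file after a “rotation”.

import os

log = open("demo.log", "w")
log.write("written before rotation\n")
log.flush()

os.rename("demo.log", "demo.log.1")    # what logrotate does to the old file
open("demo.log", "w").close()          # a new empty log file takes its place

log.write("written after rotation\n")  # still goes to demo.log.1 via the old inode
log.close()

print(open("demo.log").read())         # empty
print(open("demo.log.1").read())       # both lines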

If it’s unexpected that two entities would be logging directly to the file like this, let me know what to look for to find those two.

J.

It happened again. The log pane is again black. I’ve included with this posting three attachments. The first contains a variety of observations and questions. The latter two are the non-empty log files.

I’ve not taken any corrective action (no restarts, no reboots), so I can look around further as you direct me to.

emonhub_problem_info.txt (18.9 KB)
emonhub.log.1.txt (4.6 MB)
emonhub.log.2.txt (4.9 MB)

I have one other comment to add here that was not in the log file. If it’s logging so rapidly that the emonhub.log file is growing 25MB every minute and logrotate runs every 60 minutes, perhaps the disk did fill up, causing unrecoverable problems for emonhub, and the subsequent logrotate (at 17 min past the hour) removed a very large emonhub.log.2 file, removing the evidence that the /var/log disk had filled up.

I hope this helps.

J.

Hello @jasonnet

Could you try updating emonhub? We committed a fix for the log file issue yesterday evening, see here for more detail: emonTxV3 stops and requires reboot (emonhub) - #23 by pb66

You can update emonhub to the latest version with a standard emonPi update.

Yes, a very brief look at the logs confirms it is the same issue with runaway logging seen in the emonTxV3 stops and requires reboot thread. Please update as @TrystanLea suggests.

Even without logrotate, emonhub has inbuilt log file rotation at 5 MB, so a log file cannot grow much beyond 5 MB x2; but with the current logrotate settings allowing up to 3 files, you could theoretically end up with up to 3x 5 MB files if the rotations collide a certain way.
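
For reference, that inbuilt behaviour is what Python’s standard RotatingFileHandler provides. Whether emonhub configures it exactly like this is an assumption, but the effect described above (a current file plus one ~5 MB backup) looks like:

import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger("EmonHub")
handler = RotatingFileHandler("/var/log/emonhub/emonhub.log",
                              maxBytes=5 * 1024 * 1024,  # roll over at ~5 MB
                              backupCount=1)             # keep a single .1 backup
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)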

However, it might be possible that the data being buffered in RAM could begin to choke the system if left long enough. The bugfix prevents data being buffered in the emoncmsorg interfacer when senddata = 0 is set in emonhub.conf, so if the emoncmsorg interfacer isn’t being used or fully configured, senddata = 0 should be set so it doesn’t get bloated with unsent data.
