EmonPi needs periodic restarting

Csaba_Zagoni · 24 April 2017 12:46

Hi All,

I have a problem with an emonpi / emontx configuration that I set up several months ago for an NGO I work for as a pilot sub-metering project. Our aim would be to extend this approach to the whole building bearing in mind the limitations of the number of TXs connecting to the pi.

Basically, the system works fine for a number of days (usually between 1-3 weeks) and then the data stops reaching emoncms.

This is the status of emonhub and mqtt when the system is down:

When looking at the logs, they show that the MQTT thread is dead.

I’ve read in other topics that the solution to this problem is a firmware update, which I carried out; however, the issue persists. This is the server info I have:

After I restart the emonhub service everything seems to be back to normal:

But obviously I’d like to avoid having to do this manually.

The setup is one emonPi sitting on my desk, while three emonTXs are in the distribution board of the NGO’s warehouse. All hardware and software is stock, unmodified apart from the emonpi’s firmware update which I carried out according to your guide.

The setup monitors three phases in a warehouse which has a small server room fed by UPS C and a Yurt outside that has heating - this is my dashboard:

You can see that the last time it stopped was around 0100 19/04/17 and it’d been running since 05/04/17. After the restart the dashboard is getting data again.

If you could guide me towards a solution it would be much appreciated. Also, would someone check please that my firmware update succeeded as necessary (based on the server info posted)?

thanks a lot,
csaba

Dave · 25 April 2017 14:08

Yes, I have this problem too and have never managed to resolve it. I use a Node-RED flow as a watchdog that restarts the emonhub service and sends a push notification to my phone via Pushover.
Occasionally the Pi will crash and a reboot through SSH is required.

[{"id":"6aa2920b.c3bf1c","type":"pushover","z":"a04f2145.9774e","name":"Pushover","device":"","title":"Node-RED","priority":0,"sound":"pushover","url":"","url_title":"","x":701.6666259765625,"y":112.72606658935547,"wires":[]},{"id":"6537f6d3.5367a","type":"mqtt in","z":"a04f2145.9774e","name":"Watchdog","topic":"emon/emonpi/TotalGeneration","qos":"2","broker":"a06667e4.71e22","x":161.05929565429688,"y":184.6666717529297,"wires":[["ff9a4262.3eeff8","49eed869.a4cde"]]},{"id":"ff9a4262.3eeff8","type":"trigger","z":"a04f2145.9774e","op1":"","op2":"The Loft EmonPi has stopped updating!","op1type":"nul","op2type":"str","duration":"60","extend":true,"units":"s","reset":"","name":"Wait 60 Secs","x":337.5592803955078,"y":167.66667938232422,"wires":[["34fb4e6a.391dfa","656d6ec9.ca9e68"]]},{"id":"34fb4e6a.391dfa","type":"delay","z":"a04f2145.9774e","name":"5 Sec Delay","pauseType":"delay","timeout":"5","timeoutUnits":"seconds","rate":"1","rateUnits":"second","randomFirst":"1","randomLast":"5","randomUnits":"seconds","drop":false,"x":336.30926513671875,"y":209.1666717529297,"wires":[["ff9a4262.3eeff8"]]},{"id":"656d6ec9.ca9e68","type":"delay","z":"a04f2145.9774e","name":"Limit Messages","pauseType":"rate","timeout":"5","timeoutUnits":"seconds","rate":"1","rateUnits":"minute","randomFirst":"1","randomLast":"5","randomUnits":"seconds","drop":true,"x":520.8449401855469,"y":166.70237731933594,"wires":[["6aa2920b.c3bf1c","41694f41.85cbf"]]},{"id":"204a4fc.3e4a63","type":"comment","z":"a04f2145.9774e","name":"Emoncms Watchdog","info":"","x":280.55926513671875,"y":96.66666412353516,"wires":[]},{"id":"41694f41.85cbf","type":"exec","z":"a04f2145.9774e","command":"sudo service emonhub restart","addpay":true,"append":"","useSpawn":"","name":"Reset emonHub","x":721.8925476074219,"y":171.6072235107422,"wires":[[],[],[]]},{"id":"49eed869.a4cde","type":"trigger","z":"a04f2145.9774e","op1":"","op2":"The Loft EmonPi has crashed!","op1type":"nul","op2type":"str","duration":"5","extend":true,"units":"min","reset":"","name":"Wait 5 Mins","x":515.3450622558594,"y":202.2380828857422,"wires":[["a4c3e255.cceea","6aa2920b.c3bf1c","97b74d10.d3c178"]]},{"id":"a4c3e255.cceea","type":"delay","z":"a04f2145.9774e","name":"5 Sec Delay","pauseType":"delay","timeout":"5","timeoutUnits":"seconds","rate":"1","rateUnits":"second","randomFirst":"1","randomLast":"5","randomUnits":"seconds","drop":false,"x":515.0236206054688,"y":237.1310272216797,"wires":[["49eed869.a4cde"]]},{"id":"97b74d10.d3c178","type":"exec","z":"a04f2145.9774e","command":"sudo reboot","addpay":true,"append":"","useSpawn":"","timer":"","name":"Reboot","x":701.475830078125,"y":220.9882049560547,"wires":[[],[],[]]},{"id":"a06667e4.71e22","type":"mqtt-broker","z":"","broker":"127.0.0.1","port":"1883","clientid":"","usetls":false,"compatmode":true,"keepalive":"15","cleansession":true,"willTopic":"","willQos":"0","willPayload":"","birthTopic":"","birthQos":"0","birthPayload":""}]

Regards
Dave

Jon · 26 April 2017 16:26

Hi Dave - Concerning the node-red flow:

When my emonPi runs a CPU intense script, I sometimes see a MQTT broker disconnect.

Apr 25 20:32:10 emonpi Node-RED[457]: 25 Apr 20:32:10 - [info] [mqtt-broker:19211dbb.e6dee2] Disconnected from broker: mqtt://localhost:1883
Apr 25 20:32:25 emonpi Node-RED[457]: 25 Apr 20:32:25 - [info] [mqtt-broker:19211dbb.e6dee2] Connected to broker: mqtt://localhost:1883

(during the past 12 hours I see three of these MQTT disconnect/connect messages)

Most of the time the MQTT broker disconnect/connect takes about 15 seconds. But there are times when it takes longer. So if you run CPU intense scripts, backups, etc., you may need to increase your 60 second delay.
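
To illustrate the timing point, here is a minimal Python sketch of a "no message for N seconds" watchdog. It is not the Node-RED trigger node itself; the class name `FeedWatchdog` and the 75 second gap are assumptions made up for the example.

```python
class FeedWatchdog:
    """Flags a feed as stalled when no message arrives within timeout_s."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = None  # time of the most recent message, in seconds

    def feed(self, now):
        """Record that a message arrived at time `now`."""
        self.last_seen = now

    def stalled(self, now):
        """True if a message has been seen and the gap exceeds the timeout."""
        return self.last_seen is not None and (now - self.last_seen) > self.timeout_s


# A hypothetical 75 second gap (e.g. a slow broker reconnect) trips a
# 60 second timeout but not a 120 second one.
short = FeedWatchdog(60)
longer = FeedWatchdog(120)
short.feed(0)
longer.feed(0)
print(short.stalled(75), longer.stalled(75))
```

With the longer timeout the same gap no longer triggers a restart, which is the trade-off: fewer false alarms, slower detection of a real stall.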

pb66 · 26 April 2017 16:36

Could it be one of these MQTT broker disconnects that causes the random “mqtt thread is dead” in emonhub?

Dave · 26 April 2017 19:33

Hi Jon & Paul
TBH my emonpi crashes during the day and in the middle of the night. I don’t run any additional once-a-day or load-intensive scripts. What I do know is that this flow is a workaround.

Regards
Dave

Jon · 27 April 2017 14:11

Dave -
This command will display any mqtt items in the log:

cat /var/log/syslog | grep -ie mqtt -e mosquitto

Csaba_Zagoni · 5 June 2017 09:21

Thanks for all the comments above.

I’m quite disappointed that so far I have not received any constructive response from Open Energy Monitor. This issue seems to be quite common and while I really appreciate Dave sharing his approach it does not work for me.

Having spent about £500 on an energy monitoring system that stops monitoring every 1-2 weeks, I think I’ve gone out of my way to try to get the system working with my own troubleshooting, reading through loads of forum posts, doing a firmware update etc. I think I’ve described my problem in great detail, trying to make it easy for you guys to progress - which has not happened at all.

My understanding has been that the official support for the system was via this forum but please correct me if I misunderstood and direct me to the channel that I can get this sorted.

As it stands now I’m the owner of a kit that is not fit for purpose. If there is a known reliability issue that causes the system to stop logging (which does not seem to be affecting only me), I think this should be communicated before the point of sale, but at the very least some sort of support would be nice.

I’m still hoping to get the system properly up and running so all your assistance would be much appreciated.
csaba

pb66 · 5 June 2017 13:36

Can you provide some emonhub.log for the period leading up to the first “Thread is dead” messages?

Depending on how long ago the last fail was and how much traffic your emonPi sees, that may be a fair way back, perhaps even in the rotated-out file (emonhub.log.1). The two logfiles can be up to 5 MB in size, so they can be slow to load and navigate. less tends to be the better tool for this job.
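
If paging through with less is awkward, a small script can pull out the lines leading up to the first occurrence. This is only a sketch: the function name is made up, the match is case-insensitive on the text "thread is dead", and the directory /var/log/emonhub/ for the rotated file is an assumption (adjust the paths to wherever your logs live).

```python
from collections import deque


def context_before_first(lines, marker="thread is dead", before=50):
    """Return the lines leading up to and including the first line containing marker.

    The match is case-insensitive. Returns an empty list if the marker never appears.
    """
    recent = deque(maxlen=before)  # rolling window of the preceding lines
    for line in lines:
        if marker.lower() in line.lower():
            return list(recent) + [line]
        recent.append(line)
    return []


if __name__ == "__main__":
    # Oldest file first, so the earliest occurrence is found.
    paths = ["/var/log/emonhub/emonhub.log.1", "/var/log/emonhub/emonhub.log"]
    for path in paths:
        try:
            with open(path, errors="replace") as f:
                found = context_before_first(f)
        except OSError as e:
            print("Could not read " + path + ": " + str(e))
            continue
        if found:
            print("".join(found), end="")
            break
```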

The admin page of emoncms doesn’t actually tell you what version firmware you are running, however emoncms has been updated several times since Feb (v9.8.0) so it might be worth updating emoncms, although I have to admit I cannot recall the recommended way to initiate that without the “update” button being present on the admin page.

Thinking out loud here, I think the easiest way would be to delete the data/emoncmsupdate.log and reboot the emonPi, that would cause a “firsttimebootupdate” and may take a while so don’t accidentally power it off whilst it is updating. HOWEVER!! rebooting will wipe the log files so I would hold off on any updating for now until we get a look at the logfiles before they are lost.

TrystanLea · 6 June 2017 08:17

Hello @Csaba_Zagoni, @Dave @Jon

I’ve been trying to think of a better way to catch the moment that the “thread is dead” problem occurs while also providing automatic recovery.

I think that rather than continue to print the “thread is dead” error message we can force emonhub to close down and then have a watchdog script restart emonhub while saving the last 100 lines of the emonhub log to a dedicated crash log.

For those happy to make modifications via terminal I have outlined the steps to do this below. @Dave it would be great if you could try this. @Csaba_Zagoni I have sent you a PM; I would be happy to help you with this if you can give me remote access. If this provides a solution and improved logging of “thread is dead” events from which we can further debug the root cause of the issue, then we can push this out as a general update.

Implementation steps:

Open to edit emonhub.py:

sudo nano /home/pi/emonhub/src/emonhub.py

Navigate to line 140 and add “self._exit = True” below the printing of the “thread is dead” error as so:

if not I.isAlive():
    self._log.warning(I.name + " thread is dead") # had to be restarted")
    self._exit = True

If you can’t find this point, this link might help:

github.com/openenergymonitor/emonhub/blob/emon-pi/src/emonhub.py#L140

            # ->avoid modification of iterable within loop
            for name in kill_list:
                self._log.warning(name + " thread is dead.")

                # The following should trigger a restart ... unless the
                # interfacer is also removed from the settings table.
                del(self._interfacers[name])

                # Trigger restart by calling update settings
                self._log.warning("Attempting to restart thread "+name+" (thread has been restarted "+str(restart_count[name])+" times...")
                restart_count[name]+=1
                self._update_settings(self._setup.settings)

            # Sleep until next iteration
            time.sleep(0.2)

    def close(self):
        """Close hub. Do some cleanup before leaving."""

        self._log.info("Exiting hub...")

Restart emonhub at this point:

sudo service emonhub restart
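
For anyone wondering what the one-line change achieves: instead of only logging the dead thread, the main loop sets its exit flag and the whole process ends, so that something outside can restart it. Below is a minimal, self-contained sketch of that pattern. It is not emonhub's code; `run_hub` and `short_lived` are hypothetical names, and it uses `is_alive()`, the Python 3 spelling (the `isAlive()` alias was removed in Python 3.9).

```python
import threading
import time


def run_hub(workers, poll_s=0.05):
    """Poll worker threads; stop and return the name of the first dead one."""
    exit_requested = False
    dead_name = None
    while not exit_requested:
        for worker in workers:
            if not worker.is_alive():
                print(worker.name + " thread is dead")
                dead_name = worker.name
                exit_requested = True  # plays the role of self._exit = True
                break
        time.sleep(poll_s)
    return dead_name


def short_lived():
    time.sleep(0.1)  # stands in for an interfacer thread that stops unexpectedly


worker = threading.Thread(target=short_lived, name="MQTT")
worker.start()
stopped_by = run_hub([worker])
```

Once the loop returns, the process can exit cleanly and leave recovery to an external watchdog.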

We can then add a watchdog that checks if emonhub is running and in the case that it is not restart it.

To create this basic watchdog, create a file watchdog.sh in /home/pi:

rpi-rw
cd
nano watchdog.sh

Paste the following content into that file:

#!/bin/bash
TEST=$( ps aux | grep "python /usr/share/emonhub/emonhub.py --config-file /home/pi/data/emonhub.conf" | wc -l )

LOG=$(tail -n 100 /var/log/emonhub/emonhub.log)

if [ $TEST -lt 2 ]; then
    echo "Emonhub is down, restarting!"
    sudo service emonhub restart

    echo "Last 100 lines of emonhub.log:"
    echo "$LOG"
fi

Save and exit

Make it executable with:

sudo chmod +x watchdog.sh

Then finally add to crontab with:

sudo crontab -e

crontab entry:

* * * * * /home/pi/watchdog.sh >> /home/pi/data/watchdog.log 2>&1
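
For reference, the same logic as a Python sketch, for anyone who would rather extend it in Python (for example to add a notification). The function names are made up for illustration and this has not been run on an emonPi. Note that the bash version tests for fewer than 2 matches because the grep process itself shows up in the `ps aux` output; reading the process list directly avoids that.

```python
import subprocess

# Substring expected in the emonhub command line (taken from the script above)
EMONHUB_CMD = "emonhub.py --config-file"
LOG_PATH = "/var/log/emonhub/emonhub.log"


def is_running(ps_output, needle=EMONHUB_CMD):
    """True if any line of `ps aux` style output mentions needle."""
    return any(needle in line for line in ps_output.splitlines())


def tail(text, n=100):
    """Return the last n lines of text as a list."""
    return text.splitlines()[-n:]


def check_once():
    """One watchdog pass: restart emonhub if it is not in the process list."""
    ps_output = subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout
    if is_running(ps_output):
        return False
    # Capture the log tail before restarting, as the bash script does.
    with open(LOG_PATH, errors="replace") as f:
        last_lines = tail(f.read())
    print("Emonhub is down, restarting!")
    subprocess.run(["sudo", "service", "emonhub", "restart"])
    print("Last 100 lines of emonhub.log:")
    print("\n".join(last_lines))
    return True
```

Calling `check_once()` at the end of the file and pointing the crontab entry at it would give the same once-a-minute check.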

Dave · 6 June 2017 10:54

Hi Trystan
I’ve added the code but have the following questions…

1. How will we know if the emonhub crashes? atm I have a “pushover” notification to my phone generated by a NodeRED flow and it also issues a “sudo service emonhub restart” command that restarts the emonhub service.
2. What’s the location of the log file?
3. Would it be a good idea to start a new thread so people can post the log files separate to this thread?

Regards
Dave

Csaba_Zagoni · 6 June 2017 13:39

Hi All,

Thanks a lot for the detailed responses.

@pb66 - Unfortunately, I found no usable content in the log files, not even the rotated ones. The last crash was a while ago… Also, there is no data/emoncmsupdate.log file on my pi. So I moved on to Trystan’s suggestion.

@TrystanLea - I’ve implemented the steps you suggested. As the crash happens usually 1-2 weeks from startup I’ll wait to see what is captured and will let you know. What is the location of the crash log file?

All the best,
csaba

TrystanLea · 6 June 2017 15:12

Thanks @Dave, @Csaba_Zagoni

1. Only from looking at the logfile, but if you want you could perhaps add to watchdog.sh to send some kind of push notification to nodered to then send a notification to your phone?
2. Location is /home/pi/data/watchdog.log
3. Yes, good idea.

Juerg1 · 14 July 2017 01:41

@Csaba: I noticed similar problems with feeds no longer updating after the emonPi had been running for a few days. I have now applied the same steps as suggested by @TrystanLea (many thanks!) and will now wait and see for a week or two.

I presume this has not become part of a newer version?

Dave · 14 July 2017 06:11

I’ve been meaning to look for this thread.
@TrystanLea Please see below crash log from the watchdog.

watchdog.log.txt (140.6 KB)

Regards
Dave

TrystanLea · 14 July 2017 12:30

Thanks @Dave. Did it recover ok in every case with the watchdog?

@Juerg1 no this is not yet a part of a newer version. @pb66 and I have been discussing larger changes to EmonHub that should improve things quite a bit though, so there will be progress happening on this.

Dave · 14 July 2017 15:09

Hi @TrystanLea
Yes I can confirm that the watchdog resets EmonCMS.
Have you found anything useful in the logs?

Regards
Dave

tim · 14 July 2017 15:18

BTW, I’ve integrated systemd sd_notify() watchdog support here which includes an automatic restart option: GitHub - tim-seoss/emonhub at emon-pi-systemd - if you’d like to test it.

See separate thread - RFC - systemd integration with watchdog support and restart on thread failure

Tim.

Juerg1 · 15 July 2017 00:06

Thanks @TrystanLea for the update. Very good to know.
Cheers
Juerg

pb66 · 15 July 2017 10:55

Dave, The logs show no positive indication of a cause for the thread to fail.

However, we can see that the thread died between publishing the first value to the first topic and publishing the second value to the second topic. This is a closed loop that publishes each value in the packet to a separate topic in turn.

There is no evidence of any out of range values, wrong datatypes or missing values etc so there is no evident reason for the mqtt interfacer to crash mid loop.

This would possibly point to an unhandled MQTT connection-type error. The reason the “RFM2Pi” thread has crashed is that the MQTT code is incorrectly running from within the RFM2Pi thread, so when there is an MQTT issue it brings down both the MQTT and RFM2Pi threads. @TrystanLea is aware of this and it is on the hit list to be dealt with; it is unlikely to be the cause of the fail (although I cannot be sure). All “RFM2Pi” activity is blocked until all the MQTT is published, so it cannot be the RFM2Pi code. This was proved a while back with the addition of the “Sent to channel(start)” and “Sent to channel(end)” messages.

As expected, it looks very much like the MQTT implementation, either internal or external to emonhub, is the place to be looking.

It reaches and runs this line without a problem:

github.com/openenergymonitor/emonhub/blob/emon-pi/src/interfacers/EmonHubMqttInterfacer.py#L125

                    value = frame['data'][i]

                    # Construct topic
                    topic = self._settings["nodevar_format_basetopic"]+nodename+"/"+inputname
                    payload = str(value)

                    self._log.debug("Publishing: "+topic+" "+payload)
                    result =self._mqttc.publish(topic, payload=payload, qos=2, retain=False)

                    if result[0]==4:
                        self._log.info("Publishing error? returned 4")
                        return False

                # send rssi
                if 'rssi' in frame:
                    topic = self._settings["nodevar_format_basetopic"]+nodename+"/rssi"
                    payload = str(frame['rssi'])

                    self._log.debug("Publishing: "+topic+" "+payload)
                    result =self._mqttc.publish(topic, payload=payload, qos=2, retain=False)

But it fails to reach that same point in the next loop. Given that it successfully reached that point once, we might assume it should be able to again, so that might suggest the issue is in the remaining part of the loop rather than at the beginning of the next. Very loose assumptions, but we need to start somewhere.

Therefore you could try editing ~/emon-pi/src/interfacers/EmonHubMqttInterfacer.py and adding a try/except to the next line (L126) so it looks like so:


                    try:
                        result =self._mqttc.publish(topic, payload=payload, qos=2, retain=False)
                    except Exception as e:
                        self._log.warning("Unable to publish: "+topic+" "+payload+" error: "+ str(e))
                        # result is not set when publish raises, so stop here
                        return False

and see if that uncovers anything.
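
To show what the try/except buys you, here is a self-contained sketch with a fake client. `FlakyClient`, `publish_all` and the topic names are hypothetical stand-ins, not emonhub or paho code: an exception raised mid-loop is logged with its message instead of silently killing the thread.

```python
class FlakyClient:
    """Hypothetical stand-in for an MQTT client whose publish() sometimes raises."""

    def __init__(self, fail_on):
        self.fail_on = fail_on
        self.sent = []

    def publish(self, topic, payload):
        if topic == self.fail_on:
            raise OSError("connection lost")
        self.sent.append((topic, payload))
        return (0, len(self.sent))  # (result code, message id)


def publish_all(client, values, log):
    """Publish each value to its own topic; log and stop on any exception."""
    for topic, value in values:
        try:
            client.publish(topic, str(value))
        except Exception as e:
            log.append("Unable to publish: " + topic + " error: " + str(e))
            return False
    return True


log = []
client = FlakyClient(fail_on="emon/emonpi/power2")
ok = publish_all(client, [("emon/emonpi/power1", 100), ("emon/emonpi/power2", 200)], log)
print(ok, client.sent, log)
```

Here the first value is published, the second raises, and the log records which topic failed and why, which is the kind of evidence the current logs lack.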

See my comments in post #7 of the “Emonhub minor code cleanup, enhancements and python3 support” thread about adding some tracebacks and restarting threads internally to emonhub.

glyn.hudson · 7 August 2017 20:30

Could anyone still having trouble with ‘thread is dead’ issues try out the new dev branch for emonHub and report back: