Emonpi low-write 9.8.25 - inputs stopped working

@Bernie_Carter @TrystanLea @glyn.hudson @pb66

Are you any further along or have you found anything else out? mine has crashed again and I am almost at the point of a full reinstall and restore of the database or even just create a job to reboot the pi every 2 days.

When it loses connection it dumps about 2 days worth of data. even if you start the feed writer after it crashed, the number in cache counts down but the data is still lost.

Anyone else have any ideas how to fix or even better do we need a watchdog service.

If anyone wants access to it when its crashed let me know and i will setup an ssh session to the pi

Paul

@Pukka

I ruled out NodeRed, as mine was completely down after a failed upgrade,
and MQTT still crashed.

I will look into the situation further this weekend and see if anything
else pops up.

It sounds odd that you are losing data pre-crash. Mine just doesnā€™t log
any new data.

If you want to set me up with an SSH, ill certainly take a look at it and
compare, just PM me the details.

Bernie

@Pukka As I suggested above, the best way to monitor this is to install MONIT and use the ā€˜Timestamp testā€™ to look for when the feed directories are accessed / written to.

You could get it to run a small script to output the status of the services to a file.

You can also monitor the feedwriter service and try a restart.

Looking a bit deeper at the log you posted, it looks to me like a MQTT mosquitto feed that is causing the problem. Which feeds are MQTT feeds? Have you tried disabling any of these feeds?

@glyn.hudson - Paul says he is using the ā€˜low-writeā€™ version. The phpmqtt_input.php file in the low-write branch does not have 220 lines! Iā€™m not really sure how this all hangs together but could this be an issue?

These things can be a real pain to work outā€¦

@borpin

Iā€™m running the exact same version of low-write and my phpmqtt_input has 297 lines.
Line 220 is exactly as shown in the error log
$dbinputs = $input->get_inputs($userid);

Is there a different version?

@borpin - The low-write btanch is a bit of a relic, last updated in Apr '16, ā€œlow-writeā€ in now a ā€œmodeā€ rather than a separate branch or version.

The line cited in the log is still current in the ā€œstableā€ branch,

For that line to trip up I would expect there is no userid, no user with that userid or no inputs for that user. Since the userid is hard coded to "1! and almost all emoncms instances have a user ā€œ1ā€ setup, I doubt it could be either of those so perhaps it couldnā€™t find any inputs for that user, redis flushed or mysql offline or something along those lines?

It would seem that log file is 6weeks old, can you be sure the problem is the same ? You are now running emoncms v9.8.28 and have updated the emonPi since then, so some fresh logs would be useful.

The best course of action is always to fix the cause rather than hide it behind a watchdog and restart.

By all means set up a watchdog to tell you when it misbehaves, and grab the logfiles and info when it does fail, but if you start setting up auto restarts etc you may never find the issue. You might not of even had this issue if prior issues were properly diagnosed and resolved rather than resorting to workarounds.

Can you provide new logs for the time leading up to and beyond a fail (emonhub and emoncms)?

Also check the status of the mqtt_input, emonhub and feedwriter services and post the info?

1 Like

Hi All

Thanks for your help so far

I have updated the PI with the latest os updates and node-red. if/when it crashes in the next few days i will post the log files here. i will also try and download the log files everyday.

Paul

Exactly, thatā€™s the direction Iā€™m looking as well.

Mine shows up as a broken pipe, Pauls shows up with a connection error. Whether these 2 incidents are related remains to be seen, but it appears to me then thereā€™s an issue with the MQTT Broker while running in this configuration.

Iā€™m now running an update on mine as well so that Iā€™m inline with pauls setup, and I will again see what happens.

I cannot really comment on the issue you are seeing as you have not posted any info or logs

Iā€™ve not seen any signs of the Mosquitto broker failing, it does seem the issues are all common to MQTT, but thatā€™s where the most traffic is on an emonPi/SD, plus all the ā€œtrickyā€ stuff, eg bespoke node-red flows and the more exotic emonhub interfacers.

Just looking at the log againā€¦

exception is raised on 116:

Examining this bit of code (and excuse my lack of PHP knowledge) Iā€™d have expected an exception on 116 to be caught by the try catch. If this is in a namesapce (donā€™t really understand that) the ā€˜Exceptionā€™ should have a leading \ (apparently).

The systemd service is set to restart when failed. Next time it crashes can you check the status of the mqtt_input.service and the mosquitto.service?

I have been looking into what is causing paulā€™s issues. I did find one thing of interest:

emoncms.log shows the following:
2018-03-10 13:50:03.913|WARN|phpmqtt_input.php|Not connected, retrying connection
2018-03-10 13:50:03.951|WARN|phpmqtt_input.php|Connecting to MQTT server: Connection Accepted.: code: 0

AND

Node-red changes its PID and PPID:
[GMT Mar 9 13:38:11] error : ā€˜node-redā€™ process PID changed from 543 to 29534
[GMT Mar 9 13:38:11] error : ā€˜node-redā€™ process PPID changed from 1 to 29521
[GMT Mar 9 13:40:11] error : ā€˜node-redā€™ process PID changed from 29534 to 543
[GMT Mar 9 13:40:11] error : ā€˜node-redā€™ process PPID changed from 29521 to 1
[GMT Mar 10 13:38:13] error : ā€˜node-redā€™ process PID changed from 543 to 18153
[GMT Mar 10 13:38:13] error : ā€˜node-redā€™ process PPID changed from 1 to 18139
[GMT Mar 10 13:42:13] error : ā€˜node-redā€™ process PID changed from 18153 to 543
[GMT Mar 10 13:42:13] error : ā€˜node-redā€™ process PPID changed from 18139 to 1
[GMT Mar 11 01:38:30] error : ā€˜node-redā€™ process PID changed from 543 to 28738
[GMT Mar 11 01:38:30] error : ā€˜node-redā€™ process PPID changed from 1 to 28724
[GMT Mar 11 01:42:30] error : ā€˜node-redā€™ process PID changed from 28738 to 543
[GMT Mar 11 01:42:30] error : ā€˜node-redā€™ process PPID changed from 28724 to 1

AND

The Cpu wait time averages 30-71% ever 12 hours (around 1 am and 1 pm daily)

This all corresponds to when Node-Red runs a backup script to Dropbox.
Now I can understand the wait time, but why the PID change and evenmore so why the MQTT error?

The only other issue I saw was:
[GMT Mar 9 23:07:09] error : ā€˜mqtt_inputā€™ process PID changed from 13989 to 1969
[GMT Mar 9 23:07:09] error : ā€˜mqtt_inputā€™ process PPID changed from 2267 to 1

just logged in and found a random input listed.

Yes in a try/catch if the try fails the catch is executed, however that has been somewhat undermined by the content of the catch.

    if ($connected) $log->error($e);

essentially says if there is a working connection log the error, otherwise (if there is no working connection) just carry on as if nothingā€™s wrong.

Presumably this was done to hide all the ā€œfailed to connectā€ error that used to happen, but along with it has gone any useful info, unless of course it raises and exception with a working connection.

I think it would be marginally better to have something like

    if ($connected) {
        $log->error($e);
    } else {
        $log->info($e);
    }

so that the errors can be seen if the ā€œinfoā€ log level is selected. But that still doesnā€™t seem right to me.

This maybe just a coincidence and totally unrelated, BUT, not so long ago I installed a dropbox utility to a Pi3 and I had to remove it again as it regularly brought the Pi to itā€™s kneeā€™s, I mean to the point I could barely ssh in. I donā€™t recall off-hand what code I used, but it certainly wasnā€™t anything node-red as I donā€™t use it.

The emonhub.logā€™s for the time leading up to and beyond the time of the incident (19hrs previous) would be very useful. As would your emonhub.conf nodes section.

If the ā€œemonth1ā€ node is correctly defined emonhub will not allow anything that doesnā€™t conform to that definition through. So if it is defined correctly the ā€œrogue node inputā€ has come from elsewhere bypassing emonhub. Otherwise a tighter definition will prevent further occurrences. But the emonhub.log (if you have it) will be more exact, ie we should see if that packet passed via emonhub and how, rather than assuming the node definition is working.

emonhub.txt (1.2 MB)

#######################################################################
#######################      emonhub.conf     #########################
#######################################################################
### emonHub configuration file, for info see documentation:
### http://github.com/openenergymonitor/emonhub/blob/emon-pi/configuration.md
#######################################################################
#######################    emonHub  settings    #######################
#######################################################################

[hub]
### loglevel must be one of DEBUG, INFO, WARNING, ERROR, and CRITICAL
loglevel = DEBUG
### Uncomment this to also send to syslog
# use_syslog = yes
#######################################################################
#######################       Interfacers       #######################
#######################################################################
[interfacers]

### This interfacer manages communication to BMW API for electric car monitoring
[[BMWi3]]
    Type = EmonHubBMWInterfacer
    [[[init_settings]]]
        timeinverval =600
        duringchargetimeinterval=60
        nodeid = 28
        tempcredentialfile = /tmp/bmwcredentials.json
        bmwapiusername = username
        bmwapipassword = password

    [[[runtimesettings]]]
        pubchannels = ToEmonCMS,

### This interfacer manages the RFM12Pi/RFM69Pi/emonPi module
[[RFM2Pi]]
    Type = EmonHubJeeInterfacer
    [[[init_settings]]]
        com_port = /dev/ttyAMA0
        com_baud = 38400                        # 9600 for old RFM12Pi
    [[[runtimesettings]]]
        pubchannels = ToEmonCMS,
        subchannels = ToRFM12,

        group = 210
        frequency = 433
        baseid = 5                              # emonPi / emonBase nodeID
        calibration = 230V                      # (UK/EU: 230V, US: 110V)
        quiet = true                            # Disable quite mode (default enabled) to enable RF packet debugging, show packets which fail crc
        # interval =  0                         # Interval to transmit time to emonGLCD (seconds)
        

[[MQTT]]

    Type = EmonHubMqttInterfacer
    [[[init_settings]]]
        mqtt_host = localhost
        mqtt_port = 1883
        mqtt_user = emonpi
        mqtt_passwd = emonpimqtt2016

    [[[runtimesettings]]]
        pubchannels = ToRFM12,
        subchannels = ToEmonCMS,

        # emonhub/rx/10/values format
        # Use with emoncms Nodes module
        node_format_enable = 1
        node_format_basetopic = emonhub/

        # emon/emontx/power1 format - use with Emoncms MQTT input
        # http://github.com/emoncms/emoncms/blob/master/docs/RaspberryPi/MQTT.md
        nodevar_format_enable = 1
        nodevar_format_basetopic = emon/

## [[emoncmsorg]]
##  Type = EmonHubEmoncmsHTTPInterfacer
##    [[[init_settings]]]
##    [[[runtimesettings]]]
##        pubchannels = ToRFM12,
##        subchannels = ToEmonCMS,
##       url = https://emoncms.org
##        apikey = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
##        senddata = 1                    # Enable sending data to Emoncms.org
##        sendstatus = 1                  # Enable sending WAN IP to Emoncms.org MyIP > https://emoncms.org/myip/list
##        sendinterval= 30                # Bulk send interval to Emoncms.org in seconds

#######################################################################
#######################          Nodes          #######################
#######################################################################

[nodes]

## See config user guide: http://github.com/openenergymonitor/emonhub/blob/master/configuration.md

[[5]]
    nodename = emonpi
    [[[rx]]]
        names = power1,power2,power1pluspower2,vrms,t1,t2,t3,t4,t5,t6,pulsecount
        datacodes = h, h, h, h, h, h, h, h, h, h, L
        scales = 1,1,1,0.01,0.1,0.1,0.1,0.1,0.1,0.1,1
        units = W,W,W,V,C,C,C,C,C,C,p

[[6]]
    nodename = emontxshield
    [[[rx]]]
       names = power1, power2, power3, power4, vrms
       datacode = h
       scales = 1,1,1,1,0.01
       units =W,W,W,W,V

[[7]]
   nodename = emontx4
   [[[rx]]]
      names = power1, power2, power3, power4, vrms, temp1, temp2, temp3, temp4, temp5, temp6, pulse
      datacodes = h,h,h,h,h,h,h,h,h,h,h,L
      scales = 1,1,1,1,0.01,0.1,0.1, 0.1,0.1,0.1,0.1,1
      units =W,W,W,W,V,C,C,C,C,C,C,p

[[8]]
    nodename = emontx3
    [[[rx]]]
       names = power1, power2, power3, power4, vrms, temp1, temp2, temp3, temp4, temp5, temp6, pulse
       datacodes = h,h,h,h,h,h,h,h,h,h,h,L
       scales = 1,1,1,1,0.01,0.1,0.1, 0.1,0.1,0.1,0.1,1
       units =W,W,W,W,V,C,C,C,C,C,C,p

[[9]]
   nodename = emontx2
   [[[rx]]]
      names = power1, power2, power3, power4, vrms, temp1, temp2, temp3, temp4, temp5, temp6, pulse
      datacode = h
      scales = 1,1,1,1,0.01,0.1,0.1, 0.1,0.1,0.1,0.1,1
      units =W,W,W,W,V,C,C,C,C,C,C,p

[[10]]
    nodename = emontx1
    [[[rx]]]
       names = power1, power2, power3, power4, vrms, temp1, temp2, temp3, temp4, temp5, temp6, pulse
       datacode = h
       scales = 1,1,1,1,0.01,0.1,0.1, 0.1,0.1,0.1,0.1,1
       units =W,W,W,W,V,C,C,C,C,C,C,p

[[11]]
    nodename = 3phase
    [[[rx]]]
       names = powerL1, powerL2, powerL3, power4, Vrms, temp1, temp2, temp3, temp4, temp5, temp6, pulse
       datacodes = h,h,h,h,h,h,h,h,h,h,h,L
       scales = 1,1,1,1,0.01,0.1,0.1,0.1,0.1,0.1,0.1,1
       units =W,W,W,W,V,C,C,C,C,C,C,p
       
[[12]]
    nodename = 3phase2
    [[[rx]]]
       names = powerL1, powerL2, powerL3, power4, Vrms, temp1, temp2, temp3, temp4, temp5, temp6, pulse
       datacodes = h,h,h,h,h,h,h,h,h,h,h,L
       scales = 1,1,1,1,0.01,0.1,0.1,0.1,0.1,0.1,0.1,1
       units =W,W,W,W,V,C,C,C,C,C,C,p

[[13]]
    nodename = 3phase3
    [[[rx]]]
       names = powerL1, powerL2, powerL3, power4, Vrms, temp1, temp2, temp3, temp4, temp5, temp6, pulse
       datacodes = h,h,h,h,h,h,h,h,h,h,h,L
       scales = 1,1,1,1,0.01,0.1,0.1,0.1,0.1,0.1,0.1,1
       units =W,W,W,W,V,C,C,C,C,C,C,p

[[14]]
    nodename = 3phase4
    [[[rx]]]
       names = powerL1, powerL2, powerL3, power4, Vrms, temp1, temp2, temp3, temp4, temp5, temp6, pulse
       datacodes = h,h,h,h,h,h,h,h,h,h,h,L
       scales = 1,1,1,1,0.01,0.1,0.1,0.1,0.1,0.1,0.1,1
       units =W,W,W,W,V,C,C,C,C,C,C,p

[[19]]
   nodename = emonth1
   [[[rx]]]
      names = temperature, external temperature, humidity, battery
      datacode = h
      scales = 0.1,0.1,0.1,0.1
      units = C,C,%,V

[[20]]
   nodename = emonth2
   [[[rx]]]
      names = temperature, external temperature, humidity, battery
      datacode = h
      scales = 0.1,0.1,0.1,0.1
      units = C,C,%,V

[[21]]
   nodename = emonth3
   [[[rx]]]
      names = temperature, external temperature, humidity, battery
      datacode = h
      scales = 0.1,0.1,0.1,0.1
      units = C,C,%,V

[[22]]
   nodename = emonth4
   [[[rx]]]
      names = temperature, external temperature, humidity, battery
      datacode = h
      scales = 0.1,0.1,0.1,0.1
      units = C,C,%,V

[[23]]
    nodename = emonth5
    [[[rx]]]
       names = temperature, external temperature, humidity, battery, pulsecount
       datacodes = h,h,h,h,L
       scales = 0.1,0.1,0.1,0.1,1
       units = C,C,%,V,p

[[24]]
    nodename = emonth6
    [[[rx]]]
       names = temperature, external temperature, humidity, battery, pulsecount
       datacodes = h,h,h,h,L
       scales = 0.1,0.1,0.1,0.1,1
       units = C,C,%,V,p

[[25]]
    nodename = emonth7
    [[[rx]]]
       names = temperature, external temperature, humidity, battery, pulsecount
       datacodes = h,h,h,h,L
       scales = 0.1,0.1,0.1,0.1,1
       units = C,C,%,V,p

[[26]]
    nodename = emonth8
    [[[rx]]]
       names = temperature, external temperature, humidity, battery, pulsecount
       datacodes = h,h,h,h,L
       scales = 0.1,0.1,0.1,0.1,1
       units = C,C,%,V,p
[[19]]
   nodename = emonth1
   [[[rx]]]
      names = temperature, external temperature, humidity, battery
      datacode = h
      scales = 0.1,0.1,0.1,0.1
      units = C,C,%,V

Allows any data presenting itself as node 19 to be passed to emoncms as long as the payload has an even number of bytes (ie undefined number of ints)

[[19]]
   nodename = emonth1
   [[[rx]]]
      names = temperature, external temperature, humidity, battery
      datacodes = h,h,h,h
      scales = 0.1,0.1,0.1,0.1
      units = C,C,%,V

will only allow data packets that are exactly 8 bytes through to emoncms. (notice the use of ā€œdatacodesā€ in place of ā€œdatacodeā€ too)

I havenā€™t checked the emonpi variant of emonhubā€™s code for this yet but, I do not know what the result is of passing a packet with more than 4 ints and reaching the point where emonhub is applying the names and scales. I do not know if it has been written to handle the error if there is no name or scale available for the 5th (and subsequent) values. Perhaps this is the cause of you loss of data??? (if itā€™s an unhandled occurrance). have you noticed these extra inputs previously at all?

ps the emonhub.log is too recent to show anything relating to this occurance

I have made the change in the emonhub

This would be useful too knowā€¦

Yes I have, cannot remember the names.

Can you perhaps at least remember if they overlapped your valid nodes as this one does and/or if you noticed these extra inputs around the same time as any ā€œemonpi stops updatingā€ events?