OpenEnergyMonitor Community

Emonhub crashed with OOM error in /var/log/messages


Hello All,

— edited with further details —

I have an emonPi that yesterday stopped logging anything; when I noticed in the evening, I simply restarted the emonhub process and it started sending data again.

I log all data locally so now have a gap in my data collection.

I went looking for the emonhub log file, but it fills up so quickly that it no longer reached far enough back in time to show whether there were any errors. So I checked the other log files and noticed that /var/log/messages contained a Java OOM error at the same time as the emonhub crash. The system had been running for about 3 weeks at that point.

Here is the error:

Nov 28 10:21:58 emonpi kernel: [1866652.157791] java invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Nov 28 10:21:58 emonpi kernel: [1866652.157812] java cpuset=/ mems_allowed=0
Nov 28 10:21:58 emonpi kernel: [1866652.157839] CPU: 2 PID: 1525 Comm: java Not tainted 4.4.26-v7+ #915
Nov 28 10:21:58 emonpi kernel: [1866652.157849] Hardware name: BCM2709
Nov 28 10:21:58 emonpi kernel: [1866652.157889] [<80018784>] (unwind_backtrace) from [<80014058>] (show_stack+0x20/0x24)
Nov 28 10:21:58 emonpi kernel: [1866652.157914] [<80014058>] (show_stack) from [<80321364>] (dump_stack+0xd4/0x118)
Nov 28 10:21:58 emonpi kernel: [1866652.157938] [<80321364>] (dump_stack) from [<801550fc>] (dump_header+0x64/0x18c)
Nov 28 10:21:58 emonpi kernel: [1866652.157961] [<801550fc>] (dump_header) from [<80102c60>] (oom_kill_process+0x350/0x49c)
Nov 28 10:21:58 emonpi kernel: [1866652.157982] [<80102c60>] (oom_kill_process) from [<80103130>] (out_of_memory+0x30c/0x344)
Nov 28 10:21:58 emonpi kernel: [1866652.158005] [<80103130>] (out_of_memory) from [<80108448>] (__alloc_pages_nodemask+0x9a0/0x9e0)
Nov 28 10:21:58 emonpi kernel: [1866652.158025] [<80108448>] (__alloc_pages_nodemask) from [<801015c4>] (filemap_fault+0x238/0x50c)
Nov 28 10:21:58 emonpi kernel: [1866652.158049] [<801015c4>] (filemap_fault) from [<801ebd14>] (ext4_filemap_fault+0x3c/0x50)
Nov 28 10:21:58 emonpi kernel: [1866652.158074] [<801ebd14>] (ext4_filemap_fault) from [<8012abbc>] (__do_fault+0x4c/0xb4)
Nov 28 10:21:58 emonpi kernel: [1866652.158096] [<8012abbc>] (__do_fault) from [<8012e4a8>] (handle_mm_fault+0x8d8/0x112c)
Nov 28 10:21:58 emonpi kernel: [1866652.158116] [<8012e4a8>] (handle_mm_fault) from [<805bb448>] (do_page_fault+0x340/0x3b4)
Nov 28 10:21:58 emonpi kernel: [1866652.158136] [<805bb448>] (do_page_fault) from [<800092a8>] (do_PrefetchAbort+0x44/0xa8)
Nov 28 10:21:58 emonpi kernel: [1866652.158155] [<800092a8>] (do_PrefetchAbort) from [<805bafa4>] (ret_from_exception+0x0/0x1c)
Nov 28 10:21:58 emonpi kernel: [1866652.158167] Exception stack(0xb9b7bfb0 to 0xb9b7bff8)
Nov 28 10:21:58 emonpi kernel: [1866652.158181] bfa0:                                     76287900 ffffffff 00000032 00000000
Nov 28 10:21:58 emonpi kernel: [1866652.158198] bfc0: 76286c00 7698e4d0 76287900 76206438 76206448 76286c00 769b4360 638fed3c
Nov 28 10:21:58 emonpi kernel: [1866652.158213] bfe0: 00000000 638fed10 767c03d0 767ecb64 60000010 ffffffff
Nov 28 10:21:58 emonpi kernel: [1866652.158222] Mem-Info:
Nov 28 10:21:58 emonpi kernel: [1866652.158247] active_anon:216082 inactive_anon:18403 isolated_anon:0
Nov 28 10:21:58 emonpi kernel: [1866652.158247]  active_file:592 inactive_file:580 isolated_file:32
Nov 28 10:21:58 emonpi kernel: [1866652.158247]  unevictable:0 dirty:0 writeback:0 unstable:0
Nov 28 10:21:58 emonpi kernel: [1866652.158247]  slab_reclaimable:2267 slab_unreclaimable:3452
Nov 28 10:21:58 emonpi kernel: [1866652.158247]  mapped:2939 shmem:26226 pagetables:1274 bounce:0
Nov 28 10:21:58 emonpi kernel: [1866652.158247]  free:3060 free_pcp:88 free_cma:31
Nov 28 10:21:58 emonpi kernel: [1866652.158300] Normal free:12240kB min:8192kB low:10240kB high:12288kB active_anon:864328kB inactive_anon:73612kB active_file:2368kB in
Nov 28 10:21:58 emonpi kernel: [1866652.158312] lowmem_reserve[]: 0 0
Nov 28 10:21:58 emonpi kernel: [1866652.158327] Normal: 554*4kB (UMEC) 271*8kB (UMEC) 87*16kB (UMEH) 27*32kB (UME) 26*64kB (UMEH) 15*128kB (UMEH) 6*256kB (UM) 1*512kB (
Nov 28 10:21:58 emonpi kernel: [1866652.158404] 27490 total pagecache pages
Nov 28 10:21:58 emonpi kernel: [1866652.158415] 0 pages in swap cache
Nov 28 10:21:58 emonpi kernel: [1866652.158425] Swap cache stats: add 0, delete 0, find 0/0
Nov 28 10:21:58 emonpi kernel: [1866652.158433] Free swap  = 0kB
Nov 28 10:21:58 emonpi kernel: [1866652.158441] Total swap = 0kB
Nov 28 10:21:58 emonpi kernel: [1866652.158449] 253952 pages RAM
Nov 28 10:21:58 emonpi kernel: [1866652.158457] 0 pages HighMem/MovableOnly
Nov 28 10:21:58 emonpi kernel: [1866652.158464] 4837 pages reserved
Nov 28 10:21:58 emonpi kernel: [1866652.158472] 2048 pages cma reserved
Nov 28 10:21:58 emonpi kernel: [1866652.158481] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Nov 28 10:21:58 emonpi kernel: [1866652.158521] [  134]     0   134     2651      769       9       0        0             0 systemd-journal
Nov 28 10:21:58 emonpi kernel: [1866652.158538] [  136]     0   136     2827      100       7       0        0         -1000 systemd-udevd
Nov 28 10:21:58 emonpi kernel: [1866652.158555] [  526]     0   526     1700       57       8       0        0             0 cron
Nov 28 10:21:58 emonpi kernel: [1866652.158571] [  527]     0   527      849       51       5       0        0             0 systemd-logind
Nov 28 10:21:58 emonpi kernel: [1866652.158588] [  536]   106   536      999      105       6       0        0             0 avahi-daemon
Nov 28 10:21:58 emonpi kernel: [1866652.158604] [  538]   105   538     1404      121       6       0        0          -900 dbus-daemon
Nov 28 10:21:58 emonpi kernel: [1866652.158620] [  599]   106   599      969       56       5       0        0             0 avahi-daemon
Nov 28 10:21:58 emonpi kernel: [1866652.158636] [  646]     0   646     8036      319       9       0        0             0 rsyslogd
Nov 28 10:21:58 emonpi kernel: [1866652.158651] [  727]   999   727   123232   113409     239       0        0             0 python
Nov 28 10:21:58 emonpi kernel: [1866652.158668] [  932]     0   932      641       72       5       0        0             0 dhcpcd
Nov 28 10:21:58 emonpi kernel: [1866652.158685] [  934]     0   934     1966      115       8       0        0         -1000 sshd

Leading up to this were a number of 'Division by zero' error messages in /var/log/daemon, like these:

Nov 28 10:05:58 emonpi mqtt_input[20720]: PHP Warning:  Division by zero in /var/www/emoncms/Modules/process/process_processlist.php on line 666
Nov 28 10:05:58 emonpi mqtt_input[20720]: Warning: Division by zero in /var/www/emoncms/Modules/process/process_processlist.php on line 666
Nov 28 10:06:00 emonpi mqtt_input[20720]: PHP Warning:  Division by zero in /var/www/emoncms/Modules/process/process_processlist.php on line 666
Nov 28 10:06:00 emonpi mqtt_input[20720]: Warning: Division by zero in /var/www/emoncms/Modules/process/process_processlist.php on line 666
Nov 28 10:06:00 emonpi mqtt_input[20720]: PHP Warning:  Division by zero in /var/www/emoncms/Modules/process/process_processlist.php on line 666
Nov 28 10:06:00 emonpi mqtt_input[20720]: Warning: Division by zero in /var/www/emoncms/Modules/process/process_processlist.php on line 666
Nov 28 10:06:00 emonpi mqtt_input[20720]: PHP Warning:  Division by zero in /var/www/emoncms/Modules/process/process_processlist.php on line 666
Nov 28 10:06:00 emonpi mqtt_input[20720]: Warning: Division by zero in /var/www/emoncms/Modules/process/process_processlist.php on line 666
Nov 28 10:06:04 emonpi mqtt_input[20720]: PHP Warning:  Division by zero in /var/www/emoncms/Modules/process/process_processlist.php on line 666
Nov 28 10:06:04 emonpi mqtt_input[20720]: Warning: Division by zero in /var/www/emoncms/Modules/process/process_processlist.php on line 666
Nov 28 10:06:04 emonpi mqtt_input[20720]: PHP Warning:  Division by zero in /var/www/emoncms/Modules/process/process_processlist.php on line 666
Nov 28 10:06:04 emonpi mqtt_input[20720]: Warning: Division by zero in /var/www/emoncms/Modules/process/process_processlist.php on line 666
Nov 28 10:06:04 emonpi mqtt_input[20720]: PHP Warning:  Division by zero in /var/www/emoncms/Modules/process/process_processlist.php on line 666
Nov 28 10:06:04 emonpi mqtt_input[20720]: Warning: Division by zero in /var/www/emoncms/Modules/process/process_processlist.php on line 666

They were then followed by this, also from /var/log/daemon:

Nov 28 10:22:00 emonpi openhab.sh[1086]: 2017-11-28 10:22:00.010 [ERROR] [o.i.t.m.i.MqttBrokerConnection] - MQTT connection to broker was lost
Nov 28 10:22:00 emonpi openhab.sh[1086]: org.eclipse.paho.client.mqttv3.MqttException: Connection lost
Nov 28 10:22:00 emonpi openhab.sh[1086]: at org.eclipse.paho.client.mqttv3.internal.CommsReceiver.run(CommsReceiver.java:138) [mqtt-client-0.4.0.jar:na]
Nov 28 10:22:00 emonpi openhab.sh[1086]: at java.lang.Thread.run(Thread.java:745) [na:1.8.0_111]
Nov 28 10:22:00 emonpi openhab.sh[1086]: Caused by: java.io.EOFException: null
Nov 28 10:22:00 emonpi openhab.sh[1086]: at java.io.DataInputStream.readByte(DataInputStream.java:267) ~[na:1.8.0_111]
Nov 28 10:22:00 emonpi openhab.sh[1086]: at org.eclipse.paho.client.mqttv3.internal.wire.MqttInputStream.readMqttWireMessage(MqttInputStream.java:56) ~[na:na]
Nov 28 10:22:00 emonpi openhab.sh[1086]: at org.eclipse.paho.client.mqttv3.internal.CommsReceiver.run(CommsReceiver.java:100) [mqtt-client-0.4.0.jar:na]
Nov 28 10:22:00 emonpi openhab.sh[1086]: ... 1 common frames omitted
Nov 28 10:22:00 emonpi openhab.sh[1086]: 2017-11-28 10:22:00.077 [ERROR] [o.i.t.m.i.MqttBrokerConnection] - MQTT connection to 'mosquitto' was lost: Connection lost : ReasonCode 32109 : Cause : null
Nov 28 10:22:00 emonpi openhab.sh[1086]: 2017-11-28 10:22:00.079 [INFO ] [o.i.t.m.i.MqttBrokerConnection] - Starting connection helper to periodically try restore connection to broker 'mosquitto'
Nov 28 10:22:00 emonpi systemd[1]: Stopping LSB: Save the current clock & ntp drift on read-only filesystem...
Nov 28 10:22:00 emonpi ntp-backup[6423]: Running ntp-backup to save current clock data.
Nov 28 10:22:00 emonpi Node-RED[30119]: 28 Nov 10:22:00 - [info] [mqtt-broker:843dc542.449488] Disconnected from broker: mqtt://localhost:1883
Nov 28 10:22:01 emonpi systemd[1]: Starting LSB: Save the current clock & ntp drift on read-only filesystem...
Nov 28 10:22:01 emonpi ntp-backup[6446]: Starting ntp-backup.
Nov 28 10:22:01 emonpi systemd[1]: Started LSB: Save the current clock & ntp drift on read-only filesystem.
Nov 28 10:22:01 emonpi Node-RED[30119]: 28 Nov 10:22:01 - [info] [solaredge:82667858.1ee358] {"overview":{"lastUpdateTime":"2017-11-28 10:14:28","lifeTimeData":{"energy":1.2598511E7,"revenue":1815.375},"lastYearData":{"energy":3931629.2},"lastMonthData":{"energy":163355.03},"lastDayData":{"energy":790.1149},"currentPower":{"power":819.5},"measuredBy":"INVERTER"}}
Nov 28 10:22:01 emonpi cron[526]: 2017-11-28 10:22:01 1eJd1d-0001g7-H0 Failed to create spool file /var/spool/exim4/input//1eJd1d-0001g7-H0-D: Read-only file system
Nov 28 10:22:10 emonpi openhab.sh[1086]: 2017-11-28 10:22:10.082 [INFO ] [o.i.t.m.i.MqttBrokerConnection] - Starting MQTT broker connection 'mosquitto'
Nov 28 10:22:14 emonpi dhclient: DHCPREQUEST on wlan0 to 192.168.1.1 port 67
Nov 28 10:22:15 emonpi Node-RED[30119]: 28 Nov 10:22:15 - [info] [mqtt-broker:843dc542.449488] Connected to broker: mqtt://localhost:1883

My emoncms is on the latest version, I think:

Server Information
Emoncms	Version	low-write 9.8.8 | 2017.07.18
Modules	Administration | App v1.0.0 | Backup v1.0.0 | EmonHub Config v1.0.0 | Dashboard v1.1.1 | EventProcesses | Feed | Graph v1.0.0 | Input | postprocess | CoreProcess | Schedule | setup | Time | User | Visualisation | WiFi v1.0.0
Buffer	47 feed points pending write
Writer	Daemon is running with sleep 60s
Server	OS	Linux 4.4.26-v7+
Host	emonpi emonpi (127.0.1.1)
Date	2017-11-29 09:17:19 UTC
Uptime	09:17:19 up 22 days, 13:26, 1 user, load average: 0.31, 0.21, 0.23
HTTP	Server	Apache/2.4.10 (Raspbian) HTTP/1.1 CGI/1.1 80
Database	Version	MySQL 5.5.52-0+deb8u1
Host	localhost (127.0.0.1)
Date	2017-11-29 09:17:19 (UTC 00:00)
Stats	Uptime: 1949163 Threads: 1 Questions: 173874 Slow queries: 0 Opens: 58 Flush tables: 1 Open tables: 51 Queries per second avg: 0.089
Redis	Version	2.8.17
Host	localhost:6379 (127.0.0.1)
Size	126 keys (552.12K)
Uptime	22 days
MQTT	Version	1.4.10
Host	localhost:1883 (127.0.0.1)
Pi	CPU Temp	57.46°C
Release	emonSD-07Nov16
File-system	Set root file-system temporarily to read-write (default read-only)
Memory	RAM	
Used 44.61%
Total: 973.11 MB Used: 434.07 MB Free: 539.03 MB
Disk	Mount	Stats
/	
Used 64.80%
Total: 3.33 GB Used: 2.16 GB Free: 1.01 GB
/home/pi/data	
Used 60.88%
Total: 193.66 MB Used: 117.89 MB Free: 65.77 MB
/boot	
Used 34.75%
Total: 59.95 MB Used: 20.83 MB Free: 39.12 MB
PHP	Version	5.6.27-0+deb8u1 (Zend Version 2.6.0)
Modules	apache2handler | bcmath | bz2 | calendar | Core v5.6.27-0+deb8u1 | ctype | curl | date v5.6.27-0+deb8u1 | dba | dio v0.0.4RC4 | dom v20031129 | ereg | exif v1.4 | fileinfo v1.0.5 | filter v0.11.0 | ftp | gettext | hash v1.0 | iconv | json v1.3.6 | libxml | mbstring | mcrypt | mhash | mosquitto v0.3.0 | mysql v1.0 | mysqli v0.1 | openssl | pcre | PDO v1.0.4dev | pdo_mysql v1.0.2 | Phar v2.0.2 | posix | readline v5.6.27-0+deb8u1 | redis v3.1.4 | Reflection | session | shmop | SimpleXML v0.1 | soap | sockets | SPL v0.2 | standard v5.6.27-0+deb8u1 | sysvmsg | sysvsem | sysvshm | tokenizer v0.1 | wddx | xml | xmlreader v0.1 | xmlwriter v0.1 | Zend OPcache v7.0.6-devFE | zip v1.12.5 | zlib v2.0 | 

Is there any way to monitor emonhub's memory usage, or should I simply add a cron job that restarts emonhub, say, once a week?
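For context, here is a crude sketch of the kind of monitoring I have in mind (plain Python, assuming a Linux /proc filesystem and that the emonhub process can be found by name; both are assumptions on my part, not anything from the emonPi image):

```python
import os

def find_pid(name):
    """Return the PID of the first process whose cmdline contains name, else None."""
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/cmdline' % pid) as f:
                if name in f.read():
                    return int(pid)
        except IOError:
            continue  # process vanished or is not readable
    return None

def rss_kb(pid):
    """Resident set size of a process in kB, read from /proc/<pid>/status."""
    with open('/proc/%d/status' % pid) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])  # VmRSS is reported in kB
    return 0

if __name__ == '__main__':
    pid = find_pid('emonhub')
    if pid is not None:
        print('emonhub rss_kb=%d' % rss_kb(pid))
```

Run from cron every few minutes with the output appended to a file, this would at least show whether emonhub's footprint grows over time.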

Is this a known issue?

Best Regards

Morten


Hi Morten, firstly thanks for providing the info; this is very useful.

There are multiple layers to the issue you have experienced and unfortunately at this point the picture isn’t complete so I can only speculate based on previous debugging experiences in this area and the new info you have here.

The emonHub logs would have been very useful, but logrotate swaps them out too soon. emonHub has its own built-in log rotation at a 5 MB file size and a maximum of 2 files, so you could edit the logrotate config to exclude emonhub.log; that would help retain more log (<10 MB instead of <3 MB).
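For example (illustrative only; the stanza path and options below are my assumptions, and your logrotate files may well differ), you would comment out or delete whichever logrotate stanza matches emonhub.log so that only emonhub's built-in rotation applies:

```
# /etc/logrotate.d/emonhub  (illustrative; your file and paths may differ)
# Comment out the whole stanza so logrotate leaves emonhub.log to
# emonhub's own built-in rotation (5 MB files, 2 kept):
#
# /var/log/emonhub/emonhub.log {
#     rotate 2
#     size 1M
#     missingok
#     notifempty
# }
```

If emonhub.log is instead matched by a wildcard stanza, you would need to narrow that glob rather than comment anything out.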

Additionally, can I suggest you run the emonPi updater? There were some changes made recently to the emonPi variant of emonhub to curtail some newly introduced runaway logs. I cannot be sure whether you have those changes, as emoncms 9.8.8 was the current version both before and after the emonhub changes.

There are known issues with the mqtt_input of emoncms, and we know there was an issue with emonhub's MQTT interfacer crashing out and pulling down other threads in emonhub. There have been some changes to combat the latter which, again, I cannot tell if you have or not.

The "Division by zero" error is an unhandled scenario in emoncms; it seems it can occur any time an input value is presented again with the same timestamp. This could happen quite frequently, and IMO it should be handled in emoncms to make it more robust.

However, in this particular instance the QoS of any MQTT data from emonhub is 2 (exactly once), so this raises the question of why the same timestamp was seen multiple times. Nonetheless, the problem should be handled.

Your logs indicate there was an MQTT issue; this was reported by openHAB and Node-RED, but we do not yet know if emonhub reported anything in emonhub.log. The emonhub.log matters because it would tell us whether the MQTT interfacer was brought down by an MQTT issue, and whether it failed to recover in its own right or passed that failure on elsewhere.

At this point, if I were to hazard a guess, I would say the emoncms mqtt_input possibly caused a problem that it, Node-RED and openHAB all recovered from, and that emonhub did not, hence restarting emonhub resumed play.

Where is all your data coming from? Is it just the emonPi data or do you have data coming from other places?

I see some SolarEdge data coming via Node-RED; tell us more about that too. Is it coming into emoncms via the mqtt_input? What is the update frequency and data make-up? I'm wondering if it's possible for it to resend the same data. Is there a QoS defined?

Personally I think that is a bit crude; it just masks the issue, making it harder to locate and fix. If I took my car to the garage and said it keeps cutting out every 30-40 miles and they suggested I pull over and restart it every 20 miles, I would not be happy :slight_smile:

But you are not the first to suggest this and if that is what you want to do as a quick fix, fair enough.

I'm not able to investigate how right now, but I would recommend making logrotate exclude emonhub.log after fully updating the emonPi, and keeping an eye on it for a while before resorting to a cron restart.

We can easily add a fix for the “Divide by zero” issue that may (or may not) be the trigger. Something like

    if ($time_diff > 0) {
        $power = ($val_diff * 3600) / $time_diff;
        if ($val_diff > 0 && $power < $max_power) $totalwh += $val_diff;
    }

should do the trick.
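For anyone following along, the same guard expressed in Python (a hypothetical stand-alone helper for illustration, not emoncms code):

```python
def wh_accumulate(totalwh, val_diff, time_diff, max_power):
    # Skip the power calculation entirely when two samples share the
    # same one-second timestamp (time_diff == 0), which avoids the
    # division-by-zero warning seen in the logs.
    if time_diff > 0:
        power = (val_diff * 3600.0) / time_diff
        if val_diff > 0 and power < max_power:
            totalwh += val_diff
    return totalwh

# A repeated timestamp (time_diff == 0) leaves the total untouched;
# a normal 10-second interval accumulates the 5 Wh difference.
print(wh_accumulate(100.0, 5.0, 0, 25000))
print(wh_accumulate(100.0, 5.0, 10, 25000))
```

The point is simply that the zero-interval case becomes a no-op rather than a warning.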

[edit] @TrystanLea, could you confirm this change to the whaccumulator is valid/viable?


Hi Paul,

Many thanks for getting back to me. I think you're right that it wasn't necessarily emonhub that caused the issue, but some other Java process that eventually ate up all the memory, at which point the kernel's OOM killer kicked in.

I have just run the emonPi updater this morning, after I hit this issue and was poking around in the admin UI, so with a bit of luck that will have cured the issue.

My data is coming from an emonPi and an emonTH, so there isn't that much data, I would think. The SolarEdge and Wunderground weather data is collected via API calls every 15 minutes from Node-RED and then sent to emonhub via MQTT. I also have a presence script in Node-RED that tries to detect my phone.

I agree with you, Paul, that simply restarting emonhub once a week is not a valid solution, especially if it's just a random casualty of some other Java process eating up all the memory. For now I will look at the logrotate configuration to give me a bit more history if/when it happens again. :slight_smile:

The 'divide by zero' issue could probably be fixed as you suggest, but why were there two samples with the same timestamp? What is the granularity of the timestamps: are they in seconds, or fractions of a second?

I haven't changed any of the QoS settings, so they should all be default, I think.

The timestamp granularity in emoncms is one second; it casts all incoming timestamps to ints. In that respect, a timestamp for any whole second would be considered the same as any subsequent timestamp up to 999 milliseconds later.
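A quick illustration of that flooring (plain Python, just to show the effect; the timestamps are made up):

```python
# Two readings 989 ms apart collapse to the same one-second
# timestamp once cast to int, as emoncms does on ingest.
t1 = 1511864518.010
t2 = 1511864518.999
assert int(t1) == int(t2) == 1511864518

# A reading in the next whole second is distinct again.
t3 = 1511864519.000
assert int(t3) != int(t1)
```

So any two updates landing inside the same whole second look like a repeated timestamp to emoncms.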

It is unlikely it is going through emonhub. If Node-RED is publishing to MQTT, it will be emoncms, not emonhub, that is subscribing to that data.

IMO emoncms should avoid any attempt to divide by zero, so it's not worth debating/debugging in any great depth where the repeated input came from, if indeed that was the trigger. But it is highly unlikely to be from emonHub unless the emonTH or emonPi are updating at an interval of less than 1 second, because emonHub doesn't repeat timestamps; it creates a new timestamp (in seconds, to 3 decimal places) on receipt of each new packet.

I have no experience with the SolarEdge inverter, nor with any Node-RED flow you might be using. I do have experience of Weather Underground, but not in Node-RED, so I do not know whether any QoS is defined, nor whether any packets can be repeated.

The log shows 7 occurrences over 6 seconds; in fact it looks like they occur in 2 batches of 3, once after 2 seconds and again after 4 seconds. That's quite a bit of repetition: either the QoS is wrong (at least once), or could it be the presence detection? How fast does that update? Does it report on change, or does it report status every few seconds?

If you are only posting data locally, it would be wise to disable the emoncmsorg interfacer by commenting out the type = line, or by removing/commenting out the whole [[emoncmsorg]] section. If that interfacer is active and its data is not being consumed (posted to emoncms.org), it is wasting memory and processing power.
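Something along these lines in emonhub.conf (illustrative only; the exact setting names and values in your file will differ, and the apikey shown is a placeholder):

```
# [[emoncmsorg]]
#     Type = EmonHubEmoncmsHTTPInterfacer
#     [[[init_settings]]]
#     [[[runtimesettings]]]
#         url = https://emoncms.org
#         apikey = xxxxxxxxxxxxxxxx
#         senddata = 1
```

With the whole section commented out, emonhub simply never starts that interfacer thread.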


Hi Paul,

The presence script is run via a timer every 10 seconds in Node-RED; it pings my phone, checks whether its Bluetooth transceiver is within range, and then sends a single message via MQTT if my phone is present.

I will disable the emoncmsorg section; as you mentioned, there is no point in having it configured and wasting resources if I'm not using it. I will just have to remember to enable it again if I decide to send something to the emoncms cloud later. :stuck_out_tongue:

I would agree with you that emoncms should not attempt to divide by zero in any case, so that should probably be looked into; I was just trying to figure out where those two updates within less than a second came from.

I will check the QoS settings and see if any of those are set up wrong.

I have raised a PR for the change I proposed to be reviewed.

Another possible explanation: how many whaccumulator processes do you have set up? If you had (for example) 4 in one input's processlist, it would only need that one input to be updated once to trigger all 4 whaccumulator processes, and each would be expected to fail the same way, since the same timestamps are used. This wouldn't explain why the interval between batches is only 2 seconds, though.
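To illustrate the fan-out (a hypothetical sketch, not emoncms internals): one input update against a processlist containing four whaccumulator entries would yield four identical failures, because every process sees the same value and timestamp pair:

```python
def run_processlist(processlist, value, timestamp, last_timestamp):
    # Each process in the list receives the identical (value, timestamp)
    # pair, so a repeated timestamp fails once per configured process.
    warnings = []
    for proc in processlist:
        if proc == 'whaccumulator' and timestamp - last_timestamp == 0:
            warnings.append('Division by zero')
    return warnings

w = run_processlist(['whaccumulator'] * 4, 790.1, 1511864764, 1511864764)
print(len(w))  # one warning per whaccumulator entry
```

That would explain a batch of identical warnings at a single timestamp, but not the 2-second gap between batches.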

Indeed, but that is the go-to place for enabling/disabling/configuring, so hopefully it is not too hard to remember; you will need to add the apikey there too.
