How to investigate a problem with an emonBase?

For the past little while I’ve been having problems with my OEM system. My difficulty is that the problems are intermittent and don’t seem to have consistent symptoms.

I tend to keep a browser window open on the http://emonpi/emoncms/feed/list web page so I can easily check everything is working correctly. ‘emonpi’ in this case is the Raspberry Pi that is my emonBase. Normally I see the various feeds ticking over, with the time-since-last-data fields updating. Sometimes it stops updating.
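
For anyone wanting to automate that freshness check rather than watch the page, here is a minimal sketch, assuming the emoncms Feed API exposes the same list as JSON at /feed/list.json with a per-feed “time” field (unix seconds of the last update); the apikey is a placeholder for a read-only key:

```
APIKEY="your-read-only-apikey"   # placeholder
curl -s "http://emonpi/emoncms/feed/list.json?apikey=${APIKEY}" |
  jq -r --argjson now "$(date +%s)" \
    '.[] | "\(.name): \($now - ((.time // 0) | tonumber)) s since last update"'
```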

So I also started keeping a couple of ssh sessions open to the pi, one running journalctl -f and the other running tail -f /var/log/emonhub/emonhub.log. They haven’t shown me anything interesting, unfortunately, but they have demonstrated another symptom: sometimes I find them sitting with broken ssh sessions. There’s never anything I can see to indicate why. The feed page continues updating, and if I re-open the ssh sessions I can restart the log monitors.
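
One thing that can help distinguish a client-side drop from the Pi actually going away is to turn on SSH keepalives on the desktops running those sessions: a dead connection then fails within a minute or two with an explicit timeout message rather than silently hanging. A sketch only; the host alias is a placeholder:

```
# ~/.ssh/config on the machine the sessions are opened from
Host emonpi
    ServerAliveInterval 30   # probe the server every 30 seconds
    ServerAliveCountMax 3    # give up after 3 missed replies (~90 s)
```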

Sometimes, like this morning, the ssh sessions are broken and the feed page is no longer updating, but I can get the feed page going again by refreshing it; or sometimes it has logged me out and I need to log in to the website again and then reload the feed page.

But other times I can’t get the feed page working again and I can’t re-open the ssh sessions (no route to host), so I end up rebooting the pi, after which everything goes back to normal for some time.

I’m at a loss to understand what’s going on. I haven’t found anything indicating a problem in any of the logs I’ve looked at. The browser connection and the ssh connections are all over wired ethernet, so it’s not a wireless problem. There’s a UPS-style battery powering the pi, so I don’t think it’s flaky power.

Does anybody have any thoughts as to what might be going wrong or how to investigate the problem? Otherwise I’m tempted to just replace the whole system, which will undoubtedly prove to be a pain. :roll_eyes:

It does seem slightly odd. On what system are the SSH sessions open? Is the host OS going to sleep and breaking the SSH connection?

What sort of issues? Feeds not updating?

You will have to remind me of your hardware setup (EmonPi/EmonTX/pulse sensor etc.). By emonBase do you mean a Pi with an RFM add-on card or the actual EmonPi system?

No, they’re on Linux desktops that don’t sleep or hibernate, and on good days the sessions stay up. (The emonhub tail stops after a while; presumably emonhub rotates its logfile every now and again for some reason?)
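
If the tail is stopping because of log rotation, following the file by name should survive it:

```
# -F (i.e. --follow=name --retry) re-opens emonhub.log after rotation,
# so the monitor keeps going instead of silently following the old file
tail -F /var/log/emonhub/emonhub.log
```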

That’s what most of my post was about - describing the symptoms.

Sorry, it’s an emonBase bought from the shop, i.e. a Pi with a radio card. There are two emonTx associated with it, plus a load of data that I feed in via HTTP. The data flow doesn’t seem to be interrupted except when there’s a major lockup that requires a reboot.

Dave
I’ve had similar issues in the past and tried to work through possible causes, without ever truly identifying what the problem(s) were. I was also feeding in more data via MQTT from other sensors etc.
In my case these issues arose:

- I was getting data clashes from another RFM node that would occasionally lock up the emonBase (every few days, depending on the rate of sampling). Resolved this by changing that sensor to LoRa and running it through another Pi.
- I was running a full backup on the emonBase every night, and noticed the CPU was going to 100% during the “zip” phase, lasting over 30 mins. During this time the system was unresponsive (not helped by also running Node-RED/MQTT on the same Pi). Resolved by upgrading to a Pi4 and moving Node-RED/MQTT to another Pi.

SSH is normally very robust, especially on wired ethernet, so it looks like the Pi is dropping the connection. In my case I suspect it was an overloaded Pi coupled with a poor power supply. Since upgrading to a Pi4 I’ve not had these problems.
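
If replacing the Pi isn’t an option, one interim measure worth trying is to deprioritise the nightly backup so the zip phase can’t starve emonhub. A sketch only; the script path and schedule here are placeholders for whatever the backup cron job actually runs:

```
# crontab entry: run the backup at the lowest CPU priority and in the
# "idle" I/O class so it only gets resources nothing else wants
0 1 * * * nice -n 19 ionice -c3 /home/pi/backup.sh >> /home/pi/backup.log 2>&1
```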

Yes, this could easily be the solution. There are several options for monitoring this. I have had issues with dropped data when there are too many MQTT topics being read (sending the values as key/value JSON reduces this issue, as multiple values come in as one message). I don’t use the EmonPi for MQTT on my production system.
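
To illustrate the difference, something like the following; the host, credentials and node name are placeholders, and it assumes the default emoncms base topic of emon/ with an MQTT input that accepts JSON payloads:

```
# one message carrying several values as key/value JSON
mosquitto_pub -h emonpi -u emonpi -P yourpassword \
  -t 'emon/house' -m '{"power1":230,"power2":115,"vrms":243.2}'

# versus one publish (and one topic) per value
mosquitto_pub -h emonpi -u emonpi -P yourpassword -t 'emon/house/power1' -m '230'
```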

My current go-to setup is an old laptop (built-in UPS) with Proxmox VE installed and emoncms running in an Ubuntu container.

I also offload the Mosquitto instance onto another container.

Very lightweight, and PVE will take a snapshot of the container (more efficient than a file backup).

You described how you have tried to debug the first issue, and the subsequent issues, but what was the main problem that caused you to try tailing the logs in the first place? Missing data?

I’d check using htop for a memory leak; do you have anything else running on the Pi?

OK, I’m running htop. Do I just keep an eye on the Mem usage? It shows 190M/976M at present. I suppose if and when ssh dies it will just show the usage at the moment the connection broke?

And no, there’s just my emon system running on the pi.
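
Since htop can only show the state at the instant the connection broke, it might also be worth logging memory and load to a file on the pi itself, so the history survives a dropped session and can be read back after the next lockup. A minimal sketch; the log path is arbitrary:

```
# crontab entry: append a timestamped memory/load snapshot every minute
* * * * * { date; free -m; uptime; } >> /home/pi/memlog.txt 2>&1
```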

@djh you have still not said what the initial problem was you were trying to explore.

htop will also show if one process is running at high CPU.

Please reread my last message. How many more times do I have to repeat it? “Sometimes it stops updating.”

OK, thanks. Well, the ssh sessions broke again, although the feed page is still updating. htop shows Mem as 186M/976M, lower than before, so that doesn’t seem to be the problem.

For some reason I don’t seem able to select the text in my terminal (?) so here’s a screenshot of the terminal window:

The broken ssh message is at the top, and a prompt on the machine the ssh was running from is just below; it happens to be another pi called RpiBplus. To the right of the prompt and the mouse cursor are some spurious characters that appeared when I tried to select the text in the window. The rest of the screenshot shows what htop was showing when the link broke.

Normally emonhub tends to be at the top of the process list. It looks like the nightly backup jobs were running just after 9 pm. I haven’t noticed the problems occurring at any particular time before, but I’ll keep an eye on that now. I don’t see any other hints as to the problem though. Does anybody else?

Fine, I won’t bother then.

For anybody who’s still reading, I may have solved the problem, although I still don’t understand exactly what caused it.

Brian’s suggestion to run htop led me to [re]discover that I had three cron jobs that all started at the same time, each backing up particular parts of the pi’s filesystems. htop showed a high load average at that time, so I’ve offset the start times of the cron jobs so that only one runs at once. I haven’t seen a problem since doing that, so hopefully that’s fixed it.
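
The change amounts to staggering the start times, something like this (the times and script names are illustrative placeholders, not the actual crontab):

```
# before: all three backups kicked off together at 21:00
#   0 21 * * * /home/pi/backup-a.sh
#   0 21 * * * /home/pi/backup-b.sh
#   0 21 * * * /home/pi/backup-c.sh
# after: staggered so only one runs at a time
15 21 * * * /home/pi/backup-a.sh
45 21 * * * /home/pi/backup-b.sh
15 22 * * * /home/pi/backup-c.sh
```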

Quite why having a lot of jobs ready to run should cause problems for other software running on a Linux system is the part I still don’t understand.