ESP8266 - questions about possible problems & reliability

We are going to review and do further testing on that configuration and will only send out units with the external adapter for now, which allows removal.

Thanks @Bramco

Good to know. I’ll note this as an aspect of the firmware to look into further.

:thumbsup:

For anyone programming the ESP8266, I recommend reading this: “My ESP crashes running some code. How to troubleshoot it?” (ESP8266 Arduino Core 3.0.2-25-g3f5a76cc documentation)

I’ve verified that one string in the screenshots in the doc Nuno linked does cause a Save to EEPROM (as reported), so it’s possible that an error log was produced by the ESP. I haven’t yet spotted the string that will cause the AC voltage calibration to go to zero (a solitary ‘k’ will do it).

Wish I’d had this a few years ago!! Great guide and also the examples of what an esp can output on the serial port when the esp firmware or libraries hit an issue. These dumps are almost certainly what people are seeing being sent to the emonTXs.

Simon

Hi Robert,

To save a massive amount of effort won’t it be enough to make sure NOTHING can get back to the emonTX by removing the link?

That way, if anyone reports this, the answer will always be - if you’ve built a DIY version, remove the link and all should be good - although you might want to check out why the esp8266 is resetting…

Simon

It’s a bit late for that. :cry:

Too much time on your hands mate :wink:

Simon

It looks like a real problem here (though not THE problem) is that it’s not obvious on the Shop page that it isn’t a standard ESP8266 with a standard header soldered in. The link to the forum page shows a picture, and if you already know the receive connection isn’t there, you can tell it’s not hidden in the shadows; otherwise you could easily think it was. Moreover, the text is downright misleading, because it specifically says “Alternatively, 6 way ribbon cable with the RX/TX lines swapped could be used, and may be a bit easier to put together.” Now if that doesn’t state that the receive line is connected, I don’t know what it does say.
So from all that, I had no reason to think that data wasn’t coming into the emonTx from the ESP8266.

Worth noting that emonESP and openEVSE use what is by now an old version of the Espressif core. Tasmota is kept up to date. There are a gazillion bug fixes between the two.

Short delays like the ones used in the Timer branch are actually OK. The delay() function on the ESP8266 calls yield(), which makes sure background tasks are attended to. None of the delays are long enough to trigger the watchdog.

edit: omg, just thought the delays in the (what was called) Timer branch actually improved stability because of the call to yield(), that could be it.
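To make the yield() point concrete, here is a toy host-side model in plain C++ (not firmware, and not how the ESP8266 core implements its watchdog). It only illustrates why a loop that regularly hands control back tends to survive where a long uninterrupted loop does not. The class name, function name and the timeout passed in are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a software watchdog. NOT the ESP8266 core's implementation;
// every name here is made up for illustration.
class SoftWatchdog {
public:
    explicit SoftWatchdog(uint32_t timeoutMs) : timeoutMs_(timeoutMs) {
        assert(timeoutMs > 0);  // a zero timeout would make the model meaningless
    }

    // User code holds the CPU for `ms` without returning to the system.
    void busy(uint32_t ms) {
        sinceFedMs_ += ms;
        if (sinceFedMs_ > timeoutMs_) tripped_ = true;
    }

    // Stands in for yield(): background tasks run and the watchdog is fed.
    void feed() { sinceFedMs_ = 0; }

    bool tripped() const { return tripped_; }

private:
    uint32_t timeoutMs_;
    uint32_t sinceFedMs_ = 0;
    bool tripped_ = false;
};

// Runs `iterations` chunks of work, each `workMs` long, optionally yielding
// after each chunk. Returns true if the modelled watchdog would have fired.
bool loopTripsWatchdog(uint32_t timeoutMs, int iterations, uint32_t workMs,
                       bool yieldBetweenChunks) {
    SoftWatchdog wd(timeoutMs);
    for (int i = 0; i < iterations; ++i) {
        wd.busy(workMs);
        if (yieldBetweenChunks) wd.feed();
    }
    return wd.tripped();
}
```

With an assumed 3000 ms timeout, 100 chunks of 50 ms trip the modelled watchdog when run back to back, but not when each chunk is followed by a yield.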

Sorry to jog this one again but…

The thread on esp32 losing wifi has resurfaced. Given that emonESP is based on the openEVSE software, could this be a coincidence? Is there something in the way the system works that causes both systems to reset?

And thanks for jogging my memory Dan about some of the esp issues.

Simon

Really hard to say, isn’t it? I don’t know if anyone here knows what’s happening lower down in the wifi code, how that might interact with stuff higher up, and how the compiler is really bringing the code together.
We could find a way to have a testing process. It might take a while. It might turn out to be a hardware or router issue in the end, and there may be no relationship between the specific esp32 and esp8266 issues.

I’ve found the wifi library to be very stable and it maintains the connection without fail, router restarts, power fails etc.

I remember in the very early stages of writing my code, I had lots of checks on the wifi status in my code with routines to reconnect etc. and also had lots of resets. Clearing all that checking out of my code and relying on the esp wifi library to manage things solved the resets.

If I have time I’ll go through my code and the emonESP code to see if there’s anything that jumps out.

Trying to picture what a testing setup could be… one way could be to have really minimal wifi connection code with nothing else apart from debugging info. The info is sent somewhere we can look at it later. Then we find a context where the wifi is dropping, and add emonESP code each day or two until it breaks (or works?!).
Could take weeks but possibly worth it.
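As a rough idea of what the "debugging info" in such a minimal build might look like, here is a host-side C++ sketch that formats one diagnostic sample as a single text record in InfluxDB line-protocol style. The struct, the field names and the `esp_diag` measurement name are assumptions for illustration. On a real ESP8266 the values could be read with ESP.getFreeHeap(), WiFi.RSSI() and millis().

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical diagnostic sample. Field names are illustrative only.
struct DiagSample {
    std::string device;  // unit identifier, assumed to contain no spaces or commas
    uint32_t freeHeap;   // free heap in bytes
    int32_t rssiDbm;     // WiFi signal strength in dBm
    uint32_t uptimeMs;   // milliseconds since boot
};

// Formats one sample as: measurement,tag=value field=1i,field2=2i
// Integer fields carry the "i" suffix; no timestamp is appended, so the
// receiving server would assign its own.
std::string formatDiagLine(const DiagSample& s) {
    assert(!s.device.empty());  // an empty tag value is not a valid record
    return "esp_diag,device=" + s.device +
           " free_heap=" + std::to_string(s.freeHeap) + "i" +
           ",rssi=" + std::to_string(s.rssiDbm) + "i" +
           ",uptime_ms=" + std::to_string(s.uptimeMs) + "i";
}
```

Logging uptime alongside heap and RSSI means a reset shows up as uptime dropping back towards zero, which helps separate resets from plain disconnects when looking at the data later.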

Apart from delays and yields I could only think it was something to do with the way wifi is being set up (link to timer branch merge). Is the emonESP still stable or are there lots of drop outs? My understanding is this thread was started because of a Tx-ESP interaction on the UART port.

FWIW in my experience, with thousands of them running all over the world, there was a sea change in WiFi integrity with the latest core and, in particular, the introduction of lwip2. Where there previously were heap memory leaks and disconnects, with this core/lwip2 they are practically bulletproof.

Specifically, with multiple units running on the same WiFi network, unknowable events on the WiFi seem to affect all of the units at the same time, causing disconnects and/or loss of heap. I know this because they are all logging diagnostic information to influxDB continuously. The failures could not be caused by one ESP unit as the problems with all units were coincident. While I can still see coincident events happening across all of the units, there is no lasting effect and they all now recover their heap and do not disconnect.
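For anyone wanting to check their own logs for this kind of coincidence, a minimal C++ sketch of the comparison follows. It assumes the event timestamps (disconnects, heap drops, or whatever is being tracked) have already been exported per unit. The function name, the data layout and the time window are hypothetical, and this is not the analysis described above, just one simple way to approximate it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// eventsByUnit[u] holds the timestamps (in seconds) at which unit u logged
// an event. Returns those timestamps from unit 0 for which EVERY other unit
// also logged an event within +/- windowSec. Needs at least two units.
std::vector<int64_t> coincidentEvents(
    const std::vector<std::vector<int64_t>>& eventsByUnit, int64_t windowSec) {
    assert(windowSec >= 0);
    std::vector<int64_t> result;
    if (eventsByUnit.size() < 2) return result;

    for (int64_t t : eventsByUnit[0]) {
        bool seenOnAllUnits = true;
        for (std::size_t u = 1; u < eventsByUnit.size() && seenOnAllUnits; ++u) {
            bool found = false;
            for (int64_t other : eventsByUnit[u]) {
                int64_t diff = other - t;
                if (diff < 0) diff = -diff;
                if (diff <= windowSec) { found = true; break; }
            }
            seenOnAllUnits = found;
        }
        if (seenOnAllUnits) result.push_back(t);
    }
    return result;
}
```

Events that line up across every unit point towards something on the network side, while events seen on only one unit point back at that unit.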

Without a doubt, the version control available in PlatformIO is critical in isolating and comparing long term experience and issues with various software components and related libraries. I’m using the latest core published to PIO as 2.4.0.

It’s not quite clear whether your experience, in the specific case of your wifi network, was due to the core upgrade. Is this what you mean?
How exactly do you know lwip2 improved stability? Are you looking at your users’ debugging data between upgrades?

When I upgraded my emonDC (emonESP pre-timer branch) core I had to make some changes to make it compile. I also peppered my code with yield() and changed the wifi setup procedure. No real tests yet. Doesn’t really matter so far.

Edit: what I’m getting at is that we need evidence that specific upgrades make something more stable.

FWIW was “for what it’s worth”.

Yes. Pushing out a release with a new core and lwip2 to a thousand machines eliminated heap degradation problems and a class of disconnects related to weak WiFi RSSI (say, weaker than -75 dBm). There were minimal and unrelated changes elsewhere.

This was observed both in the perpetual farm of test systems as well as in field problem reports. The firmware is well instrumented with diagnostic trace and logging, as well as the previously mentioned influxDB diagnostic information collected from the perpetual farm of test systems. Field issues related to WiFi (or anything else for that matter) are few and far between.

It’s true, the newer core also has some incompatibilities, but these were easily found and accommodated. Most were related to different failure modes and arbitrary results from what was either not or ambiguously documented. My sense is that all of those issues are becoming more, rather than less, logical.

FWIW.

I think the esp32 has an issue with wifi signal strength. If I have a weak signal, the unit will lock up after ~24 hours, but with an access point next to it I have not seen the problem in 3 days.