emonTH pulse counting firmware error

digbyt · 13 July 2016 05:47

I have an emonTH bought from the openenergymonitor.com shop (order 11461, 15 Feb 2016) which I believe came with firmware V2.6 installed. It was supplied with a DHT22 and external DS18B20 installed, and I have since added an optical pulse counter via the terminal strip.

After adding the pulse counter, I found that I was getting very large percentage of bad temperature and humidity readings (humidity frequently dipped to 0% for one reading, the DS18B20 would jump to about 85C, and the DHT22 returned a zero temperature value).

I have looked at the firmware source (emonTH_DHT22_DS18B20_RFM69CW_Pulse.ino) and believe that I have found a problem with the way low power time delays are generated by using the watchdog timer to re-activate the ATmega after putting it into a power down mode. The pulse counting interrupts are causing premature reactivation of various critical delays resulting in invalid data being generated.

Is this a known problem? If not, is this the right forum with which to share the fix?

Most of the errors are fixed with the following patch in ‘dodelay()’:

#ifdef INSOMNIA
  while(Sleepy::loseSomeTime(ms) == 0)
    if(insomnia++ > 100)
      break;
#else
  Sleepy::loseSomeTime(ms); // JeeLabs power save function: enter low power mode for x seconds (valid range 16-65000 ms)
#endif

but there are some other problems like an apparent attempt to de-bounce the pulses which doesn’t seem to be able to work (unless I have misunderstood the intention).
.
Regards,
DigbyT

glyn.hudson · 13 July 2016 10:25

Hi Digby,

Thanks for reporting this issue. I’m trying to understand the fix you are suggesting. Would this not have a significant detrimental effect on battery life? I will need to do some further testing.

Has anyone else noticed this issue?

digbyt · 13 July 2016 12:33

Hi Glyn,

If 'Sleepy::loseSomeTime() returns 0 then it returned before the watchdog expired, so the effect of my change should be to extend the sleep if it is interrupted by a pule. There is no information about how much of the delay had elapsed before the interrupt, so the actual duration of an interrupted delay is changed from the range ‘0-d’ to ‘d-2d’ by re-issuing the request. (It seemed less damaging to oversleep than to undersleep). It could be longer still if multiple interrupts occur during one delay request.

The ‘insomnia’ counter is displayed and cleared on the debug line if ‘debug’ is enabled (only when non-zero). This is only seen when optical pulses are occurring. I also added a flag that is used to alert me when the range checks in the existing code that are applied to temperature values have resulted in an old reading being returned. (this could otherwise be masking some occurences of the fault)

The break (when insomnia > 100) should never happen - it is just defensive programming to eliminate the risk of a delay value longer than the interval between pulses being issued and causing dodelay() from ever returning, and hence preventing any updates/diagnostics… (I tested this by holding the pulse sensor against a monitor which easily produced more than 100 interrupts in a few seconds).

So it should have no effect on battery life if there is no pulse counter, and if there is a pulse counter, may well improve it.

There were also a few remaining busy wait ‘delay()’ calls which I was able to change to ‘dodelay()’. The only one that remains is the one to allow the acia to finish transmitting, and that only occurs when debug is enabled.

As an aside, I noticed that, contrary to a comment I read in the firmware description, the serial debug is still defaulting to ‘on’. To simplify my debugging, I re-defined switch 2 in my modified firmware to set the debug state. Changing the switch setting causes the ACIA to be enabled/disabled as appropriate, and the ‘debug’ flag to be changed.Because debug is now normally disabled, I also modified the code to re-instate the LED flash on transmit, but only when the debug flag is set.

So by only having debug enabled when I plug something in that can monitor it, I think my modified firmware will extend my battery life.

I was not able to find any other mentions of this problem in the forum. It was a rather obvious fault, so my guess is that not many people have tried simultaneous pulse counting and temperature measurement measurement on an emonTH, and anyone that did just gave up without mentioning the problem…

Would it help to send you a copy of my hacked version?

glyn.hudson · 13 July 2016 12:53

Yes, please attach to this thread

dBC · 13 July 2016 21:10

Possibly unrelated, but your mention of spurious temperature readings reminded me of this bug that I reported a while back and failed to follow up on: https://community.openenergymonitor.org/t/get-temperature-can-return-a-bogus-25-5c/593. It may have fallen through the cracks.

Robert.Wall · 13 July 2016 22:01

D’Oh! I meant to chase that one down, but I’ve been busy, and it had slipped my memory.

dBC · 13 July 2016 22:26

Actually, I think the deal was I was supposed to remind you, but I tend to report them and move on. Is there a more official channel for reporting coding errors?

Robert.Wall · 14 July 2016 01:15

I don’t think there’s any official channel.

Having been brought up (properly) in industry where there is somebody called the Design Authority, who is responsible for the design, and proper change control procedures where any change requires documentation and justification before being authorised, all of which is then recorded, I find the ad hoc methods here rather - shall we say - strange.

If the author is attributed, then my route is a PM to them. If not, or if there’s no response (and what I was going to do when I got round to it), then you research and develop a fix, fully test and prove it, then send the file with an explanation to G&T to update Github.

glyn.hudson · 16 July 2016 09:49

A github pull request is the best way to submit a bug fix. That way we can review the changes, test, comment and merge the changes once we are happy. This way the github commit change-Log with show the origin of the fix and create an audit trail.

Unlike forums posts which quickly drop off the radar pull requests will not be lost as each open request needs to be manually closed.

dBC · 16 July 2016 11:40

I wasn’t trying to submit a bug fix, I was trying to report a bug. I’ve no idea what the correct solution is, and even if I were prepared to take a stab, I’ve no OEM hardware to test it on. I assume the original author had something in mind for out-of-range temperatures, but forgot to code the else-branch, so we don’t know what.

glyn.hudson · 16 July 2016 16:51

Github issue on the repository in question is the best way to officially
submit a bug report. However it’s a good ideas to post on the forum first
to confirm that it’s actually a bug.

sent from my mobile device

Robert.Wall · 16 July 2016 17:17

I think it is deficient coding - the error handling is poor. If that’s how you define a bug, it’s a bug. @pb66 has suggested expanding the error code, and I’m in complete agreement. I put a temporary ‘bodge’ in the 3-phase sketch, but it’s nowhere near ideal as it’s constrained with the use of getTempC( ).
I have the hardware, all it needs is the time to sit down and read the data sheet, then pick apart the library and (presumably) add the error coding in there - unless it’s in there already and nobody has noticed.
If the consensus is this is more important than the CM library, I’ll tackle this first.

digbyt · 16 July 2016 20:35

Hi Glyn,

Sorry for the slow follow up. I was expecting to get an email when any new posts appeared on this thread, but either I didn’t configure things correctly, or the message went into my spam box

Anyway, I am attaching a copy of the modified version of the emonTH_DHT22_DS18B20_RFM69CW firmware that I have been running since my original post, as requested.

emonTH_DHT22_DS18B20_RFM69CW_Pulse.ino (18.5 KB)

The good news is that since my original post (when I made my changes), I have had no recurrences of the misreadings which were previously occurring a number of times per hour, and I have not observed any newly introduced issues.

On the down side, I am afraid I have made a number of other changes to the code which are not directly related to the fix. so a simple diff won’t suffice to identify the relevent changes. I thought about going back to the original source and just adding fixes for the problem I described, but I didn’t want to post something that had not had adequate testing.

There are, I am afraid, a number of gratuitous stylistic changes added for my own convenience and ease of debugging. I didn’t do that with the expectation of trying to push it back - I know you will want to stick to your own consistent style. For example, I moved some comments from the right of code to their own lines to prevent excessive line wrap on printouts, and put the diagnostic output statements in a separate function so as not to complicate the loop() function with statements not directly part of the main functionality and often not executed. I also like to explicitly declare scope for declarations, and place function definitions before use… (so as not to need explicit prototypes).

Anyway - just thought I would mention that so as not to provoke a debate on something as subjective as coding style.

I think the functional changes are all conditionally included via pre-processor definitions near the top of the source. The changes associated with the prematurely truncated delays are included by defining ‘INSOMNIA’… That should make it reasonably easy to pick out the changes associated with the problem fix for application to your code should you decide it appropriate.

I should also point out that I am not asserting that this is an optimal fix. It is a simple minded fix which resolved the problem for me, but I have not yet finished reading the ATmega328 datasheet, so there may be better/more power efficient ways to correct the problem.

If you would like me to go put go back to the original source and patch in the fixes in a minimally intrusive way suitable for consideration as an update ready for testing, let me know. Otherwise I am happy to leave it to whoever the official maintainer is.

Hope that helps.
Digby

dBC · 16 July 2016 21:09

Robert, which library do you mean? The Arduino Dallas temperature library does return:

hash-define DEVICE_DISCONNECTED_C -127

when there’s a checksum error. get_temperature() accidentally turns that back into a reasonable looking reading of 25.5C.

Robert.Wall · 16 July 2016 21:24

I hadn’t looked at the library, all I looked at was the call to it. But I do know that 85 °C is the default scratchpad value that can be returned, and that needs looking into too. (It’s very hot water, but plausible!)

Are you saying that if there’s a checksum error, the library flags an error somewhere (which is ignored) and the error code is in the variable that normally returns the temperature?

I think I really must check what that library does.

dBC · 16 July 2016 21:35

Did you see my original report (linked above)? It goes into some detail of the failure mechanism, but perhaps not enough, or perhaps not clearly enough?

Robert.Wall · 16 July 2016 21:51

I did read it at the time, but it meant nothing without looking at the library code itself. That’s what I haven’t had time to do yet. From what you’ve written, it looks as if the library call should pass a pointer, which the library fills, and the return value should be a pass/fail code, so you’d use it something like
if (errorval=(GetTemp(&temp1)))
emontx.temp = temp1;
else
emontx.temp = errorval;

There are of course many ways it could be done - that’s a ‘first thought’.

dBC · 16 July 2016 22:03

Yes, I think that would work, but it may be over-complicating things. Perhaps my bug description over-complicated things.

All you really need to know about the library is that it returns a temperature of -127C to indicate a comms problem. That choice doesn’t seem unreasonable given it’s well outside the temperature range of the hardware. It’s easy for the caller to deal with that, but that last line in OEM’s get_temperature() fails to.

Robert.Wall · 16 July 2016 22:48

If we’re changing anything, the “85 °C” problem needs looking at as well.

dBC · 16 July 2016 22:53

Fair enough. I’m pretty sure the library doesn’t know about that problem. It returns whatever it gets back from the h/w or -127C if the checksum fails. Given the total ambiguity of an 85C reading, it may only be the end user who’s in a position to determine if it’s a real or bogus reading.