EmonTx4 DS18B20 Temperature sensing & firmware release 1.5.7

EmonTx4 firmware release 1.5.7 is now available, providing improved temperature sensing support that mitigates most of the observed interference with electricity monitoring. The following posts discuss:

  1. The observed effect, comparing firmware versions up to 1.5.4 with 1.5.7.
  2. The mechanism behind the error.
  3. The improved implementation in firmware version 1.5.7 which, based on modelling of the code, reduces the introduced error to ~0.04% with 3 temperature sensors or ~0.07% with 6 temperature sensors.
  4. Plans to improve the implementation further with future hardware.

Version 1.5.7 of the EmonTx4 firmware is now available via the emonPi/base firmware updater tool. If you prefer to compile and upload it yourself, the firmware can be downloaded here: emontx4/firmware at main · openenergymonitor/emontx4 · GitHub

Before uploading the firmware, either run a full emonPi / emonBase update or update EmonScripts using the emoncms > admin > components update tool.


1. The observed effect

When I first ported the emonTx3 continuous sampling firmware using @Robert.Wall’s emonLibCM library to the emonTx4, I was unaware that temperature sensing with DS18B20 temperature sensors would interfere with the continuous sampling electricity monitoring firmware in a small but noticeable way. I noted the effect in the emonTx4 documentation earlier this year here: https://docs.openenergymonitor.org/emontx4/other_sensors.html.

Example of impact on measurements with the original implementation (EmonTx4 firmware versions up to and including 1.5.4, with 3x temperature sensors connected)

In this example I have an emonTx4 with 3x DS18B20 temperature sensors connected and 3 CT sensors, plugged into CH1, CH3 & CH5. The load is a 3kW heater. CH2, CH4 & CH6 do not have CT sensors connected, yet they are registering up to 4W of consumption, a 0.13% error.

{"MSG":18,"Vrms":241.22,"P1":3100,"P2":3,"P3":3102,"P4":2,"P5":3103,"P6":3,"E1":23,"E2":0,"E3":23,"E4":0,"E5":23,"E6":0,"T1":20.37,"T2":20.50,"T3":20.87,"pulse":0}
{"MSG":19,"Vrms":241.10,"P1":3096,"P2":3,"P3":3097,"P4":2,"P5":3098,"P6":4,"E1":31,"E2":0,"E3":31,"E4":0,"E5":31,"E6":0,"T1":20.37,"T2":20.50,"T3":20.87,"pulse":0}
{"MSG":20,"Vrms":241.17,"P1":3095,"P2":1,"P3":3096,"P4":2,"P5":3098,"P6":3,"E1":40,"E2":0,"E3":40,"E4":0,"E5":40,"E6":0,"T1":20.37,"T2":20.50,"T3":20.87,"pulse":0}

With no load, an error appears only on CT6; here we again see an error of up to ~4W:

{"MSG":2,"Vrms":244.59,"P1":0,"P2":0,"P3":0,"P4":0,"P5":0,"P6":2,"E1":0,"E2":0,"E3":0,"E4":0,"E5":0,"E6":0,"T1":20.37,"T2":20.50,"T3":20.87,"pulse":0}
{"MSG":3,"Vrms":244.53,"P1":0,"P2":0,"P3":0,"P4":0,"P5":0,"P6":4,"E1":0,"E2":0,"E3":0,"E4":0,"E5":0,"E6":0,"T1":20.25,"T2":20.50,"T3":20.87,"pulse":0}
{"MSG":4,"Vrms":244.60,"P1":0,"P2":0,"P3":0,"P4":0,"P5":0,"P6":3,"E1":0,"E2":0,"E3":0,"E4":0,"E5":0,"E6":0,"T1":20.37,"T2":20.50,"T3":20.87,"pulse":0}

Example of measurements with improved implementation in the 1.5.7 firmware release:

As above, emonTx4 with 3x DS18B20 temperature sensors connected and 3 CT sensors, plugged into CH1, CH3 & CH5. The load is a 3kW heater. CH2, CH4 & CH6 do not have CT sensors connected. They are now registering 0W correctly.

{"MSG":12,"Vrms":241.85,"P1":3107,"P2":0,"P3":3105,"P4":0,"P5":3105,"P6":0,"E1":63,"E2":6,"E3":60,"E4":9,"E5":58,"E6":11,"T1":20.37,"T2":19.87,"T3":20.37,"pulse":0}
{"MSG":13,"Vrms":242.36,"P1":3116,"P2":0,"P3":3114,"P4":0,"P5":3117,"P6":0,"E1":71,"E2":6,"E3":69,"E4":9,"E5":67,"E6":11,"T1":20.37,"T2":20.00,"T3":20.37,"pulse":0}
{"MSG":14,"Vrms":242.87,"P1":3126,"P2":0,"P3":3126,"P4":0,"P5":3128,"P6":0,"E1":80,"E2":6,"E3":77,"E4":9,"E5":75,"E6":11,"T1":20.50,"T2":20.00,"T3":20.37,"pulse":0}

With no load, CT6 is also showing 0W correctly:

{"MSG":27,"Vrms":242.50,"P1":0,"P2":0,"P3":0,"P4":0,"P5":0,"P6":0,"E1":101,"E2":6,"E3":99,"E4":9,"E5":97,"E6":11,"T1":20.62,"T2":20.25,"T3":20.50,"pulse":0}
{"MSG":28,"Vrms":244.34,"P1":0,"P2":0,"P3":0,"P4":0,"P5":0,"P6":0,"E1":101,"E2":6,"E3":99,"E4":9,"E5":97,"E6":11,"T1":20.62,"T2":20.12,"T3":20.50,"pulse":0}
{"MSG":29,"Vrms":244.70,"P1":0,"P2":0,"P3":0,"P4":0,"P5":0,"P6":0,"E1":101,"E2":6,"E3":99,"E4":9,"E5":97,"E6":11,"T1":20.62,"T2":20.12,"T3":20.50,"pulse":0}

I will dive into the mechanism behind the error in the next post.

2. The mechanism behind the error

Temperature sensing using the one-wire DS18B20 temperature sensors that we use relies on precise timing to signal the digital 1s and 0s. This communication and timing is implemented in software using a bit-banging technique.

Here’s an example of what the OneWire write bit function looks like:

void OneWire::write_bit(uint8_t v)
{
	IO_REG_TYPE mask IO_REG_MASK_ATTR = bitmask;
	volatile IO_REG_TYPE *reg IO_REG_BASE_ATTR = baseReg;

	if (v & 1) {
		noInterrupts();
		DIRECT_WRITE_LOW(reg, mask);
		DIRECT_MODE_OUTPUT(reg, mask);	// drive output low
		delayMicroseconds(10);
		DIRECT_WRITE_HIGH(reg, mask);	// drive output high
		interrupts();
		delayMicroseconds(55);
	} else {
		noInterrupts();
		DIRECT_WRITE_LOW(reg, mask);
		DIRECT_MODE_OUTPUT(reg, mask);	// drive output low
		delayMicroseconds(65);
		DIRECT_WRITE_HIGH(reg, mask);	// drive output high
		interrupts();
		delayMicroseconds(5);
	}
}

The crucial part to note here is that it disables interrupts with noInterrupts(), delays for up to 65 microseconds and then enables interrupts again with interrupts().

The one wire reset command has a similar 70 microsecond period with interrupts disabled.

Electricity monitoring on the EmonTx4 is implemented using ADC continuous conversion. The ADC samples in the background, firing an interrupt service routine (ISR) when each sample is ready. The configured sample period on the EmonTx4 is 39.4 microseconds per ADC sample, so the ISR is called every 39.4 microseconds with the next sample ready for processing.

When the OneWire temperature sensing code disables interrupts, it blocks the interrupt generated by the ADC for a period of 65-70 microseconds, which is longer than the 39.4 microsecond sample period. This in turn causes the ADC code to lose its place in the sequence of channels that it is sampling.

The ~4W of consumption on CT6 in the example above is caused by a small number of samples from the voltage channel, which was the next one in line to sample, being allocated to CT6 incorrectly.

With 3kW on CT1, CT3 & CT5 we again see the effect of the ADC code losing its place and allocating samples incorrectly. The effect is relatively small because the correct allocation is regained by the next sample and the vast majority of samples are allocated correctly.
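The timing argument above can be checked with a small host-side model. This is a simplified sketch rather than code from emonLibCM: it assumes conversions complete at a fixed 39.4 microsecond interval, that only one completed result can wait for the ISR while interrupts are disabled, and that any further completion in the same window goes unserviced. The function names are hypothetical.

```cpp
#include <cstdint>

// Times are in nanoseconds so the arithmetic stays exact.
constexpr std::int64_t kSamplePeriodNs = 39400;  // 39.4 us per ADC sample (from the post)

// Number of conversions that complete while interrupts are disabled.
// offsetNs:   time since the last completed conversion when the blackout
//             starts (0 <= offsetNs < kSamplePeriodNs).
// blackoutNs: how long interrupts stay disabled.
constexpr std::int64_t conversionsDuringBlackout(std::int64_t offsetNs,
                                                 std::int64_t blackoutNs) {
    return (offsetNs + blackoutNs) / kSamplePeriodNs;
}

// Assumption: one completed result can wait for the ISR; every additional
// completion in the same blackout goes unserviced.
constexpr std::int64_t unservicedConversions(std::int64_t offsetNs,
                                             std::int64_t blackoutNs) {
    const std::int64_t completed = conversionsDuringBlackout(offsetNs, blackoutNs);
    return completed > 1 ? completed - 1 : 0;
}
```

Under these assumptions a 10 microsecond blackout (the write-1 slot) never leaves more than one result pending, whereas a 65 microsecond write-0 slot or a 70 microsecond reset window can leave one conversion unserviced, depending on where in the sample period the blackout starts.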

3. The improved implementation in firmware version 1.5.7

To get around the issue of the ADC code losing its place and allocating samples incorrectly, I replaced the disabling of interrupts in the OneWire code with a flag. This flag, e.g. onewire_active = true;, is picked up by the ADC ISR code and used to skip the processing of a sample while the OneWire code requires precise timing. This allows the ADC code to keep its place, removing the error caused by misallocation, but in order to exit the ISR as quickly as possible we need to compromise by discarding that particular sample.

void onewire_write_bit(uint8_t v)
{
	IO_REG_TYPE mask IO_REG_MASK_ATTR = bitmask;
	volatile IO_REG_TYPE *reg IO_REG_BASE_ATTR = baseReg;

	if (v & 1) {
		//noInterrupts();
		onewire_active = true;
		DIRECT_WRITE_LOW(reg, mask);
		DIRECT_MODE_OUTPUT(reg, mask);	// drive output low
		delayMicroseconds(10);
		DIRECT_WRITE_HIGH(reg, mask);	// drive output high
		//interrupts();
		onewire_active = false;
		delayMicroseconds(55);
	} else {
		//noInterrupts();
		onewire_active = true;
		DIRECT_WRITE_LOW(reg, mask);
		DIRECT_MODE_OUTPUT(reg, mask);	// drive output low
		delayMicroseconds(65);
		DIRECT_WRITE_HIGH(reg, mask);	// drive output high
		onewire_active = false;
		//interrupts();
		delayMicroseconds(5);
	}
}
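The ISR side of this scheme is not shown above. As a rough illustration only, the host-side sketch below models an ISR that always advances its channel index but discards the sample when the flag is set. The struct, the function names and the seven-slot sequence (one voltage plus six CT channels) are assumptions made for the example and are not taken from emonLibCM.

```cpp
#include <cstdint>

constexpr int kChannels = 7;  // assumed sequence: 1 voltage + 6 CT inputs

struct AdcState {
    int nextChannel = 0;         // channel the next ADC result belongs to
    std::int32_t processed = 0;  // samples passed on for processing
    std::int32_t discarded = 0;  // samples dropped while OneWire was active
};

// Model of one ADC "result ready" interrupt.
// oneWireActive stands in for the onewire_active flag from the post.
constexpr AdcState adcIsr(AdcState s, bool oneWireActive) {
    if (oneWireActive) {
        ++s.discarded;  // fast path: drop the sample, do no maths
    } else {
        ++s.processed;  // normal path: the real ISR would accumulate here
    }
    // Always advance, so the channel allocation never falls out of step.
    s.nextChannel = (s.nextChannel + 1) % kChannels;
    return s;
}

// Run `count` interrupts; those with index in [skipFrom, skipTo) see the flag set.
constexpr AdcState runIsr(int count, int skipFrom, int skipTo) {
    AdcState s{};
    for (int i = 0; i < count; ++i) {
        s = adcIsr(s, i >= skipFrom && i < skipTo);
    }
    return s;
}
```

For example, runIsr(14, 3, 5) models two full passes through the sequence with two interrupts arriving while the flag is set: two samples are discarded, twelve are processed, and the channel index ends back at 0, so no sample is credited to the wrong channel.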

With 3x temperature sensors this approach results in about 450 discarded samples out of 253,807 samples in every 10s period (0.17%).
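These figures follow from the 39.4 microsecond sample period. A minimal worked check, in which the function names are illustrative and 450 is the approximate count quoted above:

```cpp
#include <cstdint>

// Whole ADC samples in a reporting period, using integer nanoseconds.
constexpr std::int64_t samplesPerPeriod(std::int64_t periodNs, std::int64_t sampleNs) {
    return periodNs / sampleNs;
}

// Discarded samples as a percentage of all samples in the period.
constexpr double discardedPercent(std::int64_t discarded, std::int64_t total) {
    return 100.0 * static_cast<double>(discarded) / static_cast<double>(total);
}

// 10 s reporting period divided by the 39.4 us sample period.
constexpr std::int64_t kTotalSamples = samplesPerPeriod(10000000000LL, 39400);

// "About 450" discarded samples with 3 temperature sensors.
constexpr double kDiscardedPercent = discardedPercent(450, kTotalSamples);
```

kTotalSamples evaluates to 253,807 and kDiscardedPercent to a little under 0.18%, in line with the figure quoted above given that the count of 450 is approximate.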

This is what that looks like in terms of the distribution of skipped/discarded samples:

We can see that the majority are lost in the first 100ms; a small number are lost at 1.7s, which is when the command to start conversion is issued.

Zooming in on the first 120ms, this is what it looks like:

This is the distribution of skipped ISR calls on a per channel basis:

and the ADC results for the skipped ISRs (3.8kW load 20A CT sensors):

The starting position and portion of the waveform that is lost changes each time; another run gave:

[image: ADC results for the skipped ISRs, second run]

We can see that there is a danger of skipping an unbalanced number of samples from the bottom or top half of the waveform. The timing could be adjusted to distribute the skipped samples more evenly so that we lose roughly similar numbers of + and - samples.

However, simulating the effect of these missed samples without redistributing them suggests that the error introduced is small even if unbalanced. With the discarded samples concentrated on either the positive or negative peak, the following simulation suggests a 0.035% error with 3x temperature sensors and a ~0.065% error with 6x temperature sensors. I have rounded these up to 0.04% and 0.07% in the first post above.

emonlibcm_sim.zip (1.4 KB)

Overall I’m happy with the improvement that this approach provides. A potential 0.04-0.07% error in the context of a hardware component tolerance error of ~1.2% is not that much, but that’s obviously a judgment that I cannot make for everyone, so I have tried to present this here as transparently as possible.

If you only use your EmonTx4 for electricity monitoring you can of course ignore all of this, as the EmonTx4 disables the OneWire temperature sensing code and with it any interference effect or introduced error from skipped samples.

The next post discusses our plans to improve this further so that there is no compromise in future hardware.

4. Plans to improve the implementation further with future hardware.

  • emonPi2: We are currently working on a new emonPi2 design which will be feature identical to the EmonTx4 (3 voltage inputs, 6x CT channels, extender for another 6 CT sensors). The main difference is that it connects directly to a Raspberry Pi (via the standard GPIO header layout), all in a single unit. Rather than handling temperature sensing on the AVR-DB microcontroller, the one-wire bus will be controlled by the Raspberry Pi, so the AVR-DB can be dedicated to the electricity monitoring code. There’s more on the emonPi2 here: https://github.com/openenergymonitor/emonpi2.

  • Further down the line we may explore a design that does allow both temperature sensing and electricity monitoring to run on a single core using e.g. the SAMD21 microcontroller, though we may switch to the SAMD21 for other reasons and still use the Pi for temperature sensing.

  • The existing EmonTx4 design does break out the OneWire bus. With the right software support this could in theory be controlled by an attached PiZero or ESP8266, in much the same way that we are planning for the emonPi2. This does however introduce complexity and cost in terms of the additional hardware for a relatively small improvement, given that the error expected from the code above is really quite small.

EmonTx4 3Phase firmware
Robert Wall will soon be releasing emonLibDB: a 3 phase, 12 CT compatible library and firmware for the EmonTx4. Due to more complex phase calibration and other factors we have decided that, for now, we will not attempt to support temperature sensing as part of this library. Temperature sensing will only be available on the single-phase 6 CT EmonTx4 firmware, which will be maintained alongside the new firmware. The emonPi2 will then allow temperature sensing alongside 3 phase & 12 CTs, as the Pi will handle temperature sensing directly.

Presumably they need the modified 1wire and EmonLibCm libraries as well?

Have you checked how things look on the 1wire bus? Trying to bit-bang a 10 usec pulse with interrupts enabled will be challenging no matter how quickly the ISR exits as a result of the flag. 10 usecs isn’t many cycles and it will cost quite a few to preserve state, get into the C handler to check the flag then back out and restore everything.

delayMicroseconds() very carefully counts instruction cycles to busy wait just the right amount of time. I think jumping to an ISR during that will blow it all out of the water.

The other consideration is that the ADC interrupt isn’t the only one active - the standard Arduino runtime is likely to have timer and uart interrupts potentially firing during the 10 usec pulse generation.

Thanks @dBC I will get back to you about this in more detail soon.

A quick note on the library requirements in the meantime. The modified one wire code (reset, write and read) is included in the EmonLibCM library (specifically my avrdb branch of EmonLibCM). The OneWire library itself does not therefore need to be changed, though it does still need to be included as it is still used for the initial device discovery.

Very interesting @TrystanLea

In terms of retaining information, might I suggest this would be better as a Blog, linked to a Forum Post much like Home Assistant do for their announcements?

It would help with the retention of information and the ‘now why did we do that?’ question.

Just a thought.

Who reads blogs?

I will link this thread into the EmonTx4 documentation. That way an interested user can follow any further discussion that @dBC and I or anyone else get into here. I will get back to you @dBC soon. Will capture and post what I can see on the 1 wire bus, amongst other tests.

Well, me sometimes. It is the mechanism used by HA for their release notifications. It means you can actually find this sort of information quite easily and stops it being lost in the blizzard of posts (as usually happens).

We seem to be using the blog less and less. It feels like it makes a lot more sense to post updates on the forums. Perhaps what we need is a list of relevant forum posts in chronological order somewhere… So that the posts can be navigated like a blog in a sense…

Anyway, we are getting off topic! If we need to expand on this further let’s create another thread 🙂

Hello @dBC, I had a chance this morning to capture what the oscilloscope is seeing. I have one trace looking at what is happening on the DS18B20 bus and another trace just flagging when we enter and exit one of those 10us write periods.

Here’s the result:

First an apparently unobstructed 10us pulse:

then I assume one that is delayed extending to just under 20us:

The temperature measurements are stable. See longer test below.

If I experiment with extending delayMicroseconds(10), I continue to receive valid temperature measurements up to delayMicroseconds(17), at which point I start to see a 304 response from one out of six temperature sensors. At 18us there are usually about 2x 304 responses, at 19us about 5/6 show 304, and at 20us 6/6.

Looking at the oscilloscope, delayMicroseconds(20); can result in up to 28-30us on the digital pin.

At the other end of the scale it seems that it is possible to reduce to delayMicroseconds(0) and it keeps returning valid temperature measurements on all 6 sensors. So perhaps if stability is an issue reducing the value might be an option…

I ran an extended test over the last 5 days with 6 temperature sensors and 20m of cat 5 cable and a 6 way RJ45 breakout and didn’t get a single 304 or other error. It does seem to be stable. At least with the DS18B20 sensors that I have here - which is probably an important caveat!

Hi Trystan, good timing (pardon the pun ;-)… I got around to having a look at this myself yesterday, but didn’t have time to do a write up of what I found, so here goes.

Firstly I don’t have an AVR-DB so dusted off an old 16MHz Uno for these measurements. Assuming you’re running at 24MHz then my results will be 1.5x worse than yours, and since you’re running on the real h/w it’s your results that matter.

To remove the randomness/luck I wrote a main loop whose sole purpose was to generate 10 usec pulses pretty much as fast as it can. There’s no 1wire, and no devices… just an IO pin and a scope.

while (1) {
    // noInterrupts();
    PORTD &= 0xfb;                                  // digitalWrite(2, LOW);
    delayMicroseconds(10);
    PORTD |= 0x4;                                   // digitalWrite(2, HIGH);
    // interrupts();
    if (millis() > 10000) onewire_active = true;
}

I then set up the ADC in free running mode and copied large parts of your interrupt handler including the fast path when onewire_active is true. To avoid having the compiler optimise that test away, I don’t set it true until the system has been running for 10 seconds as seen in the code above.

With the interrupt protection enabled I get superb pulses…


The variations are down in the psecs, even after measuring 30K pulses. I do note that the pulses are closer to 9 usecs than 10 but I’m not the first to observe that.

Then I removed the interrupt protection and got…


The vast vast majority of pulses are still 8.876 usecs but whenever the ADC interrupt fires that blows out to 15.570 usecs. The scope stats make it look like they’re all that wide but that’s only because I’ve triggered on the pulse being wider than 10 usecs, effectively filtering all the good 8.876 usec pulses out of the stats. I think that works out at an additional 107 cycles to process the ADC fast path ISR. On a 24MHz core that ought to add an additional 4.5 usecs to the pulse.

More alarming was the 21.8 usec max in those stats, so I changed the trigger to capture pulses > 20 usecs:


They arrive at quite a clip - about every 20 msecs. I’m pretty sure that’s the timer0 stuff that the Arduino runtime does to maintain millis() - it has short paths and long paths depending on when it needs to adjust fractions. I think that works out at 202 cycles which would add about 8.4 usecs to your pulse on a 24MHz core. That might be the one you captured at just under 20 usecs.
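The cycle counts and 24MHz estimates above can be reproduced with a couple of one-line conversions. This is just a worked check of the arithmetic, with hypothetical function names; it carries over the assumption made above that the same number of cycles would be spent on the 24MHz core as on the 16MHz Uno.

```cpp
// CPU cycles spent in a given time on a core clocked at `mhz`.
constexpr double cyclesFromMicros(double micros, double mhz) {
    return micros * mhz;
}

// Time taken by a given number of cycles on a core clocked at `mhz`.
constexpr double microsFromCycles(double cycles, double mhz) {
    return cycles / mhz;
}

// Extra pulse width when the ADC fast-path ISR fires, measured on the 16 MHz Uno.
constexpr double kAdcExtraUs = 15.570 - 8.876;
constexpr double kAdcExtraCycles = cyclesFromMicros(kAdcExtraUs, 16.0);  // about 107

// The same cycle counts replayed on a 24 MHz core.
constexpr double kAdcExtraOn24MHzUs = microsFromCycles(107.0, 24.0);     // about 4.5 us
constexpr double kTimerExtraOn24MHzUs = microsFromCycles(202.0, 24.0);   // about 8.4 us
```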

So I guess we came to roughly the same measurements via slightly different paths, which is always reassuring. How much impact that has on 1wire reliability in the field is very hard to quantify. There’s a good quote in GUIDELINES FOR RELIABLE LONG LINE 1-WIRE NETWORKS:

Incorrect 1-Wire Timing
When software (firmware) is used to generate 1-Wire waveforms
(sometimes called “bit-banging” the waveform), it is easy to make
mistakes that do not become apparent immediately.
By far the most common mistake made in programming the 1-Wire
master is sampling data from slaves too late after the leading (falling)
edge of the time slot. Slaves can vary in their timing over a wide range
just as temperature and voltage vary. Slaves can also change from
batch-to-batch due to process variations. A design in which the
waveform is sampled at 30µs might pass lab tests and even go into
production, committing the improper timing to shipped products. Later,
when batch or network conditions change and the slaves move from
32µs to 29µs, this master-end interface fails. It is therefore critical that
waveform parameters be verified by the specifications, despite
seemingly perfect system operations in laboratory environments.

👍

Thanks @dBC great to see that we came to roughly the same measurements. I do expect someone will experience some reliability issues arising from this and am happy to explore other solutions as mentioned above if someone does get stuck with this.

And thanks again for taking the time to go into the detail on this! Your knowledge is invaluable!

You’re very welcome.

A bit more probing revealed the cause of the “fat” pulse - it happens when the timer0 interrupt and the ADC interrupt both occur during the one call to delayMicroseconds(). The timer interrupt fires every msec and the ADC interrupt at whatever the sampling rate (~100 usecs on my Uno). Roughly every 20 msecs those two line up close enough to each do their damage to the one call to delayMicroseconds().

In that light… I pulled up the AVR128DB datasheet - their ADC has come a long way since I last used them. You may have considered this and rejected it for other reasons, but in case not…

Rather than setting your flag to tell the ISR to take the fast path I got to wondering if you could just slow the ADC down a bit during those 1wire noInterrupts periods? 33.3.2 lists a bunch of things not to be changed during conversions for fear of unpredictable results and unfortunately the Prescaler is one of them. But SAMPDLY and SAMPLEN aren’t listed. SAMPLEN looks particularly promising. It looks like you can add anywhere from 0 to 255 additional ADC clocks to each sample. Would that slow things down enough to ensure you don’t lose your place in the sequence while 1wire has interrupts disabled?

[EDIT]
OK, thanks to your well commented code I see you currently run with it set to 14 and a divider of 32 to give you your 39.3 usec sample rate:

// DIV 32 = (2+14+13.5)÷(24÷32) = 39.3 us

Assuming you can write to SAMPCTRL on the fly, you’ve potentially got a lot of headroom there to slow things down while interrupts are disabled… right out to 360 usecs at the extreme. I imagine it won’t take effect until the next conversion starts but hopefully that should let you push that second interrupt out to well past when interrupts are re-enabled and the first (delayed) interrupt dealt with.
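As a quick check of those two figures, the formula in the quoted firmware comment can be written as a small function. The function name and default arguments are illustrative; the constants (2 + SAMPLEN + 13.5 ADC clocks, 24MHz core, divide by 32) come straight from that comment.

```cpp
// ADC sample time in microseconds, following the quoted firmware comment:
// (2 + SAMPLEN + 13.5) ADC clocks at an ADC clock of (cpuMhz / prescaler) MHz.
constexpr double adcSampleTimeUs(int sampleLen, double cpuMhz = 24.0, int prescaler = 32) {
    return (2.0 + sampleLen + 13.5) / (cpuMhz / prescaler);
}

constexpr double kCurrentUs = adcSampleTimeUs(14);   // the existing setting
constexpr double kSlowestUs = adcSampleTimeUs(255);  // SAMPLEN at its maximum
```

kCurrentUs comes out at about 39.3 usecs and kSlowestUs at about 360.7 usecs, matching the numbers above.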

Thanks @dBC, an interesting proposition. If I tried setting SAMPLEN in place of the flag, it sounds like I’d then need some way to hold up the one wire code until we are in that longer ADC period? I’m not sure how I would go about that other than delaying by a full 39.2 us before continuing with noInterrupts? Sounds like I’d need to do some experimentation here and look at this on an oscilloscope to work out what is actually happening! 🙂

I don’t think there’d be any need to hold back because you’ll only lose your sequence if a second ADC interrupt arrives before you process the first. Setting SAMPLEN to something big before you go into the 70 usec busy wait pushes out that second interrupt arrival into the distance.

Consider this code snippet…

        ADC0.SAMPCTRL = 255;
        noInterrupts();
        DIRECT_WRITE_LOW(reg, mask);
        DIRECT_MODE_OUTPUT(reg, mask);	// drive output low
        delayMicroseconds(70);
        DIRECT_WRITE_HIGH(reg, mask);	// drive output high
        interrupts();
        ADC0.SAMPCTRL = 14;

Let’s say the ADC is busy working away on sample n when that code runs. That first line tells it: when you start working on sample n+1 I want you to work on it for 360 usecs. Meanwhile, sample n will complete on the old schedule (39 usecs after sample n-1 completed). So it will complete somewhere in that 70 usec blackout zone, worst case right at the beginning of that 70 usec blackout zone. The IRQ for sample n will sit there patiently pending until you come out of the 70 usec delay and re-enable interrupts, at which time you’ll service it normally.

I don’t know how long your ISR service time is but I hear it’s pretty tight, so let’s say maybe 35 usecs? So you spent 70 usecs waiting for the blackout zone to end, then another 35 usecs to process IRQ n. That’s 105 usecs, but interrupt n+1 won’t be arriving for another 255 usecs (360 less 105) so you’ve heaps of headroom.

Definitely one to be studied with a scope. I think you’ll be able to get away with a much smaller SAMPLEN setting than 255 - you could even refine it based on the length of each blackout zone.
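To put rough numbers on that refinement, here is a back-of-envelope sketch. It reuses the sample time formula from the firmware comment quoted earlier in the thread, takes the 35 usec ISR service time as the guess it is, and assumes the worst case described above where sample n completes right at the start of the blackout. The function names are hypothetical and none of this has been checked on hardware.

```cpp
// Sample time model from the quoted firmware comment:
// (2 + SAMPLEN + 13.5) ADC clocks, with a 24 MHz core and a /32 prescaler.
constexpr double adcSampleTimeUs(int sampleLen) {
    return (2.0 + sampleLen + 13.5) / (24.0 / 32.0);
}

// Time left before interrupt n+1 arrives once the blackout has ended and
// interrupt n has been serviced (worst case: sample n completes right at
// the start of the blackout).
constexpr double headroomUs(int sampleLen, double blackoutUs, double isrUs) {
    return adcSampleTimeUs(sampleLen) - (blackoutUs + isrUs);
}

// Smallest SAMPLEN (0..255) that leaves positive headroom, or -1 if none does.
constexpr int minSampleLen(double blackoutUs, double isrUs) {
    for (int len = 0; len <= 255; ++len) {
        if (headroomUs(len, blackoutUs, isrUs) > 0.0) {
            return len;
        }
    }
    return -1;
}
```

With a 70 usec blackout and a 35 usec ISR, headroomUs(255, 70, 35) is a little over 255 usecs, and under this model minSampleLen(70, 35) suggests a SAMPLEN of 64 would already be enough.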

Actually, I see now that changing the sample rate on the fly would wreak havoc with your phase error correction code, so probably a non-starter for that reason alone.

Indeed. Almost everything relates to the ADC interrupt rate.

The newly released emonLibDB records around 3 cycles of samples - I’ve long considered temperature measurement a non-starter given how hard the continuous monitoring requirement is pushing the processor - but I’ve recognised (but not considered in detail) it might be feasible to “replay” a cycle or two after the one-wire bus has occupied the processor’s full attention.

Possibly even worse, the basic ADC rate has to be slowed when there’s a CT measuring line-line on a 3-phase system due to the additional calculations required (the VTs remain in star, so only line-neutral voltages are available).