SAMD21 ADC and emonLibCM

Hi @dBC. Integer, usually - values are scaled up so as not to lose precision. The Cortex-M0 does not have an FPU (you have to go to the M4 & M7 for that), but it does have a single-cycle 32-bit × 32-bit multiply at 48 MHz.

12-bit wide data samples, so V × I could yield at most a 24-bit result (0xFFF × 0xFFF, or decimal 4095 × 4095).

The Arduino C/C++ compiler supports uint64_t data type.

Thanks. I was actually referring to Robert’s AVR code and his battle with the inevitable new data arrival tick, but I guess you’re both working from the same code base.

@Robert.Wall , are you familiar with AVR201: Using the AVR Hardware Multiplier?

If you don’t fancy slipping into AVR assembler, Atmel have provided some nice C inline functions, and you can find a copy of them here, and you can see how @cbmarkwardt used them here.

In that example…

mac16x16_32(vstats.val2_sum,vval,vval); // .. squared voltage

takes two 16-bit values, multiplies them and accumulates the result into a 32-bit variable. There’s also a mac16x16_24() that’s even faster, if you know the summation can fit in 24 bits.
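For anyone reading along without the AVR201 inlines to hand, here is a behavioural sketch in portable C of what mac16x16_32() computes (this is not the real inline assembler - on the AVR it is done with four 8×8 hardware multiplies and no 32-bit carry-checking overhead; the `_c` suffix is mine):

```c
#include <stdint.h>

/* Behavioural equivalent of mac16x16_32(): multiply two signed 16-bit
 * values and accumulate the 32-bit product into a running 32-bit sum. */
static inline void mac16x16_32_c(int32_t *sum, int16_t a, int16_t b)
{
    *sum += (int32_t)a * (int32_t)b;
}
```

mac16x16_24() has the same shape, but keeps only a 24-bit accumulation, which is where the extra speed comes from.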


No - not found that.

Ouch - that’s going to stir up the old - and I mean old - grey matter. It must be around the mid 1980s when I last had serious dealings with assembler.
One doubt I have - I’m not a career programmer obviously (I’m really a systems/applications/projects engineer who just happens to have done a fair bit of programming over the years), would my attempts at assembler be better than a decent compiler? Yes, I know a compiler might think it needs to cater for all eventualities whereas I know the bounds of every value, but even so, I wonder.

24 bit multipliers would suit me, then accumulate into 32 or 64-bit - the plan was 2 stages; 32-bit until it could be close to full, then accumulate the sub-total to 64-bit, which would be about every 100 ms.
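The two-stage plan described above can be sketched like this (a minimal illustration with names of my own invention, not code from emonLibCM): squares go into a fast 32-bit sub-total, which is flushed into a 64-bit grand total before it can get close to full, roughly every 100 ms.

```c
#include <stdint.h>

/* Two-stage accumulator: 32-bit sub-total, flushed periodically
 * into a 64-bit reporting-period total. */
typedef struct {
    int32_t subtotal;   /* fast accumulator, emptied every ~100 ms */
    int64_t total;      /* slow 64-bit reporting-period accumulator */
} acc2_t;

static inline void acc2_add(acc2_t *a, int16_t v)
{
    /* a 12-bit signed sample squared is at most 2^22, so many
     * fit in the 32-bit sub-total before overflow is a risk */
    a->subtotal += (int32_t)v * (int32_t)v;
}

static inline void acc2_flush(acc2_t *a)
{
    a->total += a->subtotal;   /* one 64-bit add per ~100 ms */
    a->subtotal = 0;
}
```

The point of the split is that the expensive 64-bit arithmetic happens once per flush, not once per sample.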

There will be two f.p. operations per power per cycle when I calculate the real power from the ‘partial powers’ accumulated over that cycle.

Thanks for all those pointers, I’ll have to look at them.

Fear not, when you dig through those pointers you’ll see you don’t need to write any assembler. To you, it’ll look like you’re just calling a C function - there’ll be no overhead of a function call though; it’s all inlined.

It’s definitely better than anything you’ll get out of gcc. C will be doing 32-bit operations (on an 8-bit micro), so it’ll be constantly checking for carries etc. These inline functions use the bare minimum of instructions to do exactly what they promise: multiply two 16-bit numbers together and add the result into a 24-bit accumulator, for example. You can’t do that in C. C will just promote everything to 32 bits and assume the worst with regard to dynamic range, so it has to do a lot of carry checks.

hmm… they may not work for you then. If you can take advantage of 16-bit multiplies accumulated into a 32-bit accumulator, they’ll be faster than C.

Have a look at the code generated when you multiply two 32-bit values together in C, on an 8-bit micro. It’s an awful lot of instructions… most of which can be done away with once you know your own dynamic range limits.

Once you’re on a 32-bit CPU like @ozpos, that aspect goes away - it all just becomes a single instruction. But 32-bit arithmetic is hard work for an 8-bit micro.

(N.B. The shorthand “+= → n” means accumulated into an n-bit variable, etc.)

Yes, but 12 × 12 → 32, followed by 32 += → 32 a few times, then 32 +=→ 64 is surely faster than 16 × 16 +=→ 32 etc? And I’d guess (hopefully, without evidence) that would be faster than 16 × 16 +=→ 64 (because won’t it do the whole lot as 64-bits?).

We know that - but I’m stuck with it. It’s a great pity that the STM32 bit the dust after all the hard work you put into it. It’s a matter of great regret to me at least.

But how can you do that in C?

I don’t think there is a mac16x16_64() in those inlines, but you could potentially add one - just base it on mac16x16_32()

If you write it in C yes, it will all get promoted to 64 bits and be very slow. If you use these inlines then no, they’d do exactly what they say on the pack.

Exactly - what you’re saying, even just the inline macro for the multiplication only will be a significant improvement over (12 → 16) × (12 → 16) → 16
which in turn must be a significant improvement over (12 → 32) × (12 → 32) → 32

Looking at the assembler in

I don’t want to multiply two numbers, I want to square one number. So surely, rather than doing
(A + B) × (C + D) = A×C + A×D + B×C + B×D,

there’s a further economy to be had:

(A + B)² = A × A + B × B + (A × B) << 1

in other words, get rid of one multiply and replace it by a left-shift?
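Written out fully for a 16-bit value split into high byte A and low byte B (so N = 256·A + B, which is what the assembler partial products actually see), the identity becomes N² = (A² << 16) + ((A·B) << 9) + B²: three 8×8 multiplies and a shift instead of the four multiplies a general 16×16 product needs. A quick C demonstration of that economy (illustrative only, the function name is mine):

```c
#include <stdint.h>

/* Square a 16-bit value using three 8x8 multiplies:
 *   N = 256*A + B  =>  N^2 = A^2*2^16 + 2*A*B*2^8 + B^2
 * The doubled cross term 2*A*B*256 is just (A*B) << 9,
 * i.e. the fourth multiply is replaced by a shift. */
static inline uint32_t square16_3mul(uint16_t n)
{
    uint8_t a = (uint8_t)(n >> 8);   /* high byte */
    uint8_t b = (uint8_t)(n & 0xFF); /* low byte  */
    return ((uint32_t)(a * a) << 16)
         + ((uint32_t)(a * b) << 9)
         + (uint32_t)(b * b);
}
```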

It seems like I’ve got to re-learn my assembler - Atmel-style.

Any more detail on A and B, like size, range constraints, signed Vs unsigned etc? For example, if A was straight out of the ADC you could say it’s unsigned in the range of 0…4095.

In true RISC fashion, pretty much all the ALU instructions are single cycle, so a left shift costs the same as a multiply: 1 cycle. But being an 8-bit machine, the multiply has to be 8-bit x 8-bit → 16-bit. So without knowing their sizes, it’s hard to compare.

Which got me wondering… do you just need approximate magnitudes of V and I for your dynamic phase adjustment (like high, medium or low) or do you need the same precision that you need for everything else?

Both will be almost straight out of the ADC, but with 2048 subtracted - so signed 12-bits

I don’t need the best accuracy, because I’m computing a very rough log (of the squared value) by simply looking where the most significant bit is, but I want rather more than 3 levels. I’ll probably then straight line interpolate, but put a limit value above and below.
[Phase error is close to a straight line against log(V) or log(I)]

So log((A+B)²) where A and B are both signed 12-bit numbers?

Isn’t that just

2·log(A+B)

[EDIT] - I guess the problem there is (A+B) can be negative?

Any chance of losing the " - 2048" so they both become unsigned? I guess that would be the equivalent of adding a known DC offset to the signal, that you might be able to deal with towards the end?

Yes - but it doesn’t help. I was referring to the assembler multiply: I was using A & B to represent high byte & low byte.

Is your suggestion that squaring the totally raw 12-bit value and then subtracting 2048² will be faster? I’ll need to think that through, where to remove the offset and how it affects accumulation over the reporting period etc.

I still need the Value² to carry forward to the ultimate rms calculation without the log, so the natural way to do this is accumulate the cycle’s worth, use it for the phase error, and carry on accumulating cycles’ worths over the reporting period.
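For what it’s worth, the “lose the − 2048” idea can be done without giving up the centred sum of squares, by using the identity Σ(x−o)² = Σx² − 2·o·Σx + N·o². A sketch under that assumption (names are illustrative, not from emonLibCM): accumulate raw x and raw x² per sample, and remove the offset once per reporting period.

```c
#include <stdint.h>

#define ADC_OFFSET 2048   /* assumed mid-rail offset of a 12-bit ADC */

/* Accumulate raw (unsigned) samples; offset removed later. */
typedef struct {
    uint64_t sum_sq;   /* sum of x^2 (unsigned 12x12 multiply) */
    uint64_t sum;      /* sum of x                              */
    uint32_t n;        /* sample count                          */
} raw_acc_t;

static inline void raw_acc_add(raw_acc_t *a, uint16_t x)
{
    a->sum_sq += (uint32_t)x * x;
    a->sum    += x;
    a->n++;
}

/* Equivalent of having subtracted 2048 from every sample first:
 * sum((x - o)^2) = sum(x^2) - 2*o*sum(x) + n*o^2              */
static inline int64_t raw_acc_centred_sum_sq(const raw_acc_t *a)
{
    return (int64_t)a->sum_sq
         - 2 * (int64_t)ADC_OFFSET * (int64_t)a->sum
         + (int64_t)a->n * ADC_OFFSET * ADC_OFFSET;
}
```

Whether that wins depends on whether the unsigned multiply is cheaper than the signed one, and on the extra Σx accumulation - but it shows the offset removal can be deferred to the end.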

Ah ok. I misunderstood. I thought we were still at the C level and that you had two signed 12-bit values you wanted to add together and then square.

So as an example of using those inlines, and assuming the result of your square can fit into 24-bits, I started with this code snippet…

int16_t b1, b2;
int32_t square_b1, square_b2;

b1 = random(4096)-2048;
b2 = random(4096)-2048;

and then used two different techniques to calculate the squares:

muls16x16_24(square_b1, b1, b1);
square_b2 = (int32_t)b2 * (int32_t)b2;

The first one takes 17 cycles and the second takes 48 cycles, so a saving of about 2 µs, assuming a 16 MHz clock.


Out of interest, how do you find that bit?

Also, is there enough dynamic range in the V reading to move the MSB?

Given the sample rate that’s hoped for :wink: , 2 µs is significant. Thanks for that - I haven’t had chance to touch this for several days, I’m afraid.

Right-shift one bit at a time until it’s zero (it’s unsigned of course), then use the number of shifts. I did say it was rough. If it turns out to be not good enough, then’s the time for a rethink.

I’m less concerned about that - I’ll probably not use the log of that with the ZMPT101B c.t. as the voltage transformer. We know the commercial ICs assume a voltage divider and so offer no means of adjusting for phase error; the ZMPT isn’t quite there, but it’s a lot closer than an off-the-shelf a.c. adapter power supply transformer.

OK, I’m not sure if that part is in the critical path, but gcc provides some nice built-in functions for that kinda’ stuff that are pretty highly optimised: the __builtin_clz family, which count leading zeros.

I just did some tests on an AVR. I started with a very simple implementation of your description above:

inline int rough_log1(uint32_t value) {
  int shift_count = -1;

  while (value) {
    value >>= 1;
    shift_count++;
  }

  return shift_count;
}

and then wrote a plug compatible version that uses the built-in…

inline int rough_log2(uint32_t value) {
  return (31 - __builtin_clzl(value));   /* N.B. __builtin_clzl(0) is undefined */
}

The first one works really well on very small input values (unsurprisingly), but each iteration costs an additional 12 cycles. The second one uses anywhere from about 30 to 70 cycles depending on the input value. For input values of 16 and above (output values of 4 and above), the second one kills it.

You can check out their approach here, but they basically divide-and-conquer the bytes and then bits as needed.
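That divide-and-conquer idea can be sketched in plain C too (my own illustration of the technique, not gcc’s actual library code): test whole halves, then bytes, then nibbles, so the worst case is a handful of compare-and-shift steps instead of up to 31 single-bit shifts.

```c
#include <stdint.h>

/* Divide-and-conquer MSB finder: returns the bit position of the
 * most significant set bit, or -1 for an input of 0 (matching
 * rough_log1() above). Worst case is 5 test-and-shift steps. */
static inline int rough_log3(uint32_t v)
{
    int r = 0;
    if (v == 0) return -1;
    if (v & 0xFFFF0000u) { r += 16; v >>= 16; }  /* MSB in top half?  */
    if (v & 0xFF00u)     { r += 8;  v >>= 8;  }  /* top byte of that? */
    if (v & 0xF0u)       { r += 4;  v >>= 4;  }  /* top nibble?       */
    if (v & 0x0Cu)       { r += 2;  v >>= 2;  }  /* top bit pair?     */
    if (v & 0x02u)       { r += 1; }
    return r;
}
```

Unlike the shift-until-zero loop, the cost here is nearly flat across the input range, which is the same property that makes __builtin_clz competitive on larger inputs.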