Hi @dBC. Usually values are scaled up so as not to lose precision. The Cortex M0 does not have an FPU (you have to go to the M4 & M7 for that), but it does have a 32-bit × 32-bit multiply in a single CPU cycle at 48 MHz.
A 12-bit wide data sample, so V × I could yield at most a 24-bit result (0xFFF × 0xFFF, or decimal 4095 × 4095).
The Arduino C/C++ compiler supports the uint64_t data type.
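Putting those points together, here is a minimal sketch of the headroom arithmetic (the function and variable names are invented for illustration):

```c
#include <stdint.h>

/* Signed 12-bit samples (raw ADC minus the mid-rail offset): -2048..2047.
   One V x I product fits comfortably in 32 bits (2048 * 2048 = 2^22),
   and an int64_t accumulator gives enormous headroom: at 2^22 per
   sample, it would take about 2^41 samples to overflow. */
int64_t accumulate_power(const int16_t *v, const int16_t *i, int n) {
    int64_t acc = 0;
    for (int k = 0; k < n; k++) {
        acc += (int32_t)v[k] * i[k];   /* 12 x 12 fits easily in 32 bits */
    }
    return acc;
}
```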
Thanks. I was actually referring to Robert's AVR code and his battle with the inevitable new data arrival tick, but I guess you're both working from the same code base.
If you don't fancy slipping into AVR assembler, Atmel have provided some nice C inline functions, and you can find a copy of them here, and you can see how @cbmarkwardt used them here.
In that example…
mac16x16_32(vstats.val2_sum,vval,vval); // .. squared voltage
takes two 16-bit values, multiplies them and accumulates the result into a 32-bit variable. There's also a mac16x16_24() that's even faster, if you know the summation can fit in 24 bits.
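For reference, the semantics of that inline are roughly this portable C (a sketch only - the real Atmel version replaces the body with a handful of MUL/MULS/ADD/ADC instructions with no 32-bit promotion, and the `_c` suffix here is my own):

```c
#include <stdint.h>

/* Portable C equivalent of the AVR mac16x16_32() inline:
   multiply two signed 16-bit values and accumulate the 32-bit
   product into a 32-bit accumulator. */
static inline void mac16x16_32_c(int32_t *acc, int16_t a, int16_t b) {
    *acc += (int32_t)a * (int32_t)b;
}
```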
Ouch - that's going to stir up the old - and I mean old - grey matter. It must be around the mid 1980s when I last had serious dealings with assembler.
One doubt I have - I'm not a career programmer obviously (I'm really a systems/applications/projects engineer who just happens to have done a fair bit of programming over the years) - would my attempts at assembler be better than a decent compiler's? Yes, I know a compiler might think it needs to cater for all eventualities whereas I know the bounds of every value, but even so, I wonder.
24-bit multipliers would suit me, then accumulate into 32 or 64 bits - the plan was two stages: 32-bit until it could be close to full, then accumulate the sub-total into 64-bit, which would be about every 100 ms.
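That two-stage plan could look something like this in C (all names and the flush threshold here are hypothetical):

```c
#include <stdint.h>

/* Two-stage accumulation: squares of signed 12-bit samples (max
   2048^2 = 2^22) go into a fast 32-bit accumulator, which is
   flushed into the 64-bit grand total well before the ~512 samples
   (2^31 / 2^22) that could overflow it. */
typedef struct {
    int32_t fast;   /* per-burst sub-total        */
    int64_t total;  /* reporting-period sub-total */
    int     count;  /* samples since last flush   */
} acc2_t;

static inline void acc2_add_sq(acc2_t *a, int16_t sample) {
    a->fast += (int32_t)sample * sample;
    if (++a->count >= 256) {          /* flush with margin to spare */
        a->total += a->fast;
        a->fast = 0;
        a->count = 0;
    }
}

static inline int64_t acc2_read(const acc2_t *a) {
    return a->total + a->fast;        /* include the unflushed part */
}
```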
There will be two f.p. operations per power per cycle when I calculate the real power from the "partial powers" accumulated over that cycle.
Thanks for all those pointers, I'll have to look at them.
Fear not, when you dig through those pointers you'll see you don't need to write any assembler. To you, it'll look like you're just calling a C function - there'll be no overhead of a function call though, it's all inlined.
It's definitely better than anything you'll get out of gcc. C will be doing 32-bit operations (on an 8-bit micro) so it'll be constantly checking for carries etc. These inline functions do the bare minimum of instructions to do what they promise - multiply two 16-bit numbers together and add the result into a 24-bit accumulation, for example. You can't do that in C. C will just promote everything to 32 bits and assume the worst with regard to dynamic range, so it has to do a lot of carry checks.
hmm… they may not work for you then. If you could take advantage of 16-bit multiplies being accumulated into a 32-bit accumulator, they'll be faster than C.
Have a look at the code generated when you multiply two 32-bit values together in C, on an 8-bit micro. It's an awful lot of instructions… most of which can be done away with once you know your own dynamic range limits.
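For a feel of what the compiler is up against, here's a 32 × 32 multiply decomposed into 16-bit partial products in portable C (a sketch only - on an AVR the limbs are 8-bit, so there are even more partials and carry chains):

```c
#include <stdint.h>

/* A 32x32 -> 32 multiply built from 16-bit halves, roughly what a
   compiler must synthesise when the hardware multiplier is narrow.
   Keeping only the low 32 bits: the ah*bh term (shifted up 32)
   vanishes, leaving al*bl plus the two cross terms shifted up 16. */
uint32_t mul32_from_16(uint32_t a, uint32_t b) {
    uint16_t al = (uint16_t)a, ah = (uint16_t)(a >> 16);
    uint16_t bl = (uint16_t)b, bh = (uint16_t)(b >> 16);
    uint32_t result = (uint32_t)al * bl;      /* low partial product */
    result += ((uint32_t)al * bh) << 16;      /* cross terms: only   */
    result += ((uint32_t)ah * bl) << 16;      /* low halves survive  */
    return result;                            /* ah*bh: shifted out  */
}
```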
Once you're on a 32-bit CPU like @ozpos, that aspect goes away - it all just becomes a single instruction. But 32-bit arithmetic is hard work for an 8-bit micro.
(N.B. The shorthand += → n means accumulated into n bits, etc.)
Yes, but 12 × 12 → 32, followed by 32 += → 32 a few times, then 32 += → 64 is surely faster than 16 × 16 += → 32 etc? And I'd guess (hopefully, without evidence) that would be faster than 16 × 16 += → 64 (because won't it do the whole lot as 64 bits?).
We know that - but I'm stuck with it. It's a great pity that the STM32 bit the dust after all the hard work you put into it. It's a matter of great regret to me at least.
I don't think there is a mac16x16_64() in those inlines, but you could potentially add one - just base it on mac16x16_32().
If you write it in C, yes, it will all get promoted to 64 bits and be very slow. If you use these inlines then no, they'd do exactly what they say on the pack.
Exactly what you're saying - even just the inline macro for the multiplication alone will be a significant improvement over (12 → 16) × (12 → 16) → 16
which in turn must be a significant improvement over (12 → 32) × (12 → 32) → 32
Any more detail on A and B, like size, range constraints, signed vs unsigned etc? For example, if A was straight out of the ADC you could say it's unsigned in the range 0…4095.
In true RISC fashion, pretty much all the ALU instructions are single cycle, so a left shift costs the same as a multiply: 1 cycle. But being an 8-bit machine, the multiply has to be 8-bit × 8-bit → 16-bit. So without knowing their sizes, it's hard to compare.
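Concretely, a 16 × 16 → 32 multiply built from that 8 × 8 → 16 hardware primitive looks roughly like this (portable C sketch, function name mine):

```c
#include <stdint.h>

/* 16x16 -> 32 unsigned multiply from four 8x8 -> 16 partial
   products, mirroring the AVR's single-cycle MUL:
     result = al*bl + ((al*bh + ah*bl) << 8) + (ah*bh << 16)
   Partials are widened to 32 bits so their carries survive. */
uint32_t mul16x16_from_8(uint16_t a, uint16_t b) {
    uint8_t al = (uint8_t)a, ah = (uint8_t)(a >> 8);
    uint8_t bl = (uint8_t)b, bh = (uint8_t)(b >> 8);
    uint32_t r = (uint32_t)al * bl;
    r += ((uint32_t)al * bh) << 8;
    r += ((uint32_t)ah * bl) << 8;
    r += ((uint32_t)ah * bh) << 16;
    return r;
}
```

The four MUL instructions are cheap; it's the carry propagation between the partials that the hand-tuned inlines minimise.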
Which got me wondering… do you just need approximate magnitudes of V and I for your dynamic phase adjustment (like high, medium or low) or do you need the same precision that you need for everything else?
Both will be almost straight out of the ADC, but with 2048 subtracted - so signed 12-bit values.
I don't need the best accuracy, because I'm computing a very rough log (of the squared value) by simply looking at where the most significant bit is, but I want rather more than 3 levels. I'll probably then straight-line interpolate, but put a limit value above and below. [Phase error is close to a straight line against log(V) or log(I).]
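One way that "MSB position plus straight-line interpolation" idea could be sketched - the scaling (16 steps per octave) and the zero-input result are arbitrary choices of mine, not anything from the thread:

```c
#include <stdint.h>

/* Rough log2 in sixteenths: the integer part is the MSB position,
   the fractional part is the 4 bits just below it (straight-line
   interpolation between powers of two). */
int rough_log2_q4(uint32_t value) {
    if (value == 0) return 0;          /* clamp at the bottom */
    int msb = 0;
    uint32_t v = value;
    while (v >>= 1) msb++;             /* find the MSB position */
    /* align the 4 bits just below the MSB into bits 3..0 */
    int frac = (msb >= 4) ? (int)((value >> (msb - 4)) & 0xF)
                          : (int)((value << (4 - msb)) & 0xF);
    return (msb << 4) | frac;
}
```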
So log(A+B)² where A and B are both signed 12-bit numbers?
Isnât that just
2log(A+B)
[EDIT] - I guess the problem there is (A+B) can be negative?
Any chance of losing the " - 2048" so they both become unsigned? I guess that would be the equivalent of adding a known DC offset to the signal, that you might be able to deal with towards the end?
Yes - but it doesn't help. I was referring to the assembler multiply: I was using A & B to represent high byte & low byte.
Is your suggestion that squaring the totally raw 12-bit value and then subtracting 2048² will be faster? I'll need to think that through, where to remove the offset and how it affects accumulation over the reporting period etc.
I still need the Value² to carry forward to the ultimate rms calculation without the log, so the natural way to do this is accumulate the cycle's worth, use it for the phase error, and carry on accumulating cycles' worths over the reporting period.
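For what it's worth, the algebra for removing the offset after accumulation is exact: Σ(raw − 2048)² = Σraw² − 2·2048·Σraw + N·2048², so keeping running totals of raw and raw² lets the 2048 come out once per period instead of per sample. A numeric check of the identity (sample values invented):

```c
#include <stdint.h>

/* Compute sum((raw - 2048)^2) from running totals of raw and raw^2:
     sum(raw^2) - 2*2048*sum(raw) + n*2048^2
   so the DC offset is removed once per accumulation period rather
   than once per sample. */
int64_t sum_sq_offset_removed(const uint16_t *raw, int n) {
    int64_t sum = 0, sum_sq = 0;
    for (int k = 0; k < n; k++) {
        sum    += raw[k];
        sum_sq += (int64_t)raw[k] * raw[k];
    }
    return sum_sq - 2 * 2048 * sum + (int64_t)n * 2048 * 2048;
}
```

Whether the per-sample subtraction or the per-period correction is actually faster on a given micro is a separate question - this only shows the two give the same answer.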
Given the sample rate that's hoped for, 2 µs is significant. Thanks for that - I haven't had chance to touch this for several days, I'm afraid.
Right-shift one bit at a time until it's zero (it's unsigned of course), then use the number of shifts. I did say it was rough. If it turns out to be not good enough, then's the time for a rethink.
I'm less concerned about that - I'll probably not use the log of that with the ZMPT101B c.t. as the voltage transformer. We know the commercial I.Cs assume a voltage divider so offer no means of adjusting for phase error; the ZMPT isn't there, but it's a lot closer than an off-the-shelf a.c. adapter power supply transformer.
OK, I'm not sure if that part is in the critical path, but gcc provides some nice built-in functions for that kinda' stuff that are pretty highly optimised: the clz series count leading zeroes.
I just did some tests on an AVR. I started with a very simple implementation of your description above:
inline int rough_log1(uint32_t value) {
    int shift_count = -1;
    while (value) {
        value >>= 1;
        shift_count++;
    }
    return shift_count;
}
and then wrote a plug-compatible version that uses the built-in…
inline int rough_log2(uint32_t value) {
    return (31 - __builtin_clzl(value));
}
The first one works really well on very very small input values (unsurprisingly), but each iteration costs an additional 12 cycles. The second one uses anywhere from about 30 cycles to 70 cycles depending on the input value. For input values of 6416 and above (output values of 4 and above), the second one kills it.
You can check out their approach here, but they basically divide-and-conquer the bytes and then bits as needed.
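The shape of that divide-and-conquer, sketched in portable C (not gcc's actual implementation - just the same narrowing idea):

```c
#include <stdint.h>

/* Divide-and-conquer MSB finder in the spirit of the libgcc clz
   helpers: narrow down by 16, then 8, 4, 2, 1 bits instead of
   shifting one bit at a time.  Returns 31 - clz(value); the
   value == 0 case is handled explicitly (clz(0) is undefined). */
int msb_position(uint32_t value) {
    if (value == 0) return -1;
    int pos = 0;
    if (value >> 16) { pos += 16; value >>= 16; }
    if (value >> 8)  { pos += 8;  value >>= 8;  }
    if (value >> 4)  { pos += 4;  value >>= 4;  }
    if (value >> 2)  { pos += 2;  value >>= 2;  }
    if (value >> 1)  { pos += 1; }
    return pos;
}
```

Five comparisons regardless of input value, which is why its worst case stays flat while the shift-one-bit loop's cost grows with the magnitude.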