That’s close to what I do, except I preserve machine state (including the offending PC of the hang) to external FRAM rather than EEPROM. The internal EEPROM is only good for 100K cycles so if there’s any chance it can get into a mode where it happens continuously, you might need to throttle back to avoid wearing the EEPROM out. I’ll attach my code below in case it’s of any use (much of it is specific to my system and can be ignored).
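For the throttling, one rough way to do it is to keep a write counter next to the saved state and refuse further writes once it hits a cap. This is only a sketch: `crash_log_t`, `CRASH_LOG_MAX_WRITES` and `crash_log_should_write` are names I made up, the cap value is arbitrary, and the actual EEPROM read/write of the record is left out.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical cap, chosen to sit well under the 100K-cycle endurance figure.
#define CRASH_LOG_MAX_WRITES 1000u

// Hypothetical bookkeeping record, stored alongside the saved machine state.
typedef struct {
    uint16_t writes_so_far;   // number of crash records written so far
} crash_log_t;

// Returns true if another crash record may be written, and counts it.
// Once the cap is reached it returns false and leaves the counter alone,
// so a continuous reset loop cannot keep hammering the EEPROM.
static bool crash_log_should_write(crash_log_t *log) {
    if (log->writes_so_far >= CRASH_LOG_MAX_WRITES) {
        return false;
    }
    log->writes_so_far++;
    return true;
}
```

The counter itself has to live in the EEPROM too, so each logged crash costs a write to it; the cap bounds the total either way. Something outside the crash path (a service command, say) would need to clear it.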
I just checked the datasheet for your 328P and it looks like the BOD has four settings: 4.3V, 2.7V, 1.8V, and disabled. It’d be interesting to know whether you guys have it set to 2.7V or disabled. The 2.7V is nominal: it can trigger anywhere from 2.5V to 2.9V, and Vcc needs to remain below the threshold voltage for tBOD before it triggers. Mysteriously, tBOD is referenced but not specified in the 328P datasheet I have. For reference, in the 2560 that I’m more familiar with, it’s 2 µs.
It looks like your device is spec’d to run fine all the way down to 2.7V, but only at 8MHz. I vaguely recall you guys are already overclocking it even at 3.3V. I run my 2560 at 8MHz and 3.3V, and set the BOD to 2.7V.
How do you deal with that in the case of a real power failure?
If you set the AVR BOD to trigger early, the other potential issue to deal with is that it only resets the AVR, not the entire board. So your AVR code will start afresh at init(), but that code may well assume that all the external h/w has also just come out of reset and is in virgin state. Depending on the nature of the Vcc glitch/sag and the various BODs in all the devices, that may not be a valid assumption: the RFM module might still be in mid-transaction from the AVR’s previous life. The approach I take is to have a processor GPIO output pin, /RESET_EXT_DEVICES, that drives all the /RESET pins on the external devices. Then when the AVR starts a new life it always bangs on that pin, so it knows all the external devices are also fresh out of reset.
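As a rough illustration of the /RESET_EXT_DEVICES idea (not my actual code), the pin handling boils down to something like this. The bit position and both function names are made up, and the port register is passed in as a pointer so the same code can be exercised off-target; on the AVR you would pass something like `&PORTB` after setting the pin as an output.

```c
#include <stdint.h>

// Hypothetical: bit position of /RESET_EXT_DEVICES within its port.
#define RESET_EXT_DEVICES_BIT 0u

// Drive the active-low reset line low, holding every external device in reset.
static void assert_external_reset(volatile uint8_t *port) {
    *port &= (uint8_t)~(1u << RESET_EXT_DEVICES_BIT);
}

// Drive the line high again, letting the external devices start up.
static void release_external_reset(volatile uint8_t *port) {
    *port |= (uint8_t)(1u << RESET_EXT_DEVICES_BIT);
}
```

Early in init() you would call `assert_external_reset()`, wait at least the longest minimum reset pulse width of any device on the line, call `release_external_reset()`, and then wait out the slowest device's startup time before talking to anything.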
//
// Now that the actual ISR has carefully fetched the stack pointer from the stack
// frame, we can go all gung-ho with further stack usage and implement the guts
// of the wdog ISR, which is basically to preserve machine state to FRAM.
//
static void __attribute__ ((noinline)) wdog_isr_guts (uint32_t *stackp) {
    uint32_t prog_counter, pc_swapped;

    prog_counter = *stackp; // fetch it off the stack
    //
    // Perversely, the AVR uses big-endian for the return PC on stack and little endian
    // for everything else. On our 2560 the PC is 3 bytes wide, so we swap around byte0 and byte2,
    // leave byte1 where it is, and zero byte3.
    //
    pc_swapped = (uint32_t)((prog_counter & 0xff) << 16) |
                 (uint32_t)(prog_counter & 0x0000ff00) |
                 (uint32_t)((prog_counter & 0x00ff0000) >> 16);
    //
    // Next, all AVR instructions are multiples of 2-byte words long (typically just one 2-byte word long).
    // The PC is a word pointer, but the gcc listings and maps are all byte based, so we multiply by 2 here
    // so our displayed PC matches what's in the link maps and disassembly listings.
    //
    pc_swapped *= 2;
    //
    // Update the health report block with wdog info, and write it out to FRAM for
    // reporting in our next life.
    //
    health_report.wdog_portc = PORTC;
    health_report.wdog_porta = PORTA;
    PORTA = PORTC = SEL_NOBODY; // Give everyone plenty of time to get off the bus
    health_report.wdog_pc = pc_swapped;
    health_report.wdog_fw_version_maj = FW_MAJOR_VERSION;
    health_report.wdog_fw_version_min = FW_MINOR_VERSION;
    health_report.wdog_link_status = link_led_shadow;
    health_report.last_known_pid = last_pid;
    write_fram_block(0, (uint8_t *)&health_report, sizeof(health_report));
    //
    // Prepare for death
    //
    cli();     // In case we call it from somewhere other than ISR
    while(1);  // Wait for the 2nd bite.
}
//
// The wdog has been set up to generate an ISR on the first firing, and a
// reset on the second. This is the handler for that first firing. This ISR
// never returns, so we don't have to preserve any system state. By going NAKED,
// we disable all prologue which means SP is pointing to the first free byte of
// stack, just 1 byte below where the PC has been stored. We want to get that
// PC to help determine where the hang is. The one thing we really do want from
// the missing prologue is the re-zeroing of r1. If the interrupt happened to fire
// just after a MUL instruction, then r1 will be non-zero, but the compiler assumes
// it will always be zero. We brute-force it back to zero here, just in case the
// code in the guts() routine requires it. Again, we don't care about preserving its
// old value, because we're never going back.
//
ISR(WDT_vect, ISR_NAKED) {
    register uint8_t *stack_pointer;           // use 'register' to avoid allocating more stack
    asm("eor __zero_reg__ , __zero_reg__"::);  // ensure r1 is zero, normally done by isr prologue
    stack_pointer = (uint8_t *)SP;             // fetch the current stack pointer
    stack_pointer++;                           // back up one byte to return PC
    wdog_isr_guts((uint32_t *)stack_pointer);  // let guts treat it like a 32-bit entity
    while(1);                                  // wait for the reset to happen, if guts doesn't
}
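
If the byte shuffling in wdog_isr_guts() is hard to follow, here is the same conversion written byte-by-byte as a standalone function that can be checked on a PC. It is only an illustration: `wdog_frame_to_byte_addr` is a made-up name, and it assumes a 3-byte PC as on the 2560. For example, stack bytes 0x00 0x12 0x34 (lowest address first) are word address 0x001234, which is byte address 0x002468 in the listings.

```c
#include <stdint.h>

// 'frame' points at the three return-PC bytes in memory order (lowest
// address first), i.e. what sits at SP+1 on entry to the naked ISR.
// The AVR stores the return PC big-endian, so frame[0] is the high byte.
static uint32_t wdog_frame_to_byte_addr(const uint8_t *frame) {
    uint32_t word_addr = ((uint32_t)frame[0] << 16) |
                         ((uint32_t)frame[1] << 8)  |
                          (uint32_t)frame[2];
    // The PC counts 2-byte words; the gcc maps and listings count bytes.
    return word_addr * 2u;
}
```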