-
Notifications
You must be signed in to change notification settings - Fork 13.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ultra-low overhead record last failed memory allocation, dump on a panic #4220
Conversation
Sample panic output:
|
Looks good, although it is a sad that we have to replicate this code twice to get the caller address. It would be much nicer to do this in Maybe we should look into enabling |
The docs say that anything other than (0) (i.e. caller) may crash in the docs. I can't vouch one way or the other. https://gcc.gnu.org/onlinedocs/gcc/Return-Address.html
|
On Xtensa with Window ABI it can indeed crash, if one tries to get the return address beyond the end of the stack. With sufficient checking of return values, we found it to be safe in practice. |
I tried and |
@d-a-v Good datapoint, what exactly was built and is that 500 bytes of add'l stack usage or 500 bytes of code? I'm not sure how large a full-fat stack frame is on the LX106, but if it's large this could have an impact on highly object-ified code. We'd also need to rebuild axtls, lwip, and newlib with the option if you want to get reasonable debugging coverage. I'd vote against changing this setting with a debug flag. This is most useful in the case when I don't have any debug enabled...if you're running w/debug, just enable OOM debug and you'll get even more info than this can provide. If you're not, at least you get a call stack (w/no-omit) or calling function (w/o no-omit) to work from. To move into umm_malloc, it would have to use no-frame-omit. OTW you'll catch only the wrapper malloc()/new()/etc. function as the caller. Right now it works because it overwrites multiple times in some cases (i.e. operator new() calls malloc. Malloc will write the caller as new() and return NULL. Operator new() will then overwrite the caller as (__builtin_caller(0)) and return NULL. I actually had it there, too, until I realized that it was effectively a no-op since nobody but the wrappers call umm_malloc(). Maybe a stepped approach. Get it out there for a release as-is, leaving the ABI stack-frame settings as-is, see if it's useful or people clamor for the more detail that the no-omit-frame can provide. |
I'm confused. I see the changes in new/delete/_*alloc_r(), but how does it work when the user calls malloc/realloc/calloc() directly? |
As well you should be, @devyte . It was there when I was testing, but I didn't include it in the patch as I thought (quite incorrectly, after examining umm_malloc_cfg.h) that the non *_rs actually fell through to the re-entrant versions (which would actually end up in infinite recursion upon closer inspection). I'll re-up the patch including the umm_malloc.c changes as well. The same if(fail) { record caller/size } 4-lines of code will be in umm_malloc, umm_realloc, umm_calloc. These will work because it's basically brute-force overwriting the last callback function multiple times on a failure. Should the -fno-omit-frame change go through, only the umm_malloc() calls would need to be instrumented (as long as the # of recorded callbacks exceeds 2 you'll end up back in a userland function). Is there some method/place in the repo for putting in test sketches? Something like this would be easily caught with a set of ~6 basic tests (each test would have to fault and abort so you couldn't roll them into a single one). I've been manually writing little one-offs to test, but it's obviously not methodical or easily reproducible. W/o an emulator doing CI on an embedded system is darn hard... |
@earlephilhower I have yet to fully dwell into the whole testing frame, so my knowledge about it is limited, but my current understanding is that there are 2 places for unit tests:
|
I've recently added some documentation about the tests: https://github.com/esp8266/Arduino/blob/master/tests/README.md#testing-on-device |
Cool, I hadn't noticed that at all. After looking at it I don't think it can be used for exception testing, but for standard library-type things it doesn't look too bad to implement! |
At runtime, whenever a realloc, malloc, calloc, or new call fails to find enough memory, record the calling function and size requested. Do NOT print anything or call any heavyweight functions on this, as it is legal to return NULL to an alloc-type call. If the main application handles this NULL properly, there should be no panic and nothing different will happen. If the main application panics() at any later point in time, this record will be printed out as part of the crash log for processing with an updated EspExceptionDecoder.java. This adds 2-3 instructions overhead in the normal case, and around 10-12 instructions in the failing case, and requires an additional 8 bytes of .BSS to function. Only a single address is kept, the final failing malloc-type function call before a panic, but it is trivial to extend to a fifo with little overhead in the common, non-failing case.
ff1241c
to
372cd00
Compare
may i suggest something like this:
this will avoid some code duplication and allows the feature to be switched off via compile flags. |
@earlephilhower what's the status of this? Is it ready for merge? |
@devyte, It'll be good for merge tomorrow after a minor merge fix due to some changes in postmortem.c. I'll have to ping @me-no-dev to see if he'd be willing to look at the the associated pull in me-no-dev/EspExceptionDecoder#31 so the decoder will actually show this new info, but this can go in w/o it as the info will just be ignored with the old decoder. |
At runtime, whenever a realloc, malloc, calloc, or new call fails to find enough memory, record the calling function and size requested. Does not print anything or call any heavyweight functions on this, as it is legal to return NULL to an alloc-type call. If the main application handles this NULL properly, there should be no panic and nothing different will happen. If the main application panics() at any later point in time, this record will be printed out as part of the crash log for processing with an updated EspExceptionDecoder.java. This adds 2-3 instructions overhead in the normal case, and around 10-12 instructions in the failing case, and requires an additional 8 bytes of .BSS to function. Only a single address is kept, the final failing malloc-type function call before a panic, but it is trivial to extend to a fifo with little overhead in the common, non-failing case. (cherry picked from commit 4a958c8)
I know there was an issue of gitter discussion about this with @d-a-v and @devyte , but I can't seem to find it now.
This patch implements a very low overhead way of recording the caller/size of the last allocation that failed, and only printing it out on a panic. Unlike the OOM debug, there's no add'l strings carried, as we use a gcc builtin to get the malloc/etc.'s caller address (a simple offset from a1 gives the return address, so this is very low CPU overhead wven when it fails).
Combined with a patch to EspExceptionDecoder this gives a reasonably useful insight into where things are going bad w/o changing your compilation options (i.e. works w/NDEBUG to allow highest RAM/flash app usage).
-edit- Update to pull 31, which has the toolchain bug workaround
EspExceptionDecoder patch: me-no-dev/EspExceptionDecoder#31
---snip---
At runtime, whenever a realloc, malloc, calloc, or new call fails to
find enough memory, record the calling function and size requested.
Do NOT print anything or call any heavyweight functions on this, as it
is legal to return NULL to an alloc-type call.
If the main application handles this NULL properly, there should be no
panic and nothing different will happen.
If the main application panics() at any later point in time, this record
will be printed out as part of the crash log for processing with an
updated EspExceptionDecoder.java.
This adds 2-3 instructions overhead in the normal case, and around 10-12
instructions in the failing case, and requires an additional 8 bytes of
.BSS to function.
Only a single address is kept, the final failing malloc-type function call
before a panic, but it is trivial to extend to a fifo with little overhead
in the common, non-failing case.