[TW#17765] I2C crashing - watchdog timeout (master & v3.0 branch) #1503
In addition, I've not seen the problem occur with WiFi disabled, but when I enable WiFi I've now seen it happen about five times. |
Another dump with slightly different info:
|
Also, if I hold down the reset button on my ESP32 board for about a second (rather than just pressing it quickly) then the issue does clear on next boot. Pressing it too quickly doesn't help. |
From what I can tell so far, this only occurs when there's a glitch on the I2C bus and WiFi is enabled. If WiFi is disabled, the glitch may happen but it doesn't cause this cascade-reset issue. |
I've now seen this happen with the v3.0 branch (e7dc749). |
To be clear, I am using a fairly long cable with my I2C light sensor (TSL2561) - this cable has the side effect of causing the occasional I2C error (missing slave ACK or timeout). Normally this doesn't cause a problem (apart from the failed transaction) however it appears to trigger this issue occasionally. If I use a much shorter cable, the occasional I2C error does not occur, and neither does this issue. So the cable is clearly related to the issue, but I believe it is triggering the problem by inducing I2C errors, rather than causing it. I think the I2C master shouldn't fail like this in the event of a slave or comms issue. |
I have now observed this issue occur when WiFi is disabled. |
Thanks for the detailed descriptions. I have a question: how large is the pull-up resistor? We will run some tests under similar conditions and get back to you. |
Thanks for the detailed descriptions. |
@panfeng-espressif https://github.com/DavidAntliff/esp-mqtt/ It is a submodule of his other project |
SDA and SCK pull up resistance measured (with ESP32, and AVR, removed) at 7.4 kOhm each. Not sure if it's relevant but I sometimes get a similar error (and resulting continual crash) involving RMT (this time with ESP IDF v3.0):
Might be a red herring though... |
On several occasions the system has run for 3+ days without me seeing this error. Other times it can happen after just a few minutes since power up. I have noticed that it doesn't seem to crash, or enter this continual crash loop, if I don't have a serial console connected, but I need to do more tests to confirm this for sure. |
Hi @DavidAntliff , |
There isn't really any modification required - I am currently running from the head of my Extra info: I have it in a crash-loop now, and it's failing over and over during or shortly after the call to
It ran without errors, performing many I2C operations (one or two a second) for 4 hours, 3 minutes and 43 seconds before this occurred. |
Hi, @DavidAntliff |
I'm mostly using my ESP-IDF_v3.0 branch (with ESP IDF v3.0) at the moment (which has the same problem as my branch that uses the ESP IDF master branch), and I do call Here's the log when my application (commit 801dca9ce8833707fa6bb44c6b614fc0f24fc7b4) starts:
|
If I hold down the reset button for a few seconds, I get a slightly different log:
|
What do these lines mean during boot?
They don't appear on a first boot after power cycle or a long board reset, but they do appear on a |
Let me confirm that you are using three I2C light sensors (TSL2561) as your I2C slaves and the ESP32 as the I2C master. |
Here's my schematic: https://drive.google.com/file/d/1Tl_T7OjuoEaj_AxchCVL4cJy9trVFNh_/view?usp=sharing Less importantly, it's currently built on vero board: https://drive.google.com/open?id=1ERjawt9woHZCbJvHvDzopnqYay1ns9jU On the I2C side of things I'm using one TSL2561 sensor, and one ATTiny84A as an I2C slave (see code here), and one LCD 1602 display with I2C adapter. I'm also using 3 DS18B20 temperature sensors on the OneWire bus, and using RMT to drive this, and I'm sometimes seeing issues with that too - not sure yet if it's related to I2C. |
The strange thing about this problem is that sometimes it can run for several days without error (thousands of successful I2C operations), and other times it encounters an error in just a few minutes. I've been unable to determine any real pattern to this at all. The only thing that seems certain is that once it crashes, it just keeps crashing over-and-over. I'm planning to run some tests with fewer I2C devices to see if I can work out which combination causes the issue. This seems like an obvious thing to do, and I've already started doing that, but since days can elapse between instances, it's very difficult to know for certain whether a particular combination is OK or not. What I really need to do is find a 100% reliable way to trigger the fault, and then I can eliminate combinations very quickly. Is there any way that an I2C slave (such as the AVR) can hold the bus in such a way that the ESP32 is unable to recover? |
Where is the definition of the function 'rmt_get_ringbuf_handler'? |
@panfeng-espressif the function has been renamed recently to |
This thread on the forum is about the same thing: https://esp32.com/viewtopic.php?f=13&t=4275 With help from ESP_Angus I decoded the watchdog timeout address to be at line 1045 in i2c.c. This section of code deals with ACK and TIMEOUT errors, then tries to push a command to the command event queue. If the queue is full, and the call blocks, would this cause a watchdog event? However Is there a hardware state that can happen that would cause many multiple failures (and queue posts) in the same call to the handler? What if the I2C bus was a bit unstable, and errors were not only possible but fairly common? If this situation can persist over a reset, then that suggests it's being left in an unexpected hardware state - should the driver attempt to reset the peripheral (perhaps via I notice the loop in the interrupt handler beginning at line 363 only terminates when status == 0. If the FSM is stuck, would this loop ever terminate? Will it yield in all cases? If not, would this cause a watchdog event? I recall that there is a hardware issue that was worked around in 3.0 with |
Also, not sure if it's useful, but after observing with an oscilloscope I see that there is no edge activity on the I2C bus once this crash (and subsequent rolling crashes) occurs. Both SDA and SCL stay high (not sure if driven by a bad slave or master, or just by pull-ups - I can probably test that with a known resistive load). So the second and subsequent crashes are happening without any real I2C operations taking place. |
Can you call xQueueOverwriteFromISR from non-isr context? |
@negativekelvin I was just looking at that too - the first call to |
Well in normal cases I don't think the first call would reach those code paths. Have you tried removing the attiny device from the equation? Does calling i2c_hw_fsm_reset before starting the first transaction stop the boot loop? |
Hi David! I suggest that you scope the SDA and SCL lines with an oscilloscope during your tests. Test one device at a time. I noticed that both lines are very sensitive to the squareness of the signals. In my case, I used 3K3 ohm resistors to pull up the lines. I see that you are using internal pull-ups. ESP32 - I2C Scanner using Arduino IDE (working) |
@Gustavomurta thanks for the advice. Here's the thing - I know that my I2C bus isn't perfect, and it would be good if I could condition my signals to avoid an issue, but the problem is that there's always a risk of errors on the bus due to noise. In the event of a failed I2C transaction, the bus will be in an error state, and that's fine if the software can detect that and return an error code to the caller. The problem is that the ESP32 I2C peripheral has a bug that causes its internal finite-state-machine (FSM) to lock up if SDA or SCK are electrically affected in certain ways. This is a known issue and acknowledged by Espressif. There is a fix in the 3.0 stream that attempts a FSM reset when there is a transaction timeout and the hardware busy flag is still raised. I see this fix activate sometimes and it seems to work. The issues that I have documented here are related to this, I think, but take it further:
Because 2. happens almost every single time 1. does, I suspect that 1. is related to the FSM failure. It may be a cause, or it may be incidental, I'm not sure. I don't know enough about the FSM failure to know whether it can cause a flood of interrupts. So my point is that although there's a lot I can do to improve the I2C bus in my particular circuit, there's a real issue with the ESP32 software interaction with the hardware at the moment that is causing I2C failures for multiple people, and Espressif are in the best possible place to investigate this now that there's a way to reproduce it. I am using external pull-ups BTW. The issue is also unrelated to bus speed. It happens at 10 kHz almost as often as it happens at 100 kHz. |
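The fix described in this comment (attempt an FSM reset when a transaction times out and the hardware busy flag is still raised) can be sketched roughly as follows. This is only an illustration of the idea, not the ESP-IDF implementation: `xfer_result_t`, `i2c_hooks_t`, `guarded_transfer` and the `sim_*` helpers are hypothetical names invented here, and the simulated peripheral stands in for real hardware so the logic can be exercised on a host machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes and hardware hooks, for illustration only;
 * none of these are ESP-IDF names. */
typedef enum { XFER_OK, XFER_NACK, XFER_TIMEOUT } xfer_result_t;

typedef struct {
    xfer_result_t (*transfer)(void *ctx); /* run one I2C transaction */
    bool (*bus_busy)(void *ctx);          /* read the hardware busy flag */
    void (*fsm_reset)(void *ctx);         /* reset the peripheral state machine */
    void *ctx;
} i2c_hooks_t;

/* Run one transaction. If it times out while the hardware still reports
 * busy, assume the FSM is stuck and reset it. The error is still returned,
 * so the caller decides whether to retry. */
xfer_result_t guarded_transfer(const i2c_hooks_t *hw, bool *did_reset)
{
    assert(hw != NULL);
    xfer_result_t result = hw->transfer(hw->ctx);
    bool reset = false;
    if (result == XFER_TIMEOUT && hw->bus_busy(hw->ctx)) {
        hw->fsm_reset(hw->ctx);
        reset = true;
    }
    if (did_reset != NULL) {
        *did_reset = reset;
    }
    return result;
}

/* Simulated peripheral (assumption): while `stuck` is set, every transfer
 * times out and the busy flag stays raised until the FSM is reset. */
typedef struct {
    bool stuck;
    int resets;
} sim_i2c_t;

static xfer_result_t sim_transfer(void *ctx)
{
    sim_i2c_t *s = ctx;
    return s->stuck ? XFER_TIMEOUT : XFER_OK;
}

static bool sim_bus_busy(void *ctx)
{
    sim_i2c_t *s = ctx;
    return s->stuck;
}

static void sim_fsm_reset(void *ctx)
{
    sim_i2c_t *s = ctx;
    s->stuck = false;
    s->resets++;
}

/* Bundle the simulated hooks for use with guarded_transfer(). */
i2c_hooks_t sim_hooks(sim_i2c_t *s)
{
    i2c_hooks_t hw = { sim_transfer, sim_bus_busy, sim_fsm_reset, s };
    return hw;
}
```

With a simulated stuck peripheral, the first call reports the timeout and performs one reset, and the next call succeeds without a further reset. On real hardware the hooks would wrap whatever the driver provides for running a transaction, reading the busy flag and resetting the FSM.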
Ok, I was curious to know what the value of the resistors you are using. Thank you. |
I have also recently faced this problem, and I agree with @DavidAntliff that there is something wrong with the i2c driver not being fully tolerant of transmission errors. In my case, it was due to not using external pullups: I was just using the internal ones, with just one peripheral and 5cm cables on a breadboard. I was seeing random communication errors, and at some point the wdt interrupt timeout error would appear. Once this error appeared, the ESP32 was stuck in a reboot loop. I had to power down the ESP32, or hold the reset button for a long press, to get it working again. When touching the cables with my fingers, this happened a lot more often and sooner, which is how I realized I needed external pullups. Once I added external pullups (4.7K), all communication errors and wdt interrupt errors were gone. But I understand there is something wrong with the driver, and there is a risk of entering this error state. Maybe if there is a lot of external EMI? Or maybe if some connection is a bit loose? Or in case the i2c pad of the WROVER module I am using disconnects from my PCB and picks up EMI noise... I don't know... This scares me because my product has an internal battery and there is no power switch or reset button, and the battery and PCB are inside a housing that is not accessible to the user. So if this happens, I am in big trouble, and the product will have to be shipped back to me so I can open it and disconnect the power. Right now I am looking for a solution, and I may add a way of resetting the ESP32 from another MCU, since my product has an ESP32 plus an nRF52. This should be addressed by finding the real cause and fixing it, in order to make the i2c driver fully tolerant of communication errors. I have seen some people on the forum reporting that some gpio interrupts fire more times than expected; could this be related? |
Hi Luis, If there are many doubts, I suggest you to test with 3K3 resistors for 3.3V. |
Thank you @Gustavomurta . |
@panfeng-espressif sorry, I missed your comment:
No, I haven't specifically noticed this - is this something you've seen with my circuit, or in general? My reproduction method ("brush the SDA/SCL wires together") could easily replicate this scenario though. I thought that it was a known issue that grounding SCL or SDA at the wrong time could cause the FSM to enter a locked state? |
Just found this other issue at the arduino-esp32 repo, could be related? Can anyone test if changing that pin declaration as they propose has any effect? |
@DavidAntliff |
@luisonoff thanks, I'll look into it with my reproduction code. @panfeng-espressif ok, that sounds strange - I can't recall anything in my code that would drive SCL low - the SCL GPIO is only used for I2C operations. Do you see this often? I will keep an eye out for it. |
I think I have had the same boot-loop issue a few times, using IDF da81b97 (git from 2017-12-31). EDIT: confirmed, the stack trace is similar to other crashes, but has minor differences:
And the boot-loop after that:
Hardware: one TCA9535 I2C expander |
I am seeing this issue using esp-idf master 3ede9f (Feb 20, 2018) and using ESP32-DevKitC. In my case, I first observed this issue while running a test whose side effect is some noise on the I2C bus. This noise in turn leads to some failed I2C transactions after which I immediately see the crash with During normal circumstances when all I2C transactions succeed (with a 1.2K pullup), I can keep running without any errors. In trying to reproduce the same error without running my complicated test, I was able to use pull-ups of a higher value (4.7K) to cause the crash. In this case I observe quite a few failed I2C transactions during normal operation and then the crash. Seems like the driver is not able to handle some error conditions and ultimately causes the crash. Would be great if someone can look into this since it is hard to guarantee that the I2C bus will always function without any errors. |
@luisonoff When the FSM gets hung up in a Bus Busy state, it can be recovered by manually stimulating SCL and SDA. I have connected two additional GPIO pins, one to SCL and one to SDA. The bus busy state can be cleared by transmitting a START (SDA going low while SCL is high), then nine SCL high-to-low pulses, then a STOP (SCL high, a 100 us delay, then SDA high). Chuck. |
@stickbreaker interesting - I see you've done some work on this on the Arduino side. Can you explain how your FSM clearing approach is different to However the main issue, in my opinion at least, is the watchdog reset inside the interrupt service routine, and it's not 100% clear to me that this is caused by the FSM busy state (although all indicators so far suggest that it is). Is there an opportunity to trap and clear this FSM busy state before the watchdog timeout occurs? I suspect not though, since the interrupts have already been generated. But this is all guesswork on my part. @panfeng-espressif @igrr Given that many people have now reported seeing this issue too, is Espressif able to give us an update on their investigation please? I'm hoping for a fix so my project can proceed. |
@DavidAntliff
Since all of the problems have been encountered by people using the ESP32 in a SINGLE master I2C configuration, the FSM will hang indefinitely waiting for the 'other' master to complete its transaction. Using additional GPIO pins to act as another I2C master will solve the Bus_Busy problems. I haven't solved the actual TIMEOUT interrupt cascades.
gpio_set_direction(scl_io, GPIO_MODE_OUTPUT_OD);
gpio_set_direction(sda_io, GPIO_MODE_OUTPUT_OD);
gpio_set_level(scl_io, 0);
gpio_set_level(sda_io, 0);
for (int i = 0; i < 9; i++) {
gpio_set_level(scl_io, 1);
gpio_set_level(scl_io, 0);
}
gpio_set_level(scl_io, 1);
gpio_set_level(sda_io, 1);
i2c_set_pin(i2c_num, sda_io, scl_io, 1, 1, I2C_MODE_MASTER);
return ESP_OK;
This code should be changed to something like this:
gpio_set_level(scl_io, 1); // initial condition: SCL needs to be high
gpio_set_level(sda_io, 1); // initial condition: SDA needs to be high
// a small delay (5us) 1/2 clock at 100khz
gpio_set_level(sda_io, 0);// Issues START
// a small delay (5us) 1/2 clock at 100khz
for (int i = 0; i < 9; i++) {
gpio_set_level(scl_io, 0);
// a small delay (5us) 1/2 clock at 100khz
gpio_set_level(scl_io, 1);
// a small delay (5us) 1/2 clock at 100khz
}
gpio_set_level(sda_io, 1); // Issue STOP
The history of needing this function traces back to hardware glitches that occur when attaching the GPIO pins to the hardware peripheral. @ESP32DE and I solved these glitches for the Arduino environment proposal to i2c. I don't know the equivalent pin assignment sequence for IDF, as I don't use ESP-IDF directly. In my testing I no longer need to execute this function at every boot. From a quick look through the IDF i2c code, I see a few design ideas I don't support.
I think the multiple
if (activeInt & I2C_TRANS_COMPLETE_INT_ST_M) {
i2cIsrExit(p_i2c,EVENT_DONE,false);
return; // no more work to do inside
From my point of view, ISRs must complete. They are atomic. If an ISR can't complete in a short FINITE timespan, it is coded wrong.
Chuck. |
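For readers who want to try the proposed sequence off-target, here is a minimal sketch of it with the half-clock delays filled in. It follows the order described in the comment above (idle high, START, nine clock pulses, STOP), but the names `bus_pins_t`, `i2c_bus_clear`, `bus_trace_t` and `trace_pins` are hypothetical, and the pin access goes through function pointers rather than the real GPIO API so that the sequence can be checked with a recording mock. The 5 us half-clock figure is taken from the comments in the proposed code.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { PIN_SCL, PIN_SDA } pin_t;

/* Hypothetical pin-access hooks; on target these would wrap the GPIO
 * calls shown in the comment above and a microsecond delay. */
typedef struct {
    void (*set_level)(void *ctx, pin_t pin, int level);
    void (*delay_us)(void *ctx, unsigned us);
    void *ctx;
} bus_pins_t;

#define HALF_CLOCK_US 5u /* half a clock period at 100 kHz */

/* START, nine clock pulses, STOP, with a half-clock delay between edges. */
void i2c_bus_clear(const bus_pins_t *p)
{
    assert(p != NULL);
    p->set_level(p->ctx, PIN_SCL, 1); /* initial condition: SCL high */
    p->set_level(p->ctx, PIN_SDA, 1); /* initial condition: SDA high */
    p->delay_us(p->ctx, HALF_CLOCK_US);
    p->set_level(p->ctx, PIN_SDA, 0); /* START: SDA falls while SCL is high */
    p->delay_us(p->ctx, HALF_CLOCK_US);
    for (int i = 0; i < 9; i++) {
        p->set_level(p->ctx, PIN_SCL, 0);
        p->delay_us(p->ctx, HALF_CLOCK_US);
        p->set_level(p->ctx, PIN_SCL, 1);
        p->delay_us(p->ctx, HALF_CLOCK_US);
    }
    p->set_level(p->ctx, PIN_SDA, 1); /* STOP: SDA rises while SCL is high */
}

/* Recording mock: tracks line levels and counts bus events. */
typedef struct {
    int scl, sda;         /* current line levels */
    int starts, stops;    /* START and STOP conditions seen */
    int scl_falls;        /* falling edges on SCL */
    unsigned elapsed_us;  /* total requested delay */
} bus_trace_t;

static void trace_set_level(void *ctx, pin_t pin, int level)
{
    bus_trace_t *t = ctx;
    if (pin == PIN_SDA) {
        if (t->scl == 1 && t->sda == 1 && level == 0) t->starts++;
        if (t->scl == 1 && t->sda == 0 && level == 1) t->stops++;
        t->sda = level;
    } else {
        if (t->scl == 1 && level == 0) t->scl_falls++;
        t->scl = level;
    }
}

static void trace_delay_us(void *ctx, unsigned us)
{
    bus_trace_t *t = ctx;
    t->elapsed_us += us;
}

/* Bundle the mock hooks for use with i2c_bus_clear(). */
bus_pins_t trace_pins(bus_trace_t *t)
{
    bus_pins_t p = { trace_set_level, trace_delay_us, t };
    return p;
}
```

Starting the mock from an idle bus (both lines high) and running `i2c_bus_clear` should record exactly one START, nine falling SCL edges and one STOP, leaving both lines high. Whether this sequence is sufficient to release a particular stuck slave or the ESP32 FSM is something only testing on real hardware can confirm.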
@DavidAntliff I have reported this issue, and our engineer is working on the problem now. As soon as we have any conclusion, I will reply to you here. |
Is this problem related to https://www.esp32.com/viewtopic.php?f=2&t=2632&p=16910&hilit=juergen#p16910? I saw that the I2C interrupt randomly stopped working and the I2C interrupt monitoring timed out. What I found was that it never occurred when there was no network traffic over WLAN (using REST requests). When I hammer the system with a high frequency of web requests, the I2C fails much more often and the I2C timeout stops. The I2C fails when there is WebSocket activity. The more activity there is, the more often it fails, and it does not fail without WebSocket activity. Juergen |
A software reset does not reset the peripherals, so when the I2C state machine has a serious error, the system will restart repeatedly until there is a full system reset. I found that when the communication environment is relatively poor, the ACK will be wrong, which generates an interrupt. We use a while loop to handle all interrupts; however, for some reason, the ACK error caused this loop to execute more than 700 times in ISR context, so the system crashed. We are looking for a solution to this problem and will report any progress here. Thanks! |
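One common defensive pattern for the situation described here is to bound the number of passes an ISR makes over the status register and mask the interrupt source if the status never drains. The sketch below shows that pattern only; it is not the change Espressif made, and `irq_hooks_t`, `bounded_isr`, `MAX_ISR_PASSES`, `SIM_ACK_ERR` and the `sim_*` helpers are all hypothetical names. The simulated status register re-raises its error flag a configurable number of times after being cleared, which mimics the repeated ACK error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ISR_PASSES 8        /* illustrative bound, not an ESP-IDF value */
#define SIM_ACK_ERR    (1u << 0) /* hypothetical status bit */

/* Hypothetical register-access hooks. */
typedef struct {
    uint32_t (*read_status)(void *ctx);
    void (*clear)(void *ctx, uint32_t bits);
    void (*disable_all)(void *ctx);
    void *ctx;
} irq_hooks_t;

/* Handle pending interrupt bits, but never loop more than MAX_ISR_PASSES
 * times. Returns true if the status drained, false if the handler gave up
 * and masked the interrupt source so a task can recover outside the ISR. */
bool bounded_isr(const irq_hooks_t *hw, int *passes_out)
{
    assert(hw != NULL);
    int passes = 0;
    uint32_t status;
    while ((status = hw->read_status(hw->ctx)) != 0 && passes < MAX_ISR_PASSES) {
        hw->clear(hw->ctx, status); /* real code would also dispatch events here */
        passes++;
    }
    if (passes_out != NULL) {
        *passes_out = passes;
    }
    if (status != 0) {
        hw->disable_all(hw->ctx);
        return false;
    }
    return true;
}

/* Simulated peripheral (assumption): the error flag comes back
 * `reassert_left` more times after it has been cleared. */
typedef struct {
    uint32_t status;
    int reassert_left;
    bool masked;
} sim_irq_t;

static uint32_t sim_read_status(void *ctx)
{
    sim_irq_t *s = ctx;
    return s->status;
}

static void sim_clear(void *ctx, uint32_t bits)
{
    sim_irq_t *s = ctx;
    s->status &= ~bits;
    if (s->reassert_left > 0) {
        s->reassert_left--;
        s->status |= SIM_ACK_ERR;
    }
}

static void sim_disable_all(void *ctx)
{
    sim_irq_t *s = ctx;
    s->masked = true;
}

/* Bundle the simulated hooks for use with bounded_isr(). */
irq_hooks_t sim_irq_hooks(sim_irq_t *s)
{
    irq_hooks_t hw = { sim_read_status, sim_clear, sim_disable_all, s };
    return hw;
}
```

With a flag that returns twice, the handler drains it in three passes and leaves the interrupt enabled. With a flag that keeps returning (for example 700 times), the handler stops after `MAX_ISR_PASSES` passes and masks the source instead of spinning until the watchdog fires. The right bound, and what the recovery task should then do, depend on the real hardware behaviour.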
@koobest I just found a way to reset the I2C peripherals that seems to do a power on reset. It clears out all registers, resets the Bus_busy flags, initializes the hardware state back to a Power On condition. Chuck. |
@koobest if you can work out why the ISR loop runs 700 times (and presumably neglects kicking the watchdog resulting in the crash) then that would be very useful, as I believe that is the main issue here. If you call |
We are trying to enable |
Just wanted to let you know that there is a new commit in master from today that seems to fix this issue. I have not tested it yet. :) |
@luisonoff thank you for the alert! I have tested ESP-IDF commit 391c3ff with my reproduction project on one of my "DOIT" boards and I can happily report that I am unable to reproduce my issue by rubbing SDA and SCL together rapidly. I spent maybe 4 minutes rapidly mechanically shorting them and did not see a single crash. Then I reverted back to 2e7613b (just prior to the "fix") and verified that the issue can be reproduced. In fact it was extremely easy to reproduce it, many times per minute. So I can conclude from this that merge 892f390 appears to resolve the issue, for me at least, on this board. I'll try it on my Wemos LoLin32 Lite next, and report back if the results are any different. EDIT: looks good on the LoLin32 also - no crashes seen. Good work Espressif! Thank you. |
I can also confirm that this fix works for me. My ESP32 does not crash anymore even though there is noise on the I2C line. |
Another successful story. I was using an SSD1306 via i2c combined with a UART device and it used to fail. Updated to the latest version that includes this fix and now things just work!!! :D Thanks for this great fix. My hobby project can now continue!! |
Hi DavidAntliff , |
@psatya111 the patch is for ESP-IDF not Arduino - you can't apply it, it's different code. You might be able to find help in the esp32 Arduino forum. |
I have a project that has three I2C slave devices on a single bus (running at 100kHz). For some time I was developing with ESP-IDF 2.1.1 and everything was working pretty well, except for a weird problem where the I2C master would freeze up after a few minutes. I did some research and it looks like this is a problem with the I2C master hardware state machine which has been addressed in more recent commits of ESP-IDF. So to make use of this fix I migrated my project to use master (595688a). I had to make a few changes (remove references to FreeRTOS heap measurement commands, add nvs_flash_init() before initialising WiFi) but then everything seemed to work well. The slave devices are all being polled correctly and everything seems happy. The project is here: https://github.com/DavidAntliff/esp32-poolmon/tree/ESP-IDF_master
I came back a little while later and the application is crashing over and over with the following console output shortly after boot:
A software or on-board reset does not stop this endless reset behaviour, however removing power for a short period of time does "fix" the issue. It is strange that a brief ESP32 reset does not clear it. (EDIT: but a long reset press does).