Watchdog triggering #94

Closed
jordens opened this issue Oct 13, 2020 · 15 comments · Fixed by #114

@jordens
Member

jordens commented Oct 13, 2020

Before the speed-up from #92 (i.e. in v0.1), an 8-channel device shows spurious resets, presumably triggered by the watchdog. My assumption is that the speed-up addresses this, but we should keep an eye on it.

To triage, reproduce it with v0.1 and debug it there; #92 will likely mask it.

@hartytp

hartytp commented Dec 17, 2020

If a fix for this won't be forthcoming in the near term, it's probably worth recommending release builds in the docs.

@jordens
Member Author

jordens commented Dec 17, 2020

I think the easiest is to cut a new release. I haven't seen it with the current code.
@ryan-summers I guess you gave it a spin as well on your device?

There is typically no way to actually prove that a watchdog doesn't trigger. You can only reduce the rate heuristically.
I would also agree that we should recommend release builds to users (and debug to developers).

jordens mentioned this issue Jan 4, 2021
jordens added the bug (Something isn't working) label Jan 4, 2021
jordens added this to the 0.2 milestone Jan 4, 2021
@ryan-summers
Member

I believe an issue here is that the W5500 driver may block for up to 1.8 seconds by default when trying to connect a TCP socket to the MQTT broker (set by the W5500 ARP timeout: 9 retries at 200 ms per retry). When the broker is not available on the network, each idle processing loop can therefore last 1.8 seconds, so the idle task services the watchdog only once every 1.8 seconds.

Upon boot, all tasks are scheduled to run simultaneously. It's possible that tasks align in such a way that servicing higher-priority tasks adds a further 200 ms of delay, which could increase the watchdog service interval beyond the configured 2-second timeout.

A reasonable solution here is therefore to increase the watchdog timeout from the existing 2 second interval to 4 seconds, which will give us much more time in case tasks stack up suboptimally.
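
As a rough check of the timing argument above, the sketch below recomputes the worst-case interval between watchdog services and the margin left under a 2-second and a 4-second timeout. The function names are hypothetical and are not taken from the Booster firmware or the W5500 driver; the 9 retries, 200 ms per retry, and 200 ms of extra task delay are the figures quoted above.

```rust
use std::time::Duration;

/// Worst-case time a blocking connect attempt can take:
/// number of retries times the per-retry timeout.
/// (Hypothetical helper, for illustration only.)
fn worst_case_block(retries: u32, retry_timeout: Duration) -> Duration {
    retry_timeout * retries
}

/// Margin left before the watchdog expires, given the blocking time and any
/// extra delay from higher-priority tasks. Returns `None` when there is no
/// margin left, i.e. the watchdog may fire.
/// (Hypothetical helper, for illustration only.)
fn watchdog_margin(watchdog: Duration, block: Duration, extra: Duration) -> Option<Duration> {
    watchdog
        .checked_sub(block + extra)
        .filter(|margin| !margin.is_zero())
}
```

With the figures above, `worst_case_block(9, Duration::from_millis(200))` is 1.8 s. Adding 200 ms of task delay leaves no margin under a 2-second timeout (`None`) and 2 s of margin under a 4-second timeout.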

@hartytp

hartytp commented Jan 4, 2021

> I believe an issue here is that the W5500 driver may block for up to 1.8 seconds by default when trying to connect a TCP socket to the MQTT broker (set by the W5500 ARP timeout: 9 retries at 200 ms per retry). When the broker is not available on the network, each idle processing loop can therefore last 1.8 seconds, so the idle task services the watchdog only once every 1.8 seconds.
>
> Upon boot, all tasks are scheduled to run simultaneously. It's possible that tasks align in such a way that servicing higher-priority tasks adds a further 200 ms of delay, which could increase the watchdog service interval beyond the configured 2-second timeout.

That would only cause the watchdog to trigger at boot, right?

I see it tripping randomly long after the MQTT connection has been established. I also at least once saw (what I assume was a watchdog) triggering when I pressed a FP button.

@ryan-summers
Member

ryan-summers commented Jan 4, 2021

> I see it tripping randomly long after the MQTT connection has been established. I also at least once saw (what I assume was a watchdog) triggering when I pressed a FP button.

@hartytp Would you mind confirming that these were observed with the latest develop firmware and not release v0.1.0? Additionally, would you mind clarifying roughly how long it was between events (e.g. a few minutes, hours, or days)? I'm working from home today, but I'll be at my office tomorrow and can test with the Booster at my desk to try to reproduce this and figure out the root cause.

@hartytp

hartytp commented Jan 4, 2021

> @hartytp Would you mind confirming that these were observed with the latest develop firmware and not release v0.1.0?

Yes, I plan to look at that shortly.

> Additionally, would you mind clarifying roughly how long it was between events (e.g. a few minutes, hours, or days)?

I don't have great statistics on it. Felt like somewhere between tens of minutes and a few hours.

@ryan-summers
Member

> I don't have great statistics on it. Felt like somewhere between tens of minutes and a few hours.

That should be good enough to get me started on it tomorrow - thanks for the info.

@hartytp

hartytp commented Jan 4, 2021

testing with 5d08a80 built in release mode.

@hartytp

hartytp commented Jan 5, 2021

Using 5d08a80 built in release mode I just got Booster to restart (presumably a watchdog timeout) by plugging in the network cable.

@hartytp

hartytp commented Jan 5, 2021

aah, that's not surprising since I was working off an old commit. I hadn't noticed that everything is in develop (why not master?)

@hartytp

hartytp commented Jan 5, 2021

after switching to the newest firmware, I don't believe I've seen any issues with this. However, I haven't tried plugging / unplugging ethernet, pressing buttons / etc. I also haven't looked that carefully (or run the firmware for that long) so could have missed something. But it's certainly a lot better than it was!

@ryan-summers
Member

I'm about to merge a PR to resolve this. If you see any watchdog resets with the latest firmware, please do open a new issue. Any watchdog reset is unexpected and indicates a bug.

@hartytp

hartytp commented Jan 6, 2021

@jordens / @ryan-summers can you change the issue permissions so anyone can close/reopen issues?

@hartytp

hartytp commented Jan 6, 2021

@ryan-summers with the latest firmware built in release mode, I cannot enable Booster. When I press the interlock reset button, I get green + yellow LEDs for a fraction of a second, then a pulse of the fans followed by no LEDs.

@hartytp

hartytp commented Jan 6, 2021

this was building from 0a1c340
