Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

systemd-watchdog: improve notifications to watchdog #595

Merged

Conversation

alexmohr
Copy link
Contributor

@alexmohr alexmohr commented Feb 2, 2024

systemd will now be notified on send timeouts
and when event handling is running for a long time

this prevents dlt-daemon from being killed by
systemd watchdog when processing log messages
would take longer than the watchdog timeout

To test we need the following things

  • dlt-daemon started via systemd with enabled watchdog
  • a lot of logging going into the daemon

Excecute the script below on your test system.
It spawns a lot of dlt-receive proceses, the goal is for dlt-daemon to survive.

end_time=$((SECONDS+180))

while [ $SECONDS -lt $end_time ]; do 
  dlt-receive -a 127.1 &
done 

killall -9 dlt-receive

if [[ $(uname) == "Linux" ]]; then
  systemctl status dlt.service
  echo Service should be running for > 3 minutes
else
  pidin -faAt | grep dlt-daemon | grep -v grep
  echo Service should be running at least since system boot
fi

The program was tested solely for our own use cases, which might differ from yours.
Licensed under Mozilla Public License Version 2.0

Alexander Mohr, alexander.m.mohr@mercedes-benz.com, Mercedes-Benz Tech Innovation GmbH, imprint

system will now be notified on send timeouts
and when event handling is running for a long time

Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>
@michael-methner
Copy link
Collaborator

Hello @alexmohr ,
thanks for the patch. It looks good but I haven't touched the systemd_notify code yet.

One thing which I wonder about is that the watchdog is triggered at several places instead from one central place. E.g. in case the poll in dlt_daemon_handle_event() returns. Would this make more sense?

@alexmohr
Copy link
Contributor Author

alexmohr commented Feb 5, 2024

Hi @michael-methner,

it's triggered from multiple places to prevent watchdog timeouts when we run into socket timeouts.
Especially if many (or multiple slow) clients are connected it can happen that the dlt-daemon is running into multiple socket timeouts back to back.
If we would not trigger the systemd watchdog here we might run into a systemd watchdog timeout although the daemon is still working properly.
Triggering from the event handler alone is therefore not sufficient because the watchdog timeout could be over when we are back in the event handler function

* the watchdog interval is small and multiple timeouts occur back to back
*/
if (sd_notify(0, "WATCHDOG=1") < 0)
dlt_vlog(LOG_WARNING, "%s: Could not reset systemd watchdog\n", __func__);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why dlt_daemon_trigger_systemd_watchdog_if_necessary() was not used here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because there is no timestamp to use as a reference time. We could call the function and always pass 0 as timestamp.
As you mentioned in the other comment this patch improves the watchdog behavior but it's far from perfect.
Especially in environment with congested networks between the clients and the daemon combined with heavy logging load.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks.

@michael-methner
Copy link
Collaborator

Hi @michael-methner,

it's triggered from multiple places to prevent watchdog timeouts when we run into socket timeouts. Especially if many (or multiple slow) clients are connected it can happen that the dlt-daemon is running into multiple socket timeouts back to back. If we would not trigger the systemd watchdog here we might run into a systemd watchdog timeout although the daemon is still working properly. Triggering from the event handler alone is therefore not sufficient because the watchdog timeout could be over when we are back in the event handler function

Thanks for the explanation. As the change is definitely an improvement, I will merge it. But that slow clients may still cause watchdog timeouts worries me. But I understand that this will be part of a bigger rework.

@michael-methner michael-methner merged commit 48a867c into COVESA:master Feb 12, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants