systemd-watchdog: improve notifications to watchdog #595

alexmohr · 2024-02-02T10:34:00Z

systemd will now be notified on send timeouts
and when event handling is running for a long time

this prevents dlt-daemon from being killed by
systemd watchdog when processing log messages
would take longer than the watchdog timeout

To test we need the following things

dlt-daemon started via systemd with enabled watchdog
a lot of logging going into the daemon

Excecute the script below on your test system.
It spawns a lot of dlt-receive proceses, the goal is for dlt-daemon to survive.

end_time=$((SECONDS+180))

while [ $SECONDS -lt $end_time ]; do 
  dlt-receive -a 127.1 &
done 

killall -9 dlt-receive

if [[ $(uname) == "Linux" ]]; then
  systemctl status dlt.service
  echo Service should be running for > 3 minutes
else
  pidin -faAt | grep dlt-daemon | grep -v grep
  echo Service should be running at least since system boot
fi

The program was tested solely for our own use cases, which might differ from yours.
Licensed under Mozilla Public License Version 2.0

^{Alexander Mohr, alexander.m.mohr@mercedes-benz.com, Mercedes-Benz Tech Innovation GmbH, imprint}

system will now be notified on send timeouts and when event handling is running for a long time Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>

michael-methner · 2024-02-05T14:01:04Z

Hello @alexmohr ,
thanks for the patch. It looks good but I haven't touched the systemd_notify code yet.

One thing which I wonder about is that the watchdog is triggered at several places instead from one central place. E.g. in case the poll in dlt_daemon_handle_event() returns. Would this make more sense?

alexmohr · 2024-02-05T15:36:22Z

Hi @michael-methner,

it's triggered from multiple places to prevent watchdog timeouts when we run into socket timeouts.
Especially if many (or multiple slow) clients are connected it can happen that the dlt-daemon is running into multiple socket timeouts back to back.
If we would not trigger the systemd watchdog here we might run into a systemd watchdog timeout although the daemon is still working properly.
Triggering from the event handler alone is therefore not sufficient because the watchdog timeout could be over when we are back in the event handler function

michael-methner · 2024-02-08T14:37:38Z

src/daemon/dlt_daemon_socket.c

+             * the watchdog interval is small and multiple timeouts occur back to back
+             */
+            if (sd_notify(0, "WATCHDOG=1") < 0)
+                dlt_vlog(LOG_WARNING, "%s: Could not reset systemd watchdog\n", __func__);


Is there a reason why dlt_daemon_trigger_systemd_watchdog_if_necessary() was not used here?

Because there is no timestamp to use as a reference time. We could call the function and always pass 0 as timestamp.
As you mentioned in the other comment this patch improves the watchdog behavior but it's far from perfect.
Especially in environment with congested networks between the clients and the daemon combined with heavy logging load.

Got it. Thanks.

michael-methner · 2024-02-08T14:39:42Z

Hi @michael-methner,

it's triggered from multiple places to prevent watchdog timeouts when we run into socket timeouts. Especially if many (or multiple slow) clients are connected it can happen that the dlt-daemon is running into multiple socket timeouts back to back. If we would not trigger the systemd watchdog here we might run into a systemd watchdog timeout although the daemon is still working properly. Triggering from the event handler alone is therefore not sufficient because the watchdog timeout could be over when we are back in the event handler function

Thanks for the explanation. As the change is definitely an improvement, I will merge it. But that slow clients may still cause watchdog timeouts worries me. But I understand that this will be part of a bigger rework.

systemd-watchdog: improve notifications to watchdog

e66405b

system will now be notified on send timeouts and when event handling is running for a long time Signed-off-by: Alexander Mohr <alexander.m.mohr@mercedes-benz.com>

minminlittleshrimp self-assigned this Feb 5, 2024

minminlittleshrimp added the refactor label Feb 5, 2024

michael-methner reviewed Feb 8, 2024

View reviewed changes

michael-methner approved these changes Feb 12, 2024

View reviewed changes

michael-methner merged commit 48a867c into COVESA:master Feb 12, 2024
3 checks passed

alexmohr mentioned this pull request Apr 9, 2024

watchdog: fix another possible watchdog timeout #619

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

systemd-watchdog: improve notifications to watchdog #595

systemd-watchdog: improve notifications to watchdog #595

alexmohr commented Feb 2, 2024

michael-methner commented Feb 5, 2024

alexmohr commented Feb 5, 2024

michael-methner Feb 8, 2024

alexmohr Feb 8, 2024

michael-methner Feb 12, 2024

michael-methner commented Feb 8, 2024

systemd-watchdog: improve notifications to watchdog #595

systemd-watchdog: improve notifications to watchdog #595

Conversation

alexmohr commented Feb 2, 2024

michael-methner commented Feb 5, 2024

alexmohr commented Feb 5, 2024

michael-methner Feb 8, 2024

Choose a reason for hiding this comment

alexmohr Feb 8, 2024

Choose a reason for hiding this comment

michael-methner Feb 12, 2024

Choose a reason for hiding this comment

michael-methner commented Feb 8, 2024