
The linux runner does not gracefully shutdown on SIGINT #2582

Open
Felixoid opened this issue May 4, 2023 · 17 comments
Labels
bug Something isn't working

Comments

@Felixoid

Felixoid commented May 4, 2023

Describe the bug
According to #2190 (comment), the runner should wait until the job is finished and then stop. But that doesn't happen, and new tasks keep being assigned to the runner with the old PID.

To Reproduce
Steps to reproduce the behavior:

  1. Go to a runner with a running job and execute pgrep -af Runner.Listener
  2. Remember the listener's PID and execute pkill -INT Runner.Listener
  3. Wait until the current task finishes; another one is then started on the same runner.
  4. The output of pgrep -af Runner.Listener does not change; the listener is still there with the same PID.
  5. pkill -INT run.sh does not work either. (A condensed sketch of these steps follows below.)
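
For convenience, the steps above can be condensed into a small script (a rough sketch; the sleep interval is an assumption and should be adjusted to your typical job duration):

#!/usr/bin/env bash
# Reproduction sketch: ask a busy runner to stop via SIGINT and check whether
# the listener process survives and keeps picking up jobs.
set -euo pipefail

echo "Listener before SIGINT:"
pgrep -af Runner.Listener

# Request a "graceful" shutdown of the listener
pkill -INT Runner.Listener

# Give the current job time to finish; adjust to your job duration
sleep 60

echo "Listener after SIGINT (an unchanged PID means the signal was ignored):"
pgrep -af Runner.Listener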

Expected behavior
The runner should shut down gracefully. If it does not do so on SIGINT, there must be another way to stop it right after the job is finished, before a new one is assigned.

Runner Version and Platform

Linux 2.298.2 amd64


@Felixoid
Author

Felixoid commented May 4, 2023

SIGTERM doesn't work either. It instantly fails the running job.

[screenshot]

@bryanmacfarlane
Member

Are you running the runner interactively or as a service?

@Felixoid
Author

Felixoid commented May 4, 2023

It's launched as run.sh from cloud-init, not as a service. I wonder how that would affect the result? According to the service template, it sends SIGTERM to the process.

update: it looks like svc.sh actually runs a completely different script, bin/runsvc.sh, which uses the wrapper bin/RunnerService.js

But under the hood, it sends SIGINT to Runner.Listener, which is the same thing as using pkill -INT Runner.Listener. And that doesn't work for us: the listener keeps accepting more and more tasks, disregarding the received signal.

update 2: ok, I see how it would make things even worse. After 30 seconds the service sends SIGKILL to Runner.Listener: https://github.com/actions/runner/blob/main/src/Misc/layoutbin/RunnerService.js#L38

I am not sure 30 seconds is ever enough to finish anything. Our jobs last 30+ minutes on average, and sometimes run for hours. The function's name, gracefulShutdown, does not reflect the actual behavior, IMHO.
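
For clarity, the wrapper's shutdown behaviour described above boils down to roughly the following (a shell sketch based on the description, not the actual RunnerService.js source):

# What the service wrapper effectively does on shutdown: ask the listener to
# stop, then kill it hard after a fixed 30-second grace period, regardless of
# whether a job is still running.
LISTENER_PID=$(pgrep -f Runner.Listener | head -n1)

kill -INT "$LISTENER_PID"        # "graceful" stop request
sleep 30                         # fixed grace period in the wrapper
if kill -0 "$LISTENER_PID" 2>/dev/null; then
  kill -KILL "$LISTENER_PID"     # the running job is killed along with it
fi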

@Felixoid
Author

Felixoid commented May 5, 2023

/home/ubuntu/actions-runner/bin/Runner.Listener run

√ Connected to GitHub

Current runner version: '2.304.0'
2023-05-05 13:11:14Z: Listening for Jobs
2023-05-05 13:11:34Z: Running job: CheckLabels (1, c)
^CExiting...
2023-05-05 13:11:41Z: Job CheckLabels (1, c) completed with result: Canceled

That's what I have when running Runner.Listener directly

./bin/runsvc.sh
.path=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
Starting Runner listener with startup type: service
Started listener process, pid: 10437
Started running service

√ Connected to GitHub

Current runner version: '2.304.0'
2023-05-05 13:12:58Z: Listening for Jobs
2023-05-05 13:13:16Z: Running job: CheckLabels (1, a)
^CShutting down runner listener
Sending SIGINT to runner listener to stop
Sending SIGKILL to runner listener
Exiting...
Exiting...
Shutting down runner listener
Sending SIGINT to runner listener to stop
Sending SIGKILL to runner listener
Exiting...
2023-05-05 13:13:19Z: Job CheckLabels (1, a) completed with result: Canceled
Runner listener exited with error code 0
Runner listener exit with 0 return code, stop the service, no retry needed.

And the latter is for bin/runsvc.sh. The code in https://github.com/actions/runner/blob/main/src/Runner.Listener/Runner.cs#L310 clearly does not process ^C gracefully; it cancels the current job.

[screenshot]

@mochja

mochja commented May 10, 2023

@Felixoid you should be able to get around this with ephemeral runners as the service will just exit after processing the job.
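
For reference, an ephemeral runner is registered by passing --ephemeral to config.sh (a sketch; the URL, token, labels, and name below are placeholders):

# Register a one-shot runner: it processes a single job, deregisters itself,
# and the listener exits.
./config.sh --url "$RUNNER_URL" --token "$RUNNER_TOKEN" --ephemeral \
            --labels "$LABELS" --name "$RUNNER_NAME"
./run.sh   # returns after one job; the surrounding script decides what happens next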

@Felixoid
Author

Felixoid commented May 10, 2023

@mochja thank you for the hint, but ephemeral runners do not work for us at all.

If we are talking about a fresh instance per runner, then we obviously lose a lot of cross-run caches, e.g., Docker images and volumes on the host. It's a huge overhead; we launch around 10k jobs per day.

If we are talking about an ephemeral runner on the same host, re-registered after each job, then it doesn't solve our issue either. The termination lambda proposes the runner for termination, but then the runner process is relaunched, and another freshly assigned job fails.

We desperately need the bug with signal processing to be fixed.

@mochja

mochja commented May 10, 2023

@Felixoid you have control over whether you relaunch the runner process or not, at least for an ephemeral runner.

Termination lambda proposes the runner to be terminated

then you can either wait for the runner to stop, not start a new one, and terminate; or, if the runner is not running a job, just kill it and terminate.
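
A sketch of that idea, assuming a hypothetical instance-should-terminate check (a placeholder for whatever queries your termination lambda or the instance metadata):

#!/usr/bin/env bash
# Relaunch loop for an ephemeral runner: after each job, decide whether to
# register and run again, or stop so the instance can terminate cleanly.
set -euo pipefail

while true; do
  if instance-should-terminate; then       # placeholder for your own check
    echo "Termination requested, not starting a new runner"
    break
  fi
  ./config.sh --url "$RUNNER_URL" --token "$RUNNER_TOKEN" --ephemeral \
              --labels "$LABELS" --name "$RUNNER_NAME"
  ./run.sh   # blocks until exactly one job has been processed
done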

@Felixoid
Author

Felixoid commented May 10, 2023

Once again, there is no guaranteed way to avoid a race there when restarting an ephemeral runner.

  • the runner restarts, which is not instant: at the moment it deletes itself via the GH API, then it needs to be registered again, and only after that does it receive jobs
  • the termination lambda sees this and proposes the instance for termination

Both happen at the same moment. It's unavoidable.

update

Or, let's put it like this: there is no simple way to tell from outside the host, within a timeframe of 2 seconds, whether it is restarting the runner or the runner died for some reason. The termination-policy lambda is itself strictly limited to a 2 s timeout.

Sure, it could be a complex system considering many signals, but that is arguing about whether there is a way to work around the bug. Sure, maybe there are plenty of workarounds.

I'd be happy to see this fixed, because before reporting it I spent months trying to find a way to decommission runners safely on scale-in. That's where I am.

@mochja

mochja commented May 10, 2023

What is holding you back from waiting until the job is processed on the runner? Do you have to terminate right away?

Jobs are received only when the runner gets online; when you register it, jobs are not yet being scheduled.

@Felixoid
Author

However, thank you for your attempt to help. Let's not spam the ticket about the signal-processing bug with a conversation about a theoretically possible workaround.

@Felixoid
Author

Felixoid commented Jun 7, 2023

I'll repeat here what I've posted to the ticket with GH support.

I've done some tests to determine whether ephemeral runners would work for us.

The outcome is devastating; I can't put it any other way.

So, the loop that I've tested is here

The instances check a lot of side information to decide whether they can shut down, including the external ASG status and rebalance signals.

The instance received the signal that it should tear down because of a rebalance recommendation, and it stepped back.

[screenshot]

But GitHub reports that another job, FunctionalStatelessTestS3Debug5, was started later and failed: https://api.github.com/repos/ClickHouse/ClickHouse/actions/jobs/14071870401

Does it mean that even ephemeral mode does not guarantee a graceful shutdown?

To unfold the situation:

# The instance finished a runner process, configured as
sudo -u ubuntu ./config.sh --url $RUNNER_URL --token "$RUNNER_TOKEN" --ephemeral --runnergroup Default --labels "$LABELS" --work _work --name "$INSTANCE_ID"
# then ran it as
sudo -u ubuntu \
          ACTIONS_RUNNER_HOOK_JOB_STARTED=/tmp/actions-hooks/pre-run.sh \
          ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/tmp/actions-hooks/post-run.sh \
          ./run.sh &
# Another loop checked that the process was still alive. The process finished with these lines in the logs:
#	2023-06-07T14:00:47.931+02:00	2023-06-07 12:00:47Z: Job FunctionalStatelessTestMsan5 completed with result: Succeeded
#	2023-06-07T14:00:47.931+02:00	√ Removed .credentials
#	2023-06-07T14:00:47.931+02:00	√ Removed .runner
#	2023-06-07T14:00:47.931+02:00	Runner listener exit with 0 return code, stop the service, no retry needed.
#	2023-06-07T14:00:48.182+02:00	Exiting runner...
#	2023-06-07T14:00:48.182+02:00	Got runner pid 
# the script then returned to the point where it would configure and run the runner again. But first it checked the conditions for stepping down
echo "Checking if the instance suppose to terminate"
check-terminating-metadata
#	2023-06-07T14:00:48.182+02:00	Checking if the instance suppose to terminate
#	2023-06-07T14:00:48.182+02:00	{"noticeTime":"2023-06-07T11:28:00Z"}The runner received rebalance recommendation, we are terminating
# And here it shutdown
#	2023-06-07T14:00:49.183+02:00	Going to terminate the runner's instance
#	2023-06-07T14:00:49.183+02:00	{ "TerminatingInstances": [ { "CurrentState": { "Code": 32, "Name": "shutting-down" }, "InstanceId": "i-048a1....

So:

  1. No new processes were launched
  2. GitHub Actions reports a failure

@asos-tommycouzens

asos-tommycouzens commented Jun 15, 2023

Would also love this!

We are in the process of setting up self-hosted runners and would like a safe way to scale down runners without causing running jobs to fail. Without the SIGTERM functionality requested in this issue, this is not possible unless we build a rather complex and fragile orchestrator.

We similarly do not use ephemeral runners because we want to make use of Docker caching. The availability of Docker caching was the primary reason we chose GitHub Actions over Azure DevOps.

@Felixoid
Author

Felixoid commented Jun 19, 2023

The situation with ephemeral runners is a bit better than with normal ones. See the discussion in ClickHouse/ClickHouse#49283. It's not necessary to shut down the host; the process can be restarted.

Unfortunately, it still doesn't guarantee that you can tear down the runner process when no jobs have been assigned to it for a while. Imagine we have a pool of 30 runners, and only 24 of them have running jobs. After some period, 60 seconds in our case, we shut down each of the ones that still have nothing to do. And at that very moment, GH reports that a job was assigned to one of these poor runners.

It looks like GH pushes jobs to runners, rather than a runner pulling a job by connecting to the API. If so, no matter what we do, there will be killed jobs.

Everything described above is my own conclusion. It's based on a long time spent playing left and right with different schemes to get runners working reliably, and desperately failing again and again.

@ergonab

ergonab commented Feb 20, 2024

Any updates on this?

@pauldraper

pauldraper commented Jun 28, 2024

A workaround is to remove the runner through the GitHub API.

If the runner is busy, the call fails. If it is not busy, the runner is removed, and the process exits.
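
For example, something along these lines (a sketch using the self-hosted runners REST API; the org, runner id, and token are placeholders, and the exact error returned for a busy runner may differ):

# Try to remove the runner via the API: the call is rejected while the runner
# is busy, and once it succeeds the listener process exits on its own.
curl -X DELETE \
  -H "Authorization: Bearer $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github+json" \
  "https://api.github.com/orgs/$ORG/actions/runners/$RUNNER_ID"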

@ergonab

ergonab commented Jun 28, 2024

@pauldraper Yes, that is what we are doing at the moment, but it's a hack. In order for a machine to be able to gracefully reboot, it needs to have API credentials stored on it. Or, alternatively, you need to set up some kind of privileged operator outside of the machine that receives the "wish" from the runner to be removed.

This is in contrast to GitLab's runner that simply stops accepting jobs from the manager if it wants to stop operation. No failed or stuck builds, if you want to reboot a machine in a fleet of runners.

Doesn't the GitHub Actions Kubernetes operator handle this gracefully as well? Why can't the stand-alone runner do the same?

@blackliner

blackliner commented Dec 20, 2024

Still no way to gracefully shut down a GitHub runner that is executing a job?

Would config.sh remove be a workaround? At least ChatGPT says it would deregister the runner while continuing with any running jobs.

EDIT: Unfortunately not

./config.sh remove --token $GHA_REGISTRATION_TOKEN

# Runner removal

Failed: Removing runner from the server
Runner "i-06ae7a6627a184307" is still running a job"

https://github.com/orgs/community/discussions/102641 mentions a graceful shutdown using the service 🤔
