
The linux runner does not gracefully shutdown on SIGINT #2582

Open
Felixoid opened this issue May 4, 2023 · 17 comments
Labels
bug Something isn't working

Comments

@Felixoid

Felixoid commented May 4, 2023

Describe the bug
According to #2190 (comment), the runner should wait until the job is finished and then stop. But that doesn't happen, and new tasks keep being assigned to the runner with the old PID.

To Reproduce
Steps to reproduce the behavior:

  1. Go to a runner with a running job and execute pgrep -af Runner.Listener
  2. Remember the listener's PID and execute pkill -INT Runner.Listener
  3. Wait until the current task finishes; another one is then started on the same runner.
  4. The output of pgrep -af Runner.Listener does not change; the listener is still there with the same PID.
  5. pkill -INT run.sh does not work either. (A condensed sketch of these steps follows below.)
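
For convenience, the steps above can be condensed into a small script (a rough sketch; the sleep interval is an assumption and should be adjusted to your typical job duration):

#!/usr/bin/env bash
# Reproduction sketch: ask a busy runner to stop via SIGINT and check whether
# the listener process survives and keeps picking up jobs.
set -euo pipefail

echo "Listener before SIGINT:"
pgrep -af Runner.Listener

# Request a "graceful" shutdown of the listener
pkill -INT Runner.Listener

# Give the current job time to finish; adjust to your job duration
sleep 60

echo "Listener after SIGINT (an unchanged PID means the signal was ignored):"
pgrep -af Runner.Listener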

Expected behavior
The runner should shut down gracefully. If it does not do so on SIGINT, there must be another way to stop it right after the job is finished, before a new one is assigned.

Runner Version and Platform

Linux 2.298.2 amd64


@Felixoid
Author

Felixoid commented May 4, 2023

SIGTERM doesn't work either. It instantly fails the running job.

[screenshot]

@bryanmacfarlane
Member

Are you running the runner interactively or as a service?

@Felixoid
Author

Felixoid commented May 4, 2023

It's launched as run.sh from cloud-init, not as a service. I wonder how that would affect the result? According to the service template, it sends SIGTERM to the process.

update: it looks like svc.sh actually runs a completely different script, bin/runsvc.sh, which uses the wrapper bin/RunnerService.js

But under the hood, it sends SIGINT to Runner.Listener, which is the same thing as using pkill -INT Runner.Listener. And that doesn't work for us: the listener keeps accepting more and more tasks, disregarding the received signal.

update 2: ok, I see how it would make things even worse. After 30 seconds the service sends SIGKILL to Runner.Listener: https://github.com/actions/runner/blob/main/src/Misc/layoutbin/RunnerService.js#L38

I am not sure 30 seconds is ever enough to finish anything. Our jobs last 30+ minutes on average, and sometimes run for hours. The function's name, gracefulShutdown, does not reflect the actual behavior, IMHO.
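
For clarity, the wrapper's shutdown behaviour described above boils down to roughly the following (a shell sketch based on the description, not the actual RunnerService.js source):

# What the service wrapper effectively does on shutdown: ask the listener to
# stop, then kill it hard after a fixed 30-second grace period, regardless of
# whether a job is still running.
LISTENER_PID=$(pgrep -f Runner.Listener | head -n1)

kill -INT "$LISTENER_PID"        # "graceful" stop request
sleep 30                         # fixed grace period in the wrapper
if kill -0 "$LISTENER_PID" 2>/dev/null; then
  kill -KILL "$LISTENER_PID"     # the running job is killed along with it
fi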

@Felixoid
Author

Felixoid commented May 5, 2023

/home/ubuntu/actions-runner/bin/Runner.Listener run

√ Connected to GitHub

Current runner version: '2.304.0'
2023-05-05 13:11:14Z: Listening for Jobs
2023-05-05 13:11:34Z: Running job: CheckLabels (1, c)
^CExiting...
2023-05-05 13:11:41Z: Job CheckLabels (1, c) completed with result: Canceled

That's what I have when running Runner.Listener directly

./bin/runsvc.sh
.path=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
Starting Runner listener with startup type: service
Started listener process, pid: 10437
Started running service

√ Connected to GitHub

Current runner version: '2.304.0'
2023-05-05 13:12:58Z: Listening for Jobs
2023-05-05 13:13:16Z: Running job: CheckLabels (1, a)
^CShutting down runner listener
Sending SIGINT to runner listener to stop
Sending SIGKILL to runner listener
Exiting...
Exiting...
Shutting down runner listener
Sending SIGINT to runner listener to stop
Sending SIGKILL to runner listener
Exiting...
2023-05-05 13:13:19Z: Job CheckLabels (1, a) completed with result: Canceled
Runner listener exited with error code 0
Runner listener exit with 0 return code, stop the service, no retry needed.

And the latter is for bin/runsvc.sh. The code in https://github.com/actions/runner/blob/main/src/Runner.Listener/Runner.cs#L310 clearly does not process ^C gracefully; it cancels the current job.

[screenshot]

@mochja

mochja commented May 10, 2023

@Felixoid you should be able to get around this with ephemeral runners as the service will just exit after processing the job.
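
For reference, an ephemeral runner is registered by passing --ephemeral to config.sh (a sketch; the URL, token, labels, and name below are placeholders):

# Register a one-shot runner: it processes a single job, deregisters itself,
# and the listener exits.
./config.sh --url "$RUNNER_URL" --token "$RUNNER_TOKEN" --ephemeral \
            --labels "$LABELS" --name "$RUNNER_NAME"
./run.sh   # returns after one job; the surrounding script decides what happens next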

@Felixoid
Author

Felixoid commented May 10, 2023

@mochja thank you for the hint, but ephemeral runners do not work for us at all.

If we are talking about a fresh instance per runner, then we obviously lose a lot of cross-run caches, e.g., Docker images and volumes on the host. It's a huge overhead; we launch around 10k jobs per day.

If we are talking about an ephemeral runner on the same host, re-registered after each job, then it doesn't solve our issue either. The termination lambda proposes the runner for termination, but then the runner process is relaunched, and another freshly assigned job fails.

We desperately need the bug with signal processing to be fixed.

@mochja

mochja commented May 10, 2023

@Felixoid you have control over whether you relaunch the runner process or not, at least for an ephemeral runner.

Termination lambda proposes the runner to be terminated

then you can either wait for the runner to stop, not start a new one, and terminate; or, if the runner is not running a job, just kill it and terminate.
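
A sketch of that idea, assuming a hypothetical instance-should-terminate check (a placeholder for whatever queries your termination lambda or the instance metadata):

#!/usr/bin/env bash
# Relaunch loop for an ephemeral runner: after each job, decide whether to
# register and run again, or stop so the instance can terminate cleanly.
set -euo pipefail

while true; do
  if instance-should-terminate; then       # placeholder for your own check
    echo "Termination requested, not starting a new runner"
    break
  fi
  ./config.sh --url "$RUNNER_URL" --token "$RUNNER_TOKEN" --ephemeral \
              --labels "$LABELS" --name "$RUNNER_NAME"
  ./run.sh   # blocks until exactly one job has been processed
done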

@Felixoid
Author

Felixoid commented May 10, 2023

Once again, there is no guaranteed way to avoid a race there when restarting an ephemeral runner.

  • the runner restarts, which is not instant: at the moment it deletes itself via the GH API, then it needs to be registered again, and only after that does it receive jobs
  • the termination lambda sees this and proposes the instance for termination

Both happen at the same moment. It's unavoidable.

update

Or, let's put it like this: there is no simple way to tell from outside the host, within a timeframe of 2 seconds, whether it is restarting the runner or the runner died for some reason. The termination-policy lambda is itself strictly limited to a 2 s timeout.

Sure, it could be a complex system considering many signals, but that is arguing about whether there is a way to work around the bug. Sure, maybe there are plenty of workarounds.

I'd be happy to see this fixed, because before reporting it I spent months trying to find a way to decommission runners safely on scale-in. That's where I am.

@mochja

mochja commented May 10, 2023

What is holding you back from waiting until the job is processed on the runner? Do you have to terminate right away?

Jobs are received only when the runner gets online; when you register it, jobs are not yet being scheduled.

@Felixoid
Author

However, thank you for your attempt to help. Let's not spam the ticket about the signal-processing bug with a conversation about a theoretically possible workaround.

@Felixoid
Author

Felixoid commented Jun 7, 2023

I'll repeat here what I've posted to the ticket with GH support.

I've done some tests to determine whether ephemeral runners would work for us.

The outcome is devastating; I can't put it any other way.

So, the loop that I've tested is here

The instances check a lot of side information to decide whether they can shut down, including the external ASG status and rebalance signals.

The instance received the signal that it should tear down because of a rebalance recommendation, and it stepped back.

[screenshot]

But GitHub reports that another job, FunctionalStatelessTestS3Debug5, was started later and failed: https://api.github.com/repos/ClickHouse/ClickHouse/actions/jobs/14071870401

Does it mean that even ephemeral mode does not guarantee a graceful shutdown?

To unfold the situation:

# The instance finished a runner process, configured as
sudo -u ubuntu ./config.sh --url $RUNNER_URL --token "$RUNNER_TOKEN" --ephemeral --runnergroup Default --labels "$LABELS" --work _work --name "$INSTANCE_ID"
# then ran it as
sudo -u ubuntu \
          ACTIONS_RUNNER_HOOK_JOB_STARTED=/tmp/actions-hooks/pre-run.sh \
          ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/tmp/actions-hooks/post-run.sh \
          ./run.sh &
# Another loop checked that the process was still alive. The process finished with these lines in the logs:
#	2023-06-07T14:00:47.931+02:00	2023-06-07 12:00:47Z: Job FunctionalStatelessTestMsan5 completed with result: Succeeded
#	2023-06-07T14:00:47.931+02:00	√ Removed .credentials
#	2023-06-07T14:00:47.931+02:00	√ Removed .runner
#	2023-06-07T14:00:47.931+02:00	Runner listener exit with 0 return code, stop the service, no retry needed.
#	2023-06-07T14:00:48.182+02:00	Exiting runner...
#	2023-06-07T14:00:48.182+02:00	Got runner pid 
# the script then returned to the point where it would configure and run the runner again. But first it checked the conditions for stepping down
echo "Checking if the instance suppose to terminate"
check-terminating-metadata
#	2023-06-07T14:00:48.182+02:00	Checking if the instance suppose to terminate
#	2023-06-07T14:00:48.182+02:00	{"noticeTime":"2023-06-07T11:28:00Z"}The runner received rebalance recommendation, we are terminating
# And here it shutdown
#	2023-06-07T14:00:49.183+02:00	Going to terminate the runner's instance
#	2023-06-07T14:00:49.183+02:00	{ "TerminatingInstances": [ { "CurrentState": { "Code": 32, "Name": "shutting-down" }, "InstanceId": "i-048a1....

So:

  1. No new processes were launched
  2. GitHub Actions reports a failure

@asos-tommycouzens

asos-tommycouzens commented Jun 15, 2023

Would also love this!

We are in the process of setting up self-hosted runners and would like a safe way to scale down runners without causing running jobs to fail. Without the SIGTERM functionality requested in this issue, this is not possible unless we build a rather complex and fragile orchestrator.

We similarly do not use ephemeral runners because we want to make use of Docker caching. The availability of Docker caching was the primary reason we chose GitHub Actions over Azure DevOps.

@Felixoid
Author

Felixoid commented Jun 19, 2023

The situation with ephemeral runners is a bit better than with normal ones. See the discussion in ClickHouse/ClickHouse#49283. It's not necessary to shut down the host; the process can be restarted.

Unfortunately, it still doesn't guarantee that you can tear down the runner process when no jobs have been assigned to it for a while. Imagine we have a pool of 30 runners, and only 24 of them have running jobs. After some period, 60 seconds in our case, we shut down each of the ones that still have nothing to do. And at that very moment, GH reports that a job was assigned to one of these poor runners.

It looks like GH pushes jobs to runners, rather than a runner pulling a job by connecting to the API. If so, no matter what we do, there will be killed jobs.

Everything described above is my own conclusion. It's based on a long time spent playing left and right with different schemes to get runners working reliably, and desperately failing again and again.

@ergonab

ergonab commented Feb 20, 2024

Any updates on this?

@pauldraper

pauldraper commented Jun 28, 2024

A workaround is to remove the runner through the GitHub API.

If the runner is busy, the call fails. If it is not busy, the runner is removed, and the process exits.
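
For example, something along these lines (a sketch using the self-hosted runners REST API; the org, runner id, and token are placeholders, and the exact error returned for a busy runner may differ):

# Try to remove the runner via the API: the call is rejected while the runner
# is busy, and once it succeeds the listener process exits on its own.
curl -X DELETE \
  -H "Authorization: Bearer $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github+json" \
  "https://api.github.com/orgs/$ORG/actions/runners/$RUNNER_ID"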

@ergonab

ergonab commented Jun 28, 2024

@pauldraper Yes, that is what we are doing at the moment, but it's a hack. In order for a machine to be able to gracefully reboot, it needs to have API credentials stored on it. Or, alternatively, you need to set up some kind of privileged operator outside of the machine that receives the "wish" from the runner to be removed.

This is in contrast to GitLab's runner that simply stops accepting jobs from the manager if it wants to stop operation. No failed or stuck builds, if you want to reboot a machine in a fleet of runners.

Doesn't the GitHub Actions Kubernetes operator handle this gracefully as well? Why can't the stand-alone runner do the same?

@blackliner

blackliner commented Dec 20, 2024

Still no way to gracefully shut down a GitHub runner that is executing a job?

Would config.sh remove be a workaround? At least ChatGPT says it would deregister the runner while continuing with any running jobs.

EDIT: Unfortunately not

./config.sh remove --token $GHA_REGISTRATION_TOKEN

# Runner removal

Failed: Removing runner from the server
Runner "i-06ae7a6627a184307" is still running a job"

https://github.com/orgs/community/discussions/102641 mentions a graceful shutdown using the service 🤔
