The linux runner does not gracefully shutdown on SIGINT #2582
Comments
Are you running the runner interactively or as a service?
It's launched as a service.

Update: it looks like the service handles stopping itself. But under the hood, it sends SIGINT.

Update 2: ok, I see how it would make things even worse. After 30 seconds the service would send the next signal. I am not sure 30 seconds is ever enough to finish something: our jobs last 30+ minutes on average, sometimes up to hours. The function's name
That's what I have when running
And the latter for the
@Felixoid you should be able to get around this with ephemeral runners, as the service will just exit after processing the job.
@mochja thank you for the hint, but it doesn't help us. If we are talking about a fresh instance per runner, then we obviously lose a lot of cross-run caches, e.g. docker images and volumes on the host. It's a huge overhead; we launch around 10k jobs per day. If we are talking about an ephemeral runner on the same host, re-registered after each job, then it doesn't solve our issue either. The termination lambda proposes the runner to be terminated, but then the runner process is relaunched, and another freshly assigned job fails. We desperately need the bug with signal processing to be fixed.
@Felixoid you have control over whether you relaunch the runner process or not, at least for an ephemeral runner.
Then you can either wait for the runner to stop, not start a new one, and terminate; or, if the runner is not running a job, you can just kill it and terminate.
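The decision flow described above can be sketched roughly as follows. Every helper here is a hypothetical stub; in a real setup they would be replaced by actual checks (an ASG rebalance/scale-in signal, an "is a job running" flag set by the runner's job-started hook):

```shell
#!/bin/sh
# Sketch of the suggested control flow for an ephemeral runner.
# All helpers are hypothetical stubs standing in for real checks.

want_to_terminate() { return 0; }   # stub: pretend a scale-in signal arrived
runner_busy()       { return 1; }   # stub: pretend no job is assigned

decide() {
  if want_to_terminate; then
    if runner_busy; then
      # An ephemeral listener exits on its own after the job; just wait for it.
      echo "waiting for listener to drain"
    else
      echo "killing idle listener"
    fi
    echo "terminating instance"
  fi
}

decide
```

With the stub values above, the idle branch is taken and the instance is terminated without waiting.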
Once again, there's no guaranteed way to avoid racing there in case of restarting an ephemeral runner when both happen at the same moment. It's unavoidable.

Update: or, let's put it like this. There's no simple way to identify from outside the host, within a timeframe of 2 seconds, whether it's restarting the runner or has died for some reason, because the termination-policy lambda is likewise strictly limited to a 2s timeout. Sure, it could be a complex system considering many values, but arguing about it is rather pointless.

I'd be happy to see it fixed, because before reporting it I spent months trying to find a way to decommission runners safely on scaling-in. That's where I am.
What is holding you back from waiting until the job is processed on the runner? Do you have to terminate right away? Jobs are received only when the runner gets online; when you register it, jobs are not being scheduled.
However, thank you for your attempt to help. Let's not spam the ticket about the signal-processing bug with a conversation about a theoretically possible workaround.
I'll repeat here what I've posted to the ticket with GH support. I've done some tests to determine if ephemeral runners would work for us. The outcome is devastating, I can't put it any other way.

So, the loop that I've tested is here. The instances check a lot of side information to decide if they can shut down, including the external ASG status and rebalance signals. The instance received the signal that it should tear down because of the rebalance signal, and it stepped back. But GitHub reports another job started and failed later: FunctionalStatelessTestS3Debug5 https://api.github.com/repos/ClickHouse/ClickHouse/actions/jobs/14071870401

Does it mean that even ephemeral mode does not guarantee a graceful shutdown? To unfold the situation:

# The instance finished a runner process, configured as
sudo -u ubuntu ./config.sh --url $RUNNER_URL --token "$RUNNER_TOKEN" --ephemeral --runnergroup Default --labels "$LABELS" --work _work --name "$INSTANCE_ID"
# then ran it as
sudo -u ubuntu \
ACTIONS_RUNNER_HOOK_JOB_STARTED=/tmp/actions-hooks/pre-run.sh \
ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/tmp/actions-hooks/post-run.sh \
./run.sh &
# And checked in another loop, that the process is alive. When the process finished with lines in logs:
# 2023-06-07T14:00:47.931+02:00 2023-06-07 12:00:47Z: Job FunctionalStatelessTestMsan5 completed with result: Succeeded
# 2023-06-07T14:00:47.931+02:00 √ Removed .credentials
# 2023-06-07T14:00:47.931+02:00 √ Removed .runner
# 2023-06-07T14:00:47.931+02:00 Runner listener exit with 0 return code, stop the service, no retry needed.
# 2023-06-07T14:00:48.182+02:00 Exiting runner...
# 2023-06-07T14:00:48.182+02:00 Got runner pid
# it again ended up in the place to configure it and run again. But first it checked the conditions to step down
echo "Checking if the instance suppose to terminate"
check-terminating-metadata
# 2023-06-07T14:00:48.182+02:00 Checking if the instance suppose to terminate
# 2023-06-07T14:00:48.182+02:00 {"noticeTime":"2023-06-07T11:28:00Z"}
# The runner received rebalance recommendation, we are terminating
# And here it shut down
# 2023-06-07T14:00:49.183+02:00 Going to terminate the runner's instance
# 2023-06-07T14:00:49.183+02:00 { "TerminatingInstances": [ { "CurrentState": { "Code": 32, "Name": "shutting-down" }, "InstanceId": "i-048a1....

So:
Would also love this! We are in the process of setting up self-hosted runners and would like a safe way to scale down runners without causing running jobs to fail. Without the SIGTERM functionality requested in this issue, this is not possible without us building a quite complex and fragile orchestrator. We similarly do not use ephemeral runners because we want to make use of docker caching. The availability of docker caching was the primary reason we chose GitHub Actions over Azure DevOps.
The situation with ephemeral runners is a bit better than with normal ones; see the discussion in ClickHouse/ClickHouse#49283. It's not necessary to shut down the host, the process could restart. Unfortunately, it still doesn't guarantee that you can tear down the runner process if no jobs have been assigned to it for a long time. Imagine we have a pool of 30 runners, and only 24 of them have running jobs. After some period, 60 seconds in our case, we shut down each one that still doesn't have anything to do. And at exactly this moment, GH reports a job was assigned to one of these poor runners. It looks like GH pushes jobs to runners, rather than a runner claiming a job by connecting to the API. If so, no matter what, there will be killed jobs.

All described above is my own conclusion. It's based on a long time playing left and right with different schemes to get runners working reliably, but desperately failing again and again.
Any updates on this? |
A workaround is to remove the runner through the GitHub API. If the runner is busy, the call fails. If it's not busy, the runner is removed and the process exits.
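A minimal sketch of that workaround, under stated assumptions: a repo-level runner, a token with sufficient rights in `GITHUB_TOKEN`, and `curl` plus `jq` on the host (the owner/repo names below are placeholders). Since the DELETE call fails while the runner is busy, a failing attempt simply means "retry later":

```shell
#!/bin/sh
# Sketch: deregister a self-hosted runner via the GitHub REST API.
# Assumptions (hypothetical values): repo-level runner, GITHUB_TOKEN set,
# jq installed. Nothing network-facing runs when this file is sourced;
# it only defines helpers.

runners_url() {            # $1=owner $2=repo
  echo "https://api.github.com/repos/$1/$2/actions/runners"
}

runner_id() {              # $1=owner $2=repo $3=runner name
  curl -sf -H "Authorization: Bearer $GITHUB_TOKEN" "$(runners_url "$1" "$2")" |
    jq -r ".runners[] | select(.name==\"$3\") | .id"
}

try_remove() {             # $1=owner $2=repo $3=runner name
  # Fails (non-zero exit) while the runner is busy; succeeds once idle.
  curl -sf -X DELETE -H "Authorization: Bearer $GITHUB_TOKEN" \
    "$(runners_url "$1" "$2")/$(runner_id "$1" "$2" "$3")"
}

# Intended usage (not executed here): poll until the runner is idle enough
# to be removed, then shut the host down.
#   until try_remove my-org my-repo "$(hostname)"; do sleep 30; done
```

The polling loop is exactly the "hack" discussed below: it works, but it requires API credentials on the machine.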
@pauldraper Yes, that is what we are doing at the moment, but it's a hack. In order for a machine to be able to gracefully reboot, it needs to have API credentials stored on it. Or, alternatively, you need to set up some kind of privileged operator outside of the machine that receives the "wish" from the runner to be removed. This is in contrast to GitLab's runner, which simply stops accepting jobs from the manager when it wants to stop operation: no failed or stuck builds if you want to reboot a machine in a fleet of runners. Doesn't the GitHub Actions Kubernetes operator handle this gracefully as well? Why can't the stand-alone runner do the same?
Still no way to gracefully shut down a GitHub runner that is executing a job? Would

EDIT: Unfortunately not.
https://github.com/orgs/community/discussions/102641 mentions a graceful shutdown using the service 🤔
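The linked discussion amounts to stopping the listener through the service manager rather than signalling it directly. As a hedged sketch only: the unit name pattern below (`actions.runner.<org>-<repo>.<host>.service`, as created by `./svc.sh install`) and the 2h value are assumptions, and whether the listener actually drains the running job on stop is exactly what this issue disputes. A systemd drop-in could at least give it time:

```ini
# /etc/systemd/system/actions.runner.<org>-<repo>.<host>.service.d/override.conf
# Hypothetical drop-in: send SIGINT on "systemctl stop" and allow
# long-running jobs to finish before systemd escalates to SIGKILL.
[Service]
KillSignal=SIGINT
TimeoutStopSec=2h
```

`KillSignal=` and `TimeoutStopSec=` are standard systemd [Service] options; this only changes how the service is stopped, not how the runner itself reacts to the signal.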
Describe the bug
According to #2190 (comment), the runner should wait until the job is finished and then stop. But that doesn't happen, and new tasks are assigned to the runner with the old PID.
To Reproduce
Steps to reproduce the behavior:

1. pgrep -af Runner.Listener
2. pkill -INT Runner.Listener
3. pgrep -af Runner.Listener (the PID does not change; the listener is still there)
4. pkill -INT run.sh does not work as well

Expected behavior

The runner should gracefully shut down. If it does not do so on SIGINT, there must be another way to stop it right after the current job is finished, before a new one is assigned.
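For illustration only, the semantics being asked for can be modeled in a few lines of shell. This is a hypothetical sketch, not the runner's actual code: POSIX shells run a trap only after the current foreground command completes, which conveniently models "let the in-flight job finish, then take no new one":

```shell
#!/bin/sh
# Hypothetical model of graceful SIGINT handling: the trap only marks a
# flag; the loop finishes the current "job" and then exits instead of
# picking up another.

stopping=0
trap 'stopping=1' INT

# Simulate SIGINT arriving while a job is running.
( sleep 1; kill -INT $$ ) &

jobs_done=0
while [ "$stopping" -eq 0 ]; do
  sleep 3                       # stand-in for an assigned job
  jobs_done=$((jobs_done + 1))
done
echo "finished $jobs_done job(s), shutting down gracefully"
```

Running this, the signal lands mid-"job", the job still completes, and the loop exits before a second iteration starts, which is precisely the behavior this issue reports as missing.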
Runner Version and Platform
Linux 2.298.2 amd64
What's not working?
Please include error messages and screenshots.