
v1.1.6 regression (rootless, cgroup v2): container's cgroup is not empty: 5 process(es) found #3828

Closed
AkihiroSuda opened this issue Apr 13, 2023 · 8 comments


@AkihiroSuda
Member

Description

The nerdctl test suite no longer passes with runc v1.1.6 (rootless, cgroup v2), due to an error like `container's cgroup is not empty: 5 process(es) found`.

Steps to reproduce the issue

https://github.com/containerd/nerdctl/blob/v1.3.0/cmd/nerdctl/container_run_restart_linux_test.go#L33

go test -run '^TestRunRestart$' -test.kill-daemon -v ./cmd/nerdctl

Describe the results you received and expected

runc v1.1.5: passes
runc v1.1.6: fails

...
        Apr 13 22:06:31 suda-ws01 containerd-rootless.sh[14887]: time="2023-04-13T22:06:31.341228362+09:00" level=error msg="copy shim log" error="read /proc/self/fd/13: file already closed" namespace=nerdctl-test
        Apr 13 22:06:31 suda-ws01 containerd-rootless.sh[14887]: time="2023-04-13T22:06:31.341503657+09:00" level=error msg="apply change" error="failed to create shim task: OCI runtime create failed: runc create failed: container's cgroup is not empty: 5 process(es) found: unknown"
        
    container_run_restart_linux_test.go:88: ==========
    container_run_restart_linux_test.go:89: the container does not seem to be restarted
--- FAIL: TestRunRestart (99.06s)
FAIL
exit status 1
FAIL    github.com/containerd/nerdctl/cmd/nerdctl       99.094s

What version of runc are you using?

runc version 1.1.6
commit: v1.1.6-0-g0f48801a
spec: 1.0.2-dev
go: go1.20.2
libseccomp: 2.5.4

Host OS information

PRETTY_NAME="Ubuntu 22.10"
NAME="Ubuntu"
VERSION_ID="22.10"
VERSION="22.10 (Kinetic Kudu)"
VERSION_CODENAME=kinetic
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=kinetic
LOGO=ubuntu-logo

Host kernel information

Linux suda-ws01 5.19.0-38-generic #39-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 17 17:33:16 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
@AkihiroSuda
Member Author

This is caused by 12f2f03.
This was made on purpose, but it is probably too breaking a change for a patch release.

@kolyshkin
Contributor

> This is caused by 12f2f03. This was made on purpose, but it is probably too breaking a change for a patch release.

So, if the cgroup exists and is not empty, and the systemd cgroup driver is used, this almost always means that the systemd unit also exists. Previously, this did not cause any visible issues because the systemd cgroup driver used to ignore the UnitExists error.

But under the hood, the new container was started without adding its PID to the proper cgroup and without setting the properties of the systemd unit. This is a very critical bug, which runc 1.1.6 fixes.

The side effect is that, if the systemd cgroup driver is used, it errors out. To me, this is far better than running a container without placing it into its proper cgroup.

Now, if it errors out, it means something is very wrong.
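
For illustration, here is a minimal sketch (not runc's actual implementation) of what the new check amounts to on cgroup v2: a cgroup is considered non-empty when its cgroup.procs file lists any PIDs. The scope path in main is hypothetical.

```go
// Minimal, standalone sketch (not runc's actual code): on cgroup v2, a
// cgroup counts as non-empty when its cgroup.procs file lists one or
// more PIDs.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// cgroupIsEmpty reports whether the cgroup v2 directory cgroupDir
// (somewhere under /sys/fs/cgroup) contains no processes.
func cgroupIsEmpty(cgroupDir string) (bool, error) {
	data, err := os.ReadFile(filepath.Join(cgroupDir, "cgroup.procs"))
	if err != nil {
		if os.IsNotExist(err) {
			// The cgroup does not exist at all, so it is trivially empty.
			return true, nil
		}
		return false, err
	}
	return len(strings.Fields(string(data))) == 0, nil
}

func main() {
	// Hypothetical scope path, for illustration only.
	dir := "/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/user.slice/nerdctl-example.scope"
	empty, err := cgroupIsEmpty(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if !empty {
		fmt.Println("refusing to create container: cgroup is not empty")
	}
}
```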

@AkihiroSuda I am not familiar with the nerdctl codebase, but I can take a look early next week.

@kolyshkin
Contributor

> The side effect is that, if the systemd cgroup driver is used, it errors out. To me, this is far better than running a container without placing it into its proper cgroup.
>
> Now, if it errors out, it means something is very wrong.

In other words, I had a choice between:

  1. Refusing to use a non-empty cgroup.
  2. Fixing the issue of not adding the container to its proper systemd unit and cgroup.

I chose 2 because otherwise it's a huge issue.

@AkihiroSuda
Member Author

> can take a look early next week.

Thanks, help is appreciated.
The error happens while the daemon is killed, restarted (via systemd), and then restarts its --restart=always containers.

@kolyshkin
Contributor

Reproduced locally.

So, what happens is: containerd is killed by the test, but the container is still running. Here's an excerpt from the logs:

Event 1: container is created:

Apr 20 15:54:35 kir-rhat systemd[2239]: Started nerdctl-a13929d83e8124cfeb69afbfad28c94e9c38fdc2cfc3179090e2702b8e1b35f0.scope - libcontainer container a13929d83e8124cfeb69afbfad28c94e9c38fdc2cfc3179090e2702b8e1b35f0.

Event 2: containerd is killed and restarted:

Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sent signal SIGKILL to main process 10999 (rootlesskit) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 11011 (exe) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 11033 (slirp4netns) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 11040 (containerd) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 12508 (containerd-shim) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 12519 (nerdctl) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 11016 (rootlesskit) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 11056 (containerd) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 11057 (containerd) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 11058 (containerd) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 11059 (containerd) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 11060 (containerd) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 11061 (containerd) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 11062 (containerd) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 11063 (containerd) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 12511 (containerd-shim) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 12513 (n/a) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 12514 (n/a) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 12516 (containerd-shim) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 12520 (n/a) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 12521 (nerdctl) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 12522 (nerdctl) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 12523 (nerdctl) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 12524 (nerdctl) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 12525 (nerdctl) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Sending signal SIGKILL to process 12526 (nerdctl) on client request.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Main process exited, code=killed, status=9/KILL
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Killing process 11011 (exe) with signal SIGKILL.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Killing process 12519 (nerdctl) with signal SIGKILL.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Killing process 11040 (containerd) with signal SIGKILL.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Killing process 11056 (n/a) with signal SIGKILL.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Killing process 11057 (containerd) with signal SIGKILL.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Killing process 12511 (containerd-shim) with signal SIGKILL.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Killing process 12521 (nerdctl) with signal SIGKILL.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Failed with result 'signal'.
Apr 20 15:54:35 kir-rhat systemd[2239]: containerd.service: Consumed 3.791s CPU time.
Apr 20 15:54:37 kir-rhat systemd[2239]: containerd.service: Scheduled restart job, restart counter is at 3.
Apr 20 15:54:37 kir-rhat systemd[2239]: Stopped containerd.service - containerd (Rootless).
Apr 20 15:54:37 kir-rhat systemd[2239]: containerd.service: Consumed 3.791s CPU time.
Apr 20 15:54:37 kir-rhat systemd[2239]: Started containerd.service - containerd (Rootless).
Apr 20 15:54:38 kir-rhat containerd-rootless.sh[12735]: time="2023-04-20T15:54:38.077599640-07:00" level=info msg="starting containerd" revision= version=1.6.19
....

Event 4: containerd sees the dead shim, tries to clean it up and restart the container:

Apr 20 15:54:38 kir-rhat containerd-rootless.sh[12735]: time="2023-04-20T15:54:38.135309966-07:00" level=info msg="starting signal loop" namespace=nerdctl-test path=/run/containerd/io.containerd.runtime.v2.task/nerdctl-test/a13929d83e8124cfeb69afbfad28c94e9c38fdc2cfc3179090e2702b8e1b35f0 pid=12769 runtime=io.containerd.runc.v2
Apr 20 15:54:38 kir-rhat containerd-rootless.sh[12735]: time="2023-04-20T15:54:38.199988577-07:00" level=info msg="shim disconnected" id=a13929d83e8124cfeb69afbfad28c94e9c38fdc2cfc3179090e2702b8e1b35f0
Apr 20 15:54:38 kir-rhat containerd-rootless.sh[12735]: time="2023-04-20T15:54:38.200028616-07:00" level=warning msg="cleaning up after shim disconnected" id=a13929d83e8124cfeb69afbfad28c94e9c38fdc2cfc3179090e2702b8e1b35f0 namespace=nerdctl-test
Apr 20 15:54:38 kir-rhat containerd-rootless.sh[12735]: time="2023-04-20T15:54:38.200037131-07:00" level=info msg="cleaning up dead shim"
Apr 20 15:54:38 kir-rhat containerd-rootless.sh[12735]: time="2023-04-20T15:54:38.209645321-07:00" level=warning msg="cleanup warnings time=\"2023-04-20T15:54:38-07:00\" level=info msg=\"starting signal loop\" namespace=nerdctl-test pid=12800 runtime=io.containerd.runc.v2\ntime=\"2023-04-20T15:54:38-07:00\" level=warning msg=\"failed to read init pid file\" error=\"open /run/containerd/io.containerd.runtime.v2.task/nerdctl-test/a13929d83e8124cfeb69afbfad28c94e9c38fdc2cfc3179090e2702b8e1b35f0/init.pid: no such file or directory\" runtime=io.containerd.runc.v2\n"
Apr 20 15:54:38 kir-rhat containerd-rootless.sh[12735]: time="2023-04-20T15:54:38.209896433-07:00" level=error msg="copy shim log" error="read /proc/self/fd/11: file already closed"
Apr 20 15:54:38 kir-rhat containerd-rootless.sh[12735]: time="2023-04-20T15:54:38.209985060-07:00" level=error msg="apply change" error="failed to create shim task: OCI runtime create failed: runc create failed: container's cgroup is not empty: 5 process(es) found: [12543 12687 12688 12689 12690]: unknown"
...

(This last line is from a modified runc that prints the PIDs found in the cgroup.)

Now, if we take a look at these processes and the container:

```console
[kir@kir-rhat runc]$ ps ax | grep 12543
  12543 ?        Ss     0:00 nginx: master process nginx -g daemon off;
  13199 pts/1    S+     0:00 grep --color=auto 12543
[kir@kir-rhat runc]$ ps ax | grep 12687
  12687 ?        S      0:00 nginx: worker process
  13277 pts/1    S+     0:00 grep --color=auto 12687
[kir@kir-rhat runc]$ ps ax | grep 12688
  12688 ?        S      0:00 nginx: worker process
  13295 pts/1    S+     0:00 grep --color=auto 12688
[kir@kir-rhat runc]$ ps ax | grep 12689
  12689 ?        S      0:00 nginx: worker process
  13313 pts/1    S+     0:00 grep --color=auto 12689
[kir@kir-rhat runc]$ ps ax | grep 12690
  12690 ?        S      0:00 nginx: worker process
  13380 pts/1    S+     0:00 grep --color=auto 12690
[kir@kir-rhat runc]$ cat /proc/12543/cgroup 
0::/user.slice/user-1000.slice/user@1000.service/user.slice/nerdctl-a13929d83e8124cfeb69afbfad28c94e9c38fdc2cfc3179090e2702b8e1b35f0.scope
[kir@kir-rhat runc]$ systemctl --user status nerdctl-a13929d83e8124cfeb69afbfad28c94e9c38fdc2cfc3179090e2702b8e1b35f0.scope
Warning: The unit file, source configuration file or drop-ins of nerdctl-a13929d83e8124cfeb69afbfad28c94e9c38fdc2cfc3179090e2702b8e1b35f0.scope changed on disk. Run 'systemctl --user daemon-reload' to reload uni>
● nerdctl-a13929d83e8124cfeb69afbfad28c94e9c38fdc2cfc3179090e2702b8e1b35f0.scope - libcontainer container a13929d83e8124cfeb69afbfad28c94e9c38fdc2cfc3179090e2702b8e1b35f0
     Loaded: loaded (/run/user/1000/systemd/transient/nerdctl-a13929d83e8124cfeb69afbfad28c94e9c38fdc2cfc3179090e2702b8e1b35f0.scope; transient)
  Transient: yes
    Drop-In: /run/user/1000/systemd/transient/nerdctl-a13929d83e8124cfeb69afbfad28c94e9c38fdc2cfc3179090e2702b8e1b35f0.scope.d
             └─50-DevicePolicy.conf, 50-DeviceAllow.conf
     Active: active (running) since Thu 2023-04-20 15:54:35 PDT; 1min 31s ago
         IO: 136.0K read, 12.0K written
      Tasks: 5 (limit: 18898)
     Memory: 3.9M
        CPU: 92ms
     CGroup: /user.slice/user-1000.slice/user@1000.service/user.slice/nerdctl-a13929d83e8124cfeb69afbfad28c94e9c38fdc2cfc3179090e2702b8e1b35f0.scope
             ├─12543 "nginx: master process nginx -g daemon off;"
             ├─12687 "nginx: worker process"
             ├─12688 "nginx: worker process"
             ├─12689 "nginx: worker process"
             └─12690 "nginx: worker process"

Apr 20 15:54:35 kir-rhat systemd[2239]: Started nerdctl-a13929d83e8124cfeb69afbfad28c94e9c38fdc2cfc3179090e2702b8e1b35f0.scope - libcontainer container a13929d83e8124cfeb69afbfad28c94e9c38fdc2cfc3179090e2702b8e1>
```

💡 What we see here is a running container that was started before containerd was killed.

@kolyshkin
Contributor

With some added debugging, I see that containerd tries to `delete --force` the old/stale container, but runc fails:

RUNC --root /run/containerd/runc/nerdctl-test --log /run/containerd/io.containerd.runtime.v2.task/nerdctl-test/6dcaeef7b266e94e3ad31b86b2dd3533564e232ad5345c30918f0764393e1e80/log.json --log-format json delete --force 6dcaeef7b266e94e3ad31b86b2dd3533564e232ad5345c30918f0764393e1e80
DELETE container: <nil>, err: container does not exist

This error happens when the container's state.json cannot be found under its root directory.
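
For illustration, a minimal sketch (not runc's actual code) of why the error above appears, assuming the layout visible in the logs (`--root /run/containerd/runc/nerdctl-test`): runc considers a container to exist only if its state.json can be found under the root directory.

```go
// Minimal sketch (not runc's actual code): runc keeps per-container
// state under <root>/<container-id>/state.json, and once that file is
// gone it has no record of the container, so `delete --force` has
// nothing to act on.
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

var errContainerNotExist = errors.New("container does not exist")

// stateFile returns the path of the container's state file under root,
// or errContainerNotExist if runc has no record of the container.
func stateFile(root, id string) (string, error) {
	path := filepath.Join(root, id, "state.json")
	if _, err := os.Stat(path); err != nil {
		if os.IsNotExist(err) {
			return "", errContainerNotExist
		}
		return "", err
	}
	return path, nil
}

func main() {
	// Root and container ID taken from the logs above, purely for illustration.
	_, err := stateFile("/run/containerd/runc/nerdctl-test",
		"6dcaeef7b266e94e3ad31b86b2dd3533564e232ad5345c30918f0764393e1e80")
	fmt.Println(err) // prints "container does not exist" once the state dir is lost
}
```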

@AkihiroSuda I don't know much about rootlesskit and nerdctl, but it seems that restarting containerd makes the runc root directory disappear. I guess what happens is that rootlesskit uses a new mount namespace, which is lost once it's killed, so we lose all runc data.

For runc, if the container's state.json does not exist, the container does not exist. This means runc delete --force does not work. The old container is still running, and containerd wants runc to start one more container in the same cgroup and systemd unit, which does not work (and never did).

For runc delete --force to work after rootless containerd (and rootlesskit, etc.) is killed, runc's per-container data must be preserved.

CC @AkihiroSuda

@AkihiroSuda
Member Author

Thanks @kolyshkin !

@kolyshkin
Contributor

To sum it up (and close the issue): to address the issue of the systemd unit not being created or configured when a unit with the same name already exists, runc 1.1.6+ refuses to create a container if its cgroup is not empty.

For the reasons described in #3132, sharing a cgroup between multiple containers is always a bad idea, so runc 1.2 will always reject such a configuration. runc 1.1.6+ already does that, but only when the systemd cgroup manager is used, since it is impossible to create the systemd unit otherwise.

For more details, see
