KEP: New cgroup driver: "none" (for Rootless) #1084
Conversation
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
## Alternatives

* `/sys/fs/cgroup` could be faked with FUSE (which can be unprivileged since kernel 4.18), but it seems impossible to fake `/proc/$PID/cgroup` with FUSE. So we would need to add some special trick to the kubelet implementation anyway.
* [seccomp-trap-to-userspace (introduced in kernel 5.0)](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6a21cc50f0c7f87dae5259f6cfefe024412313f6) might be usable for faking `/sys/fs/cgroup` as well as `/proc/$PID/cgroup`, but it still seems difficult and too new. Also, its performance overhead needs to be evaluated.
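For context, any FUSE- or seccomp-based approach would have to synthesize the contents of `/proc/$PID/cgroup`. A minimal sketch of a parser for that file's format (the sample data below is illustrative, not taken from a real host):

```go
package main

import (
	"fmt"
	"strings"
)

// parseProcCgroup parses /proc/$PID/cgroup content, whose lines have the
// form "hierarchy-ID:controller-list:cgroup-path", and maps each
// controller list to its cgroup path.
func parseProcCgroup(data string) map[string]string {
	out := map[string]string{}
	for _, line := range strings.Split(strings.TrimSpace(data), "\n") {
		parts := strings.SplitN(line, ":", 3)
		if len(parts) == 3 {
			out[parts[1]] = parts[2]
		}
	}
	return out
}

func main() {
	// Illustrative sample: one v1 memory hierarchy plus the v2 unified entry
	// (hierarchy ID 0 with an empty controller list).
	sample := "4:memory:/user.slice\n0::/user.slice/user-1000.slice"
	m := parseProcCgroup(sample)
	fmt.Println(m["memory"]) // /user.slice
	fmt.Println(m[""])       // /user.slice/user-1000.slice
}
```

Any faked view of this file would need to emit paths consistent with whatever the runtime expects, which is part of why the tricks below were judged impractical.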
It wouldn't really be reasonable to fake it through seccomp -- you can't dereference pointers in seccomp. The overhead would in theory be smaller than doing it with ptrace (since there's no ptrace-related overhead and only `open` would be targeted), but it would be significantly worse than something like FUSE IMHO.
You could probably hook fd syscalls and `readlink` `/proc/$PID/fd/N` instead of dereferencing the string pointer... though it might be extremely slow.
As in, hook all syscalls that would touch the fd? That would be significantly worse in terms of performance (and complexity) than just doing it for `open(2)`, and would probably be incredibly difficult if not completely impractical (we'd have to fake `sendmsg`, for instance).

Anyway, this was just a nit -- we know we're not going to do this anyway. 😉
## Summary

Add "none" cgroup driver to kubelet so that kubelet (as well as CRI/OCI runtimes) can be executed without write permissions to cgroups.
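To picture what the proposal amounts to, here is a deliberately simplified sketch. The `cgroupDriver` interface and its method names below are hypothetical stand-ins, not the real kubelet API; the point is only that every cgroup write becomes a no-op:

```go
package main

import "fmt"

// cgroupDriver is a hypothetical, simplified stand-in for the kubelet's
// cgroup-management interface (the real kubelet API is much larger).
type cgroupDriver interface {
	SetCPULimitMilli(pod string, millicores int64) error
	SetMemoryLimitBytes(pod string, bytes int64) error
}

// noneDriver is what the proposed "none" driver amounts to: every cgroup
// write is a no-op, so kubelet never needs cgroup write permission, at the
// cost of enforcing no resource limits.
type noneDriver struct{}

func (noneDriver) SetCPULimitMilli(pod string, millicores int64) error { return nil }
func (noneDriver) SetMemoryLimitBytes(pod string, bytes int64) error   { return nil }

func main() {
	var d cgroupDriver = noneDriver{}
	err := d.SetCPULimitMilli("demo-pod", 500)
	fmt.Println(err) // <nil>: the request is accepted but not enforced
}
```

This framing also explains the reviewer objections below: the driver succeeds unconditionally, so nothing in the API surface signals that limits are unenforced.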
Is it a goal of "rootless" to remove all resource isolation? If not, can we instead work towards the end-state of supporting cgroupv2 rather than disabling all resource isolation?
* To allow users of shared machines (especially HPC) to run Kubernetes without the risk of breaking their colleagues' environments
* Safe Kubernetes-on-Kubernetes
Without cgroups, kubelet cannot restrict or monitor CPU/memory resources for pods, but pod processes can still voluntarily apply traditional `setrlimit` restrictions to themselves.
Is a "rootless" kubernetes cluster with cgroups disabled intended to pass any node e2e or cluster e2e conformance tests?
([Fedora 31](https://www.phoronix.com/scan.php?page=news_item&px=Fedora-31-Cgroups-V2-Default) is already planning to use cgroup2 by default, but Kubernetes users will need to change the GRUB configuration to use cgroup1 anyway, as kubelet and CRI runtimes will not be able to support cgroup2 by the release of Fedora 31.)
Also, even after the migration to cgroup2, containerized Kubernetes might not get support for cgroup2, because it is likely that [cgroup2 will only be supported via systemd](https://github.com/containers/libpod/issues/1429#issuecomment-465420444). So the "none" cgroup driver will still remain useful.
I am not following this chain of thought. I prefer we work towards enabling cgroup v2 and rationalize what a cgroup manager role is in a cgroup v2 host environment. I know we have discussed enabling the kubelet to run in cgroupv1 tolerant or cgroup v2 mode (with maybe a reduced set of controllers). In general, I am apprehensive about disabling all resource isolation versus working towards an end-state that enables or supports it. I think most users who hear rootless kubernetes will assume it comes with no limitations.
### Risks and Mitigations

Pods might be able to exhaust the host CPU/memory resources.
This would apply to PIDs as well.

I am supportive of iterative improvements, but it's not clear what the audience is for this capability. Are you targeting operators that prefer the security benefits of userns but are willing to sacrifice resource isolation? It would seem to me that a user running untrusted workloads would want rootless, but would also desire resource isolation. Is this intended to be a subset of nodes in the cluster? Should a kubelet report back to the cluster scheduler that it does not respect resource isolation at all? How would this intersect with vertical or horizontal pod autoscaling? I am having trouble thinking about this change holistically in the context of a production cluster, so understanding the persona and environment better is probably beneficial.

Does this disable enforcement of … ? What is the impact to metrics? And by extension eviction?
As there is an on-going discussion for supporting cgroup2 in Kubernetes, I'm going to close this PR temporarily, but I'll reopen it if we face an issue in supporting some use cases (especially …).
Docker v19.03+, containerd/CRI 20190104+, and CRI-O v1.12+ already support running without cgroup permissions. So we only need to update kubelet.
Draft patch for kubelet is available in Usernetes project: https://github.com/rootless-containers/usernetes/blob/2fd0ba0594673e0c5ea04a894191bfaa3b117377/src/patches/kubernetes/0003-kubelet-new-feature-gate-SupportNoneCgroupDriver.patch
An old version of our patch has also been used by Rancher's k3s.