Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP: New cgroup driver: "none" (for Rootless) #1084

Closed
wants to merge 1 commit into from

Conversation

AkihiroSuda
Copy link
Member

Add "none" cgroup driver to kubelet so that kubelet (as well as CRI/OCI runtimes) can be executed without write permissions to cgroups.

Docker v19.03+, containerd/CRI 20190104+, and CRI-O v1.12+ already support running without cgroup permissions. So we only need to update kubelet.

Draft patch for kubelet is available in Usernetes project: https://github.com/rootless-containers/usernetes/blob/2fd0ba0594673e0c5ea04a894191bfaa3b117377/src/patches/kubernetes/0003-kubelet-new-feature-gate-SupportNoneCgroupDriver.patch

An old version of our patch has been also used by Rancher's k3s.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
@k8s-ci-robot
Copy link
Contributor

Welcome @AkihiroSuda!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 4, 2019
@k8s-ci-robot
Copy link
Contributor

Hi @AkihiroSuda. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 4, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: AkihiroSuda
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: derekwaynecarr

If they are not already assigned, you can assign the PR to them by writing /assign @derekwaynecarr in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@AkihiroSuda
Copy link
Member Author

## Alternatives

* `/sys/fs/cgroup` could be faked with FUSE (which can be unprivileged since kernel 4.18), but it seems impossible to fake `/proc/$PID/cgroup` with FUSE. So anyway we would need to add some special trick to kubelet implementation.
* [seccomp-trap-to-userspace (introduced in kernel 5.0)](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6a21cc50f0c7f87dae5259f6cfefe024412313f6) might be usable for faking `/sys/fs/cgroup` as well as `/proc/$PID/cgroup`, but it still seems difficult and too new. Also, its performance overhead needs to be evaluated.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wouldn't really be reasonable to fake it through seccomp -- you can't dereference pointers in seccomp. The overhead would in theory be smaller than doing it with ptrace (since there's no ptrace-related overhead and only open would be targeted), but would be significantly worse than something like FUSE IMHO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably you can hook fd syscalls and readlink /proc/PID/fd/N instead of dereferencing string pointer...though it might be extremely slow

Copy link

@cyphar cyphar Jun 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in, hook all syscalls that would touch the fd? That would be significantly worse in terms of performance (and complexity) than just doing it for open(2) and would probably be incredibly difficult if not completely impractical (we'd have to fake sendmsg for instance).

Anyway, this was just a nit -- we know we're not going to do this anyway. 😉


## Summary

Add "none" cgroup driver to kubelet so that kubelet (as well as CRI/OCI runtimes) can be executed without write permissions to cgroups.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a goal of "rootless" to remove all resource isolation? If not, can we instead work towards the end-state of supporting cgroupv2 rather than disabling all resource isolation?

* To allow users of shared machines (especially HPC) to run Kubernetes without the risk of breaking their colleagues' environments
* Safe Kubernetes-on-Kubernetes

Without cgroups, kubelet cannot support restricting and monitoring CPU/memory resources for pods, but pod processes can still spontaneously impose traditional `setrlimit` restrictions on themselves.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a "rootless" kubernetes cluster with cgroups disabled intended to pass any node e2e or cluster e2e conformance tests?

([Fedora 31](https://www.phoronix.com/scan.php?page=news_item&px=Fedora-31-Cgroups-V2-Default) is already planning to use cgroup2 by default, but anyway Kubernetes users will need to
change the GRUB configuration to use cgroup1, as Kubelet and CRI runtimes will not be able to support cgroup2 by the releae of Fedora 31.)

Also, even after the migration to cgroup2, containerized Kubernetes might not get support for cgroup2, because it is likely that [cgroup2 will only be supported via systemd](https://github.com/containers/libpod/issues/1429#issuecomment-465420444). So the "none" cgroup driver will still remain useful.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not following this chain of thought. I prefer we work towards enabling cgroup v2 and rationalize what a cgroup manager role is in a cgroup v2 host environment. I know we have discussed enabling the kubelet to run in cgroupv1 tolerant or cgroup v2 mode (with maybe a reduced set of controllers). In general, I am apprehensive about disabling all resource isolation versus working towards an end-state that enables or supports it. I think most users who hear rootless kubernetes will assume it comes with no limitations.


### Risks and Mitigations

Pods might be able to exhaust the host CPU/memory resources.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would apply to pids as well.

@derekwaynecarr
Copy link
Member

I am supportive of iterative improvements, but its not clear what the audience is for this capability.

Are you targeting operators that prefer the security benefits of userns but are willing to sacrifice resource isolation? It would seem to me that the user that is running untrusted workloads would want rootless, but when also running untrusted workloads would desire resource isolation as well. Is this intended to be a subset of nodes in the cluster?

Should a kubelet report back to the cluster scheduler that it does not respect resource isolation at all? How would this intersect with vertical pod autoscaling or horizontal pod autoscaling? I am having trouble thinking about this change holistically in the context of a production cluster, so understanding the persona and environment better is probably beneficial.

@sjenning
Copy link
Contributor

sjenning commented Jun 5, 2019

Does this disable enforcement of kube-reserved and system-reserved?

What is the impact to metrics? And by extension eviction?

@AkihiroSuda
Copy link
Member Author

As there is an on-going discussion for supporting cgroup2 in Kubernetes, I'm going to close this PR temporarily, but I'll reopen if we face an issue in supporting some usecases (especially kind and microk8s-like one) with cgroup2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants