From da807d0c770d048ec3f98d5a3551772d24dc40ce Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mauricio=20V=C3=A1squez?=
Date: Fri, 17 Jul 2020 16:51:57 -0500
Subject: [PATCH 01/89] keps/127: Support User Namespaces

---
 .../127-usernamespaces-support/README.md | 788 ++++++++++++++++++
 .../127-usernamespaces-support/kep.yaml  |  19 +
 2 files changed, 807 insertions(+)
 create mode 100644 keps/sig-node/127-usernamespaces-support/README.md
 create mode 100644 keps/sig-node/127-usernamespaces-support/kep.yaml

diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md
new file mode 100644
index 00000000000..6a4f9aa771f
--- /dev/null
+++ b/keps/sig-node/127-usernamespaces-support/README.md
@@ -0,0 +1,788 @@
+
+# KEP-127: Support User Namespaces
+
+
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [User Stories](#user-stories)
+    - [Story 1](#story-1)
+    - [Story 2](#story-2)
+  - [Notes/Constraints/Caveats](#notesconstraintscaveats)
+    - [Volumes Support](#volumes-support)
+    - [Container Runtime Support](#container-runtime-support)
+  - [Risks and Mitigations](#risks-and-mitigations)
+    - [Breaking Existing Workloads](#breaking-existing-workloads)
+    - [Duplicated Snapshots of Container Images](#duplicated-snapshots-of-container-images)
+- [Implementation Phases](#implementation-phases)
+  - [Phase 1](#phase-1)
+  - [Phase 2](#phase-2)
+  - [Phase 2+](#phase-2-1)
+- [Design Details](#design-details)
+  - [Summary of the Proposed Changes](#summary-of-the-proposed-changes)
+  - [PodSpec Changes](#podspec-changes)
+  - [CRI API Changes](#cri-api-changes)
+  - [Test Plan](#test-plan)
+  - [Graduation Criteria](#graduation-criteria)
+    - [Alpha -> Beta](#alpha---beta)
+    - [Beta -> GA](#beta---ga)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+  - [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
+  - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+  - [Differences with Previous Proposal](#differences-with-previous-proposal)
+  - [Default Value for userNamespaceMode](#default-value-for-usernamespacemode)
+  - [Host Defaulting Mechanishm](#host-defaulting-mechanishm)
+- [References](#references)
+
+
+## Release Signoff Checklist
+
+
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [ ] (R) Design details are appropriately documented
+- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
+- [ ] (R) Graduation criteria is in place
+- [ ] (R) Production readiness review completed
+- [ ] Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+Container security consists of many different kernel features that work together
+to make containers secure. User namespaces isolate user and group IDs by
+allowing processes to run with different IDs in the container and in the host.
+Specially, a process running as privileged in a container can be seen as
+unprivileged in the host. This gives more capabilities to the containers and
+protects the host from malicious or compromised containers.
+
+This KEP is a continuation of the work initiated in the [Support Node-Level User
+Namespaces
+Remapping](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/node-usernamespace-remapping.md)
+proposal.
+
+## Motivation
+
+From [user_namespaces(7)](https://man7.org/linux/man-pages/man7/user_namespaces.7.html):
+> User namespaces isolate security-related identifiers and attributes, in
+particular, user IDs and group IDs, the root directory, keys, and capabilities.
+A process's user and group IDs can be different inside and outside a user
+namespace. In particular, a process can have a normal unprivileged user ID
+outside a user namespace while at the same time having a user ID of 0 inside
+the namespace; in other words, the process has full privileges for operations
+inside the user namespace, but is unprivileged for operations outside the
+namespace.
+
+The goal of supporting user namespaces in Kubernetes is to be able to run
+processes in pods with a different user and group IDs than in the host.
+Speficically, a privileged process in the pod runs as an unprivileged process in the
+host. If such a process is able to break into the host, it'll have limited
+impact as it'll be running as an unprivileged user there.
+
+There have been some security vulnerabilities that could have been mitigated by
+user namespaces. Some examples are:
+- CVE-2016-8867: Privilege escalation inside containers
+  - https://github.com/moby/moby/issues/27590
+- CVE-2018-15664: TOCTOU race attack that allows to read/write files in the host
+  - https://github.com/moby/moby/pull/39252
+- CVE-2019-5736: Host runc binary can be overwritten from container
+  - https://github.com/opencontainers/runc/commit/0a8e4117e7f715d5fbeef398405813ce8e88558b
+
+### Goals
+
+- Increase node to pod isolation in Kubernetes by mapping user and group IDs
+  inside the container to different IDs in the host. In particular, mapping root
+  inside the container to unprivileged user and group IDs in the node.
+
+### Non-Goals
+
+TODO(Mauricio)
+
+## Proposal
+
+This proposal aims to support user namespaces in Kubernetes by extending the pod
+specification with a new `userNamespaceMode` field. This proposal aims to
+support three modes.
+
+- **Host**:
+  The pods are placed in the host user namespace, this is the current Kubernetes
+  behaviour. This mode is intended for pods that only work in the root (host)
+  user namespace. It is the default mode when `userNamespaceMode` field is not
+  set.
+
+- **Cluster**:
+  All the pods in the cluster are placed in a different user namespace but they
+  use the same ID mappings. This mode doesn't provide full pod-to-pod isolation
+  as all the pods with `Cluster` mode have the same effective IDs on the host.
+  It provides pod-to-host isolation as the IDs are different inside the
+  container and in the host. This mode is intended for pods sharing volumes as
+  they will run with the same effective IDs on the host, allowing them to read
+  and write files in the shared storage.
+
+- **Pod**:
+  Each pod is placed in a different user namespace and has a different and
+  non-overlapping ID mappings. This mode is intended for stateless pods, i.e.
+  pods using only ephemeral volumes like `configMap,` `downwardAPI`, `secret`,
+  `projected` and `emptyDir`. This mode not only provides host-to-pod isolation
+  but also pod-to-pod isolation as each pod has a different range of effective
+  IDs in the host.
+
+### User Stories
+
+#### Story 1
+
+As a cluster admin, I want run some pods with privileged capabilities because
+the applications in the pods require it (e.g. `CAP_SYS_ADMIN` to mount a FUSE
+filesystem or `CAP_NET_ADMIN` to setup a VPN) but I don't want this extra
+capability to give any extra privilege on the host.
+
+#### Story 2
+
+As a cluster admin, I want to allow some pods to share the host user namespace
+if they need a feature only available in such user namespace.
+
+### Notes/Constraints/Caveats
+
+#### Volumes Support
+
+The Linux kernel uses the effective user and group IDs (the ones the host) to
+check the file access permissions. Since with user namespaces IDs are mapped to
+a different value on the host, this could cause issues accessing volumes if the
+pod is run with a different mapping, i.e. the effective user and group IDs on
+the host change.
+
+This proposal supports volume without changing the user and group IDs and leaves
+that problem to the user to manage. Future Linux kernel features such as shiftfs
+could allow to different pods to see a volume with its own IDs but it is out of
+scope of this proposal. Among the possible future kernel solutions, we can list:
+
+- [shiftfs: uid/gid shifting filesystem](https://lwn.net/Articles/757650/)
+- [A new API for mounting filesystems](https://lwn.net/Articles/753473/)
+- [user_namespace: introduce fsid mappings](https://lwn.net/Articles/812221/)
+
+In regard to this proposal, volumes can be divided in ephemeral and non-ephemeral.
+
+Ephemeral volumes are associated to a **single** pod and their lifecyle is
+dependent on that pod. These are `configMap`, `secret`, `emptyDir`,
+`downwardAPI`, etc. These kind of volumes are easy to handle as they are not
+shared by different pods and hence all the process accessig those volumes have
+the same effective user and group IDs. Kubelet creates the files for those
+volumes and it can easily set the file ownership too.
+
+Non-ephemeral volumes more difficult to support since they can be persistent and
+also can be shared by multiple pods. This proposal supports volumes with two
+different strategies:
+- The `Cluster` makes it easier for pods to share files using volumes when those
+  don't have access permissions for `others` because the effective user and
+  group IDs on the host are the same for all the pods.
+- The semantics of semantics of `fsGroup` are respected, if it's specified it's
+  assumed to be the correct GID in the host and an 1-to-1 mapping entry for the
+  `fsGroup` is added to the GID mappings for the pod.
+
+There are some cases that require special attention from the user. For instance,
+a process inside a pod will not be able to access files with mode `0700` and
+owned by a user different than the effective user of that process in a volume
+that doesn't support the semantics of `fsGroup` (doesn't support
+[`SetVolumeOwnership`](https://github.com/kubernetes/kubernetes/blob/00da04ba23d755d02a78d5021996489ace96aa4d/pkg/volume/volume_linux.go#L42)
+that updates permissions and ownership of the files to be accesible by the
+`fsGroup` group ID). These pods should be run in `Host` mode.
+
+#### Container Runtime Support
+
+- **Docker**:
+  It only supports a [single IDs
+  mapping](https://docs.docker.com/engine/security/userns-remap/) shared by all
+  containers running in the host. There is not support for [multiple IDs
+  mapping](https://github.com/moby/moby/issues/28593) yet. Dockershim runtime is
+  only compatible with pods running in `Host` and `Cluster` modes. The user has
+  to guarantee that the ID mappings configured in Docker through the
+  `userns-remap` and the cluster-wide range configured in the Kubelet are the
+  same. The dockershim implementation includes a check to verify that the IDs
+  mapping received from the Kubelet are equal to the ones configured in Docker,
+  returning an error otherwise.
+- **containerd**:
+  It's quite straigtforward to implement the CRI changes proposed below in
+  containerd/cri, we did it in
+  [this](https://github.com/kinvolk/containerd-cri/commits/mauricio/userns_poc)
+  PoC.
+- **cri-o**:
+  It recently [added](https://github.com/cri-o/cri-o/pull/3944) support for
+  user namespaces through a pod annotation. The extensions to make it work with
+  the current CRI changes are small.
+- TODO(Mauricio): gVisor, katacontainers?
+
+containerd and cri-o will provide support for the 3 possible values of `userNamespaceMode`.
+
+### Risks and Mitigations
+
+#### Breaking Existing Workloads
+
+Some features that could not work when the host user namespace is not shared are:
+
+- **Some Capabilities**:
+  The Linux kernel takes into consideration the user namespace a process is
+  running in while performing the capabilities check. There are some
+  capabilities that are only available in the root (host) user namespace such as
+  `CAP_SYS_TIME`, `CAP_SYS_MODULE` & `CAP_MKNOD`.
+
+- **Sharing Host Namespaces**:
+  There are some limitations in the Linux kernel and in the runtimes that
+  prevents sharing other host namespaces when the host user namespace is not
+  shared.
+  TODO(Mauricio): Put links to those limitations?
+
+In order to avoid breaking existing workloads `Host` is the default value of `userNamespaceMode`.
+
+#### Duplicated Snapshots of Container Images
+
+The owners of the files of a container image have to been set accordingly to the
+ID mappings used for that container. The runtimes perform a `chown` operation
+over the image snapshot when it's pulled. This presents a risk as it potentially
+increases the time and the storage needed to handle the container images.
+
+There is not immediate mitigation for it, [we
+talked](https://lists.linuxfoundation.org/pipermail/containers/2020-September/042230.html)
+to the Linux kernel community and [they
+replied](https://lists.linuxfoundation.org/pipermail/containers/2020-September/042230.html)
+they are working on a solution for it.
+
+Another risk is exausting the disk space on the nodes if pods are repeativily
+started and stopped while using `Pod` mode. Since `Pod` mode is planned for
+phase 2 we haven't consider a mitigation for this case.
+
+## Implementation Phases
+
+The implemenation of this KEP in a single phase is complicated as there are many
+discussions to be done. We learned from previous attempts to bring this support in
+that it should be done in small steps to avoid losing the focus on the
+discussion. It's also true that a full plan should be agreed at the beginning to
+avoid changing the implementation drastically in further phases.
+
+This proposal implementation aims to be divided in the following phases:
+
+### Phase 1
+
+This first phase includes:
+ - Extend the PodSpec with the `userNamespaceMode` field.
+ - Extend the CRI with user and ID mappings fields.
+ - Implement support for `Host` and `Cluster` user namespace modes.
+
+The goal of this phase is to implement some initial user namespace support
+providing pod-to-host isolation and supporting volumes. The implementation of
+the `Pod` mode is out of scope in this phase because it requires a non
+negligible amount of work and we could risk losing the focus failing to deliver
+this feature.
+
+### Phase 2
+
+This phase aims to implement the `Pod` mode. After this phase is completed the
+full advantanges of user namespaces could be used in some cases (stateless
+workloads).
+
+### Phase 2+
+
+The default value of `userNamespaceMode` should be set to `Pod` so pods that
+don't set this field can also get the security benefits of user namespaces. It's
+not clear yet what should be the process to make this happen as this is a
+potentially non backwards compatible change. It's specially relevant for
+workloads not compatible with user namespaces.
+
+A [host defaulting mechanishm](#host-defaulting-mechanishm) could help to make
+this transiction smoother, but this proposal doesn't go into details as it
+mainly focuses in phase 1.
+
+TODO(Mauricio):
+- Should we describe that once the default is Pod we should implement a control for it on PSP / OPA?
+- Should we mention that it's possible that users in future would be able to set the mappings?
+
+## Design Details
+
+This section only focuses on phase 1 as specified above.
+
+### Summary of the Proposed Changes
+
+- Extend the CRI to have a user namespace mode and the user and group ID mappings.
+- Add a `userNamespaceMode` field to the pod spec.
+- Add the cluster-wide ID mappings to the kubelet configuration file.
+- Add a `UserNamespacesSupport` feature flag to enable / disable the user.
+  namespaces support.
+- Update owner of ephemeral volumes populated by the kubelet.
+
+### PodSpec Changes
+
+`v1.PodSpec` is extended with a new `UserNamesapceMode` field:
+
+```
+const (
+	UserNamespaceModeHost PodUserNamespaceMode = "Host"
+	UserNamespaceModeCluster PodUserNamespaceMode = "Cluster"
+	UserNamespaceModePod PodUserNamespaceMode = "Pod"
+)
+
+type PodSpec struct {
+...
+	// UserNamespaceMode controls how user namespaces are used for this Pod.
+	// Three modes are supported:
+	// "Host": The pod shares the host user namespace. (default value).
+	// "Cluster": The pod uses a cluster-wide configured ID mappings.
+	// "Pod": The pod gets a non-overlapping ID mappings range.
+	// +k8s:conversion-gen=false
+	// +optional
+	UserNamespaceMode PodUserNamespaceMode `json:"userNamespaceMode,omitempty" protobuf:"bytes,36,opt name=userNamespaceMode"`
+...
+```
+
+### CRI API Changes
+
+The CRI has to be extended to allow kubelet to specify the user namespace mode
+and the ID mappings for a pod.
+[`NamespaceOption`](https://github.com/kubernetes/cri-api/blob/1eae59a7c4dee45a900f54ea2502edff7e57fd68/pkg/apis/runtime/v1alpha2/api.proto#L228)
+is extended with two new fields:
+- A `user` `NamespaceMode` that defines if the pod should run in an independent
+  user namespace (`POD`) or if it should share the host user namespace
+  (`NODE`).
+- The ID mappings to be used if the user namespace mode is `POD`.
+
+```
+// LinuxIDMapping represents a single user namespace mapping in Linux.
+message LinuxIDMapping {
+    // container_id is the starting ID for the mapping inside the container.
+    uint32 container_id = 1;
+    // host_id is the starting ID for the mapping on the host.
+    uint32 host_id = 2;
+    // number of IDs in this mapping.
+    uint32 length = 3;
+}
+
+// LinuxUserNamespaceConfig represents the user and group ID mapping in Linux.
+message LinuxUserNamespaceConfig {
+    // uid_mappings is an array of user ID mappings.
+    repeated LinuxIDMapping uid_mappings = 1;
+    // gid_mappings is an array of group ID mappings.
+    repeated LinuxIDMapping gid_mappings = 2;
+}
+
+// NamespaceOption provides options for Linux namespaces.
+message NamespaceOption {
+    ...
+    // User namespace for this container/sandbox.
+    // Note: There is currently no way to set CONTAINER scoped user namespace in the Kubernetes API.
+    // Namespaces currently set by the kubelet: POD, NODE
+    Namespace user = 5;
+    // ID mappings to use when the user namespace mode is POD.
+    LinuxUserNamespaceConfig mappings = 6;
+}
+```
+
+### Test Plan
+
+TBD
+
+
+
+### Graduation Criteria
+
+TBD
+Mauricio: Should we require Pod mode to be implemented to switch to Beta?
+
+#### Alpha -> Beta
+
+- Future Complete:
+  - `Pod` mode implemented
+
+#### Beta -> GA
+
+
+
+### Upgrade / Downgrade Strategy
+
+
+
+### Version Skew Strategy
+
+The container runtime will have to be updated in the nodes to support this feature.
+
+The new `user` field in the `NamespaceOption` will be ignored by an old runtime
+without user namespaces support. The container will be placed in the host user
+namespace. It's a responsibility of the user to guarante that a runtime
+supporting user namespaces is used.
+
+If an old version of kubelet without user namespaces support could cause some
+issues too. In this case the runtime could wrongly infer that the `user` field
+is set to `POD` in the `NamespaceOption` message. To avoid this problem the
+runtime should check if the `mappings` field contains any mappings, an error
+should be raised otherwise.
+
+## Production Readiness Review Questionnaire
+
+
+
+### Feature Enablement and Rollback
+
+* **How can this feature be enabled / disabled in a live cluster?**
+  - [X] Feature gate
+    - Feature gate name: UserNamespacesSupport
+    - Components depending on the feature gate: kubelet
+
+* **Does enabling the feature change any default behavior?**
+  The default mode for usernamespaces is `Host`, so the default behaviour is not changed.
+
+* **Can the feature be disabled once it has been enabled (i.e. can we roll back
+  the enablement)?**
+  Yes, by disabling the `UserNamespacesSupport` feature gate.
+  The effective user and group IDs of the process in the host would be different
+  before and after disabling the feature for pods running in `Cluster` and `Pod`
+  modes. This could cause access issues to pods accessing files saved in
+  volumes.
+
+* **What happens if we reenable the feature if it was previously rolled back?**
+  The situation is very similar to the described above. The pod will be able to
+  access the files written when the feature was enabled but could have issues to
+  access those files written while the feature was disabled.
+
+* **Are there any tests for feature enablement/disablement?**
+  TBD
+
+### Rollout, Upgrade and Rollback Planning
+
+Will be added before transition to beta.
+
+* **How can a rollout fail? Can it impact already running workloads?**
+
+* **What specific metrics should inform a rollback?**
+
+* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
+
+* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
+fields of API types, flags, etc.?**
+
+
+### Monitoring Requirements
+
+Will be added before transition to beta.
+
+* **How can an operator determine if the feature is in use by workloads?**
+
+* **What are the SLIs (Service Level Indicators) an operator can use to determine
+the health of the service?**
+
+* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+
+* **Are there any missing metrics that would be useful to have to improve observability
+of this feature?**
+
+### Dependencies
+
+* **Does this feature depend on any specific services running in the cluster?**: No.
+
+### Scalability
+
+* **Will enabling / using this feature result in any new API calls?** No.
+
+* **Will enabling / using this feature result in introducing new API types?** No.
+
+* **Will enabling / using this feature result in any new calls to the cloud
+provider?** No.
+
+* **Will enabling / using this feature result in increasing size or count of
+the existing API objects?** Yes. The PodSpec will be increased. TODO(Mauricio): what is the increased size?
+
+* **Will enabling / using this feature result in increasing time taken by any
+operations covered by [existing SLIs/SLOs]?**
+  Yes. The runtime has to set correct ownership for the container image
+  before starting it.
+  TODO(Mauricio): check what are those SLIs/SLOs and if this case actually applies.
+
+* **Will enabling / using this feature result in non-negligible increase of
+resource usage (CPU, RAM, disk, IO, ...) in any components?**: No.
+
+### Troubleshooting
+
+Will be added before transition to beta.
+
+* **How does this feature react if the API server and/or etcd is unavailable?**
+
+* **What are other known failure modes?**
+
+* **What steps should be taken if SLOs are not being met to determine the problem?**
+
+## Implementation History
+
+
+
+## Drawbacks
+
+
+
+TBD:
+Some ideas
+- another configuration knob is added
+- user namespaces could make troubleshooting difficult
+- volumes are really trickly to handle
+- any performance issues?
+
+## Alternatives
+
+### Differences with Previous Proposal
+Even if this KEP is heavily based on the previous [Support Node-Level User
+Namespaces
+Remapping](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/node-usernamespace-remapping.md)
+proposal there are some big differences:
+- The previous proposal aimed to configure the ID mappings in the container
+  runtime instead of the kubelet. In this proposal this decision is made in the
+  kubelet because:
+  - It has knowledge of Kubernetes elements like volumes, pods, etc.
+  - Runtimes will be more simple as they don't have to implement logic to
+    allocate non-overlapping ID mappings.
+  - We keep the behaviour consistent among runtimes as kubelet will be the one
+    ordering what to do.
+- That proposal didn't consider having different ID mappings for each pod. Even
+  if it's not planned for the first phase, this KEP takes that into
+  consideration performing the needed changes in the CRI from the beginning.
+
+### Default Value for userNamespaceMode
+
+This proposal intends to have `Host` instead of `Pod` as default value for the
+user namespace mode. The rationale behind this decision is that it avoids
+breaking existing workloads that don't work with user namespaces. We are aware
+that this decision has the drawback that pods that have the `userNamespaceMode`
+set will not have the security advantages of user namespaces but we consider
+it's more important to keep compatibility with previous workloads.
+
+### Host Defaulting Mechanishm
+
+Previous proposals like [Node-Level UserNamespace
+implementation](https://github.com/kubernetes/kubernetes/pull/64005) had a
+mechanism to default to the host user namespace when the pod specification includes
+features that could be not compatible with user namespaces (similar to [Default host user
+namespace via experimental
+flag](https://github.com/kubernetes/kubernetes/pull/31169)).
+
+This proposal doesn't require a similar mechanishm given that the default mode
+is `Host` that works with all current existing workloads.
+
+## References
+
+- Support Node-Level User Namespaces Remapping design proposal document.
+  - https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/node-usernamespace-remapping.md
+- Node-Level UserNamespace implementation
+  - https://github.com/kubernetes/kubernetes/pull/64005
+- Support node-level user namespace remapping
+  - https://github.com/kubernetes/enhancements/issues/127
+- Default host user namespace via experimental flag
+  - https://github.com/kubernetes/kubernetes/pull/31169
+- Add support for experimental-userns-remap-root-uid and
+  experimental-userns-remap-root-gid options to match the remapping used by the
+  container runtime
+  - https://github.com/kubernetes/kubernetes/pull/55707
+- Track Linux User Namespaces in the pod Security Policy
+  - https://github.com/kubernetes/kubernetes/issues/59152
diff --git a/keps/sig-node/127-usernamespaces-support/kep.yaml b/keps/sig-node/127-usernamespaces-support/kep.yaml
new file mode 100644
index 00000000000..111dc604659
--- /dev/null
+++ b/keps/sig-node/127-usernamespaces-support/kep.yaml
@@ -0,0 +1,19 @@
+title: Support User Namespaces
+kep-number: 127
+authors:
+  - "@mauriciovasquezbernal"
+  - "@rata"
+  - "@alban"
+owning-sig: sig-node
+participating-sigs: []
+status: provisional
+creation-date: 2020-07-21
+reviewers:
+  - "@mrunalp"
+approvers:
+  - TBD
+feature-gates:
+  - name: UserNamespacesSupport
+    components:
+      - kubelet
+disable-supported: true

From dec612d4498b8d9f5c7a4340298cd7de71d424ea Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mauricio=20V=C3=A1squez?=
Date: Mon, 28 Sep 2020 07:46:06 -0500
Subject: [PATCH 02/89] IDs mapping -> ID mappings

---
 keps/sig-node/127-usernamespaces-support/README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md
index 6a4f9aa771f..309d19910e5 100644
--- a/keps/sig-node/127-usernamespaces-support/README.md
+++ b/keps/sig-node/127-usernamespaces-support/README.md
@@ -287,10 +287,10 @@ that updates permissions and ownership of the files to be accesible by the
 #### Container Runtime Support
 
 - **Docker**:
-  It only supports a [single IDs
-  mapping](https://docs.docker.com/engine/security/userns-remap/) shared by all
-  containers running in the host. There is not support for [multiple IDs
-  mapping](https://github.com/moby/moby/issues/28593) yet. Dockershim runtime is
+  It only supports a [single ID
+  mappings](https://docs.docker.com/engine/security/userns-remap/) shared by all
+  containers running in the host. There is not support for [multiple ID
+  mappings](https://github.com/moby/moby/issues/28593) yet. Dockershim runtime is
   only compatible with pods running in `Host` and `Cluster` modes. The user has
   to guarantee that the ID mappings configured in Docker through the
   `userns-remap` and the cluster-wide range configured in the Kubelet are the

From 7e07f8593fdc6f76f12aa2616fd432e29769bc46 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mauricio=20V=C3=A1squez?=
Date: Mon, 28 Sep 2020 07:53:36 -0500
Subject: [PATCH 03/89] grammar

---
 keps/sig-node/127-usernamespaces-support/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md
index 309d19910e5..63216cb1ff4 100644
--- a/keps/sig-node/127-usernamespaces-support/README.md
+++ b/keps/sig-node/127-usernamespaces-support/README.md
@@ -345,7 +345,7 @@ they are working on a solution for it.
 
 Another risk is exausting the disk space on the nodes if pods are repeativily
 started and stopped while using `Pod` mode. Since `Pod` mode is planned for
-phase 2 we haven't consider a mitigation for this case.
+phase 2 we haven't considered a mitigation for this case.
 
 ## Implementation Phases
 

From ffcf42a04a46e197fe54c96cfbd7f1934b214166 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mauricio=20V=C3=A1squez?=
Date: Mon, 28 Sep 2020 08:15:04 -0500
Subject: [PATCH 04/89] Speficically -> Specifically

---
 keps/sig-node/127-usernamespaces-support/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md
index 63216cb1ff4..52e11001323 100644
--- a/keps/sig-node/127-usernamespaces-support/README.md
+++ b/keps/sig-node/127-usernamespaces-support/README.md
@@ -172,7 +172,7 @@ namespace.
 
 The goal of supporting user namespaces in Kubernetes is to be able to run
 processes in pods with a different user and group IDs than in the host.
-Speficically, a privileged process in the pod runs as an unprivileged process in the
+Specifically, a privileged process in the pod runs as an unprivileged process in the
 host. If such a process is able to break into the host, it'll have limited
 impact as it'll be running as an unprivileged user there.
 

From 35cb5eaff692c2e53e162fd84834429ba809ea4e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mauricio=20V=C3=A1squez?=
Date: Mon, 28 Sep 2020 08:39:51 -0500
Subject: [PATCH 05/89] some wording changes

---
 .../127-usernamespaces-support/README.md | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md
index 52e11001323..a5f1b8951df 100644
--- a/keps/sig-node/127-usernamespaces-support/README.md
+++ b/keps/sig-node/127-usernamespaces-support/README.md
@@ -198,8 +198,7 @@ TODO(Mauricio)
 ## Proposal
 
 This proposal aims to support user namespaces in Kubernetes by extending the pod
-specification with a new `userNamespaceMode` field. This proposal aims to
-support three modes.
+specification with a new `userNamespaceMode` field. This field can have 3 values:
 
 - **Host**:
   The pods are placed in the host user namespace, this is the current Kubernetes
@@ -289,13 +288,13 @@ that updates permissions and ownership of the files to be accesible by the
   It only supports a [single ID
   mappings](https://docs.docker.com/engine/security/userns-remap/) shared by all
   containers running in the host. There is not support for [multiple ID
-  mappings](https://github.com/moby/moby/issues/28593) yet. Dockershim runtime is
-  only compatible with pods running in `Host` and `Cluster` modes. The user has
-  to guarantee that the ID mappings configured in Docker through the
-  `userns-remap` and the cluster-wide range configured in the Kubelet are the
-  same. The dockershim implementation includes a check to verify that the IDs
-  mapping received from the Kubelet are equal to the ones configured in Docker,
-  returning an error otherwise.
+  mappings](https://github.com/moby/moby/issues/28593) yet. Dockershim runtime
+  is only compatible with pods running in `Host` and `Cluster` modes. The user
+  has to guarantee that the ID mappings configured in Docker through the
+  `userns-remap` parameter and the cluster-wide range configured in the Kubelet
+  are the same. The dockershim implementation includes a check to verify that
+  the IDs mapping received from the Kubelet are equal to the ones configured in
+  Docker, returning an error otherwise.
- **containerd**: It's quite straigtforward to implement the CRI changes proposed below in containerd/cri, we did it in From a6291070b56c4e8b0be2f659f39b43a93eaabebc Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 08:40:02 -0500 Subject: [PATCH 06/89] add high uid images risk --- keps/sig-node/127-usernamespaces-support/README.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index a5f1b8951df..b40668eefb5 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -346,6 +346,19 @@ Another risk is exausting the disk space on the nodes if pods are repeativily started and stopped while using `Pod` mode. Since `Pod` mode is planned for phase 2 we haven't considered a mitigation for this case. +#### Container Images with High IDs + +There are container images designed to run with high user and group IDs. It's +possible that the IDs range assigned to the pod is not big enough to accomodate +these IDs, in this case they will be mapped to the `nobody` user in the host. + +It's not a big problem in the `Cluster` case, the users have to be sure that +they provide a range accomodating these IDs. It's more difficult to handle in +the `Pod` case as the logic to allocate the ranges for each pod has to take this +information in consideration. It's likely that this requires some changes to the +CRI and kubelet so the runtimes can inform the kubelet what are the IDs present +on a specific container image. 
+ ## Implementation Phases The implemenation of this KEP in a single phase is complicated as there are many From 5fca05592a700556c6fee82a9ae10ae6f7189a66 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 08:45:13 -0500 Subject: [PATCH 07/89] Update keps/sig-node/127-usernamespaces-support/README.md Co-authored-by: Alban Crequy --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index b40668eefb5..52a6e30b257 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -149,7 +149,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release* Container security consists of many different kernel features that work together to make containers secure. User namespaces isolate user and group IDs by allowing processes to run with different IDs in the container and in the host. -Specially, a process running as privileged in a container can be seen as +Specially, a process running as privileged in a container can be unprivileged in the host. This gives more capabilities to the containers and protects the host from malicious or compromised containers. 
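The "Container Images with High IDs" risk added in patch 06 comes down to whether an ID used by the image falls inside the range mapped for the pod. A minimal sketch of that check in Go follows. It assumes a single contiguous mapping in the (container ID, host ID, length) form used by `/proc/[pid]/uid_map`. The `IDMapping` type, the `toHostID` helper, and the example values (host base 100000, length 65536) are illustrative assumptions, not part of this KEP or of any Kubernetes API.

```go
package main

import "fmt"

// IDMapping is a hypothetical type describing one contiguous ID mapping:
// Length IDs starting at ContainerID map to the IDs starting at HostID.
type IDMapping struct {
	ContainerID uint32
	HostID      uint32
	Length      uint32
}

// toHostID translates an ID seen inside the container to the host ID.
// The boolean is false when the ID falls outside the mapped range.
func toHostID(m IDMapping, containerID uint32) (uint32, bool) {
	if containerID < m.ContainerID || containerID-m.ContainerID >= m.Length {
		return 0, false
	}
	return m.HostID + (containerID - m.ContainerID), true
}

func main() {
	// Illustrative values only.
	m := IDMapping{ContainerID: 0, HostID: 100000, Length: 65536}
	for _, id := range []uint32{0, 1000, 200000} {
		if host, ok := toHostID(m, id); ok {
			fmt.Printf("container ID %d -> host ID %d\n", id, host)
		} else {
			fmt.Printf("container ID %d is outside the mapped range\n", id)
		}
	}
}
```

With these example values an image that runs as ID 200000 has no host ID, which is the situation the risk describes. A per-pod allocator would need this information before it picks a range.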
From 471392e0ad13721b34827351976843ee39d54e63 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mauricio=20V=C3=A1squez?=
Date: Mon, 28 Sep 2020 08:45:57 -0500
Subject: [PATCH 08/89] Update keps/sig-node/127-usernamespaces-support/README.md

Co-authored-by: Alban Crequy
---
 keps/sig-node/127-usernamespaces-support/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md
index 52a6e30b257..ec26dd6af29 100644
--- a/keps/sig-node/127-usernamespaces-support/README.md
+++ b/keps/sig-node/127-usernamespaces-support/README.md
@@ -150,7 +150,7 @@ Container security consists of many different kernel features that work together
 to make containers secure. User namespaces isolate user and group IDs by
 allowing processes to run with different IDs in the container and in the host.
 Specially, a process running as privileged in a container can be
-unprivileged in the host. This gives more capabilities to the containers and
+unprivileged in the host. This makes it possible to give more capabilities to the containers and
 protects the host from malicious or compromised containers.
 
 This KEP is a continuation of the work initiated in the [Support Node-Level User
From 58c29b887636d7d61dd6fc76152f2824bd6fdcf7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mauricio=20V=C3=A1squez?=
Date: Mon, 28 Sep 2020 08:47:05 -0500
Subject: [PATCH 09/89] Update keps/sig-node/127-usernamespaces-support/README.md

Co-authored-by: Alban Crequy
---
 keps/sig-node/127-usernamespaces-support/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md
index ec26dd6af29..e0d8baef7af 100644
--- a/keps/sig-node/127-usernamespaces-support/README.md
+++ b/keps/sig-node/127-usernamespaces-support/README.md
@@ -177,7 +177,7 @@ host. If such a process is able to break into the host, it'll have limited
 impact as it'll be running as an unprivileged user there.
 
 There have been some security vulnerabilities that could have been mitigated by
-user namespaces. Some examples are:
+user namespaces and it is expected that using user namespaces would mitigate against some of the future vulnerabilities. Some examples are:
 - CVE-2016-8867: Privilege escalation inside containers
   - https://github.com/moby/moby/issues/27590
 - CVE-2018-15664: TOCTOU race attack that allows to read/write files in the host
From 6f6e745341d2306733c9c98337d5753b4c86a9f4 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mauricio=20V=C3=A1squez?=
Date: Mon, 28 Sep 2020 08:48:10 -0500
Subject: [PATCH 10/89] Update keps/sig-node/127-usernamespaces-support/README.md

Co-authored-by: Alban Crequy
---
 keps/sig-node/127-usernamespaces-support/README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md
index e0d8baef7af..c0ec92bd55f 100644
--- a/keps/sig-node/127-usernamespaces-support/README.md
+++ b/keps/sig-node/127-usernamespaces-support/README.md
@@ -190,6 +190,8 @@ user namespaces and it is expected that using user namespaces would mitigate aga
 - Increase node to pod isolation in Kubernetes by mapping user and group IDs
   inside the container to different IDs in the host. In particular, mapping root
   inside the container to unprivileged user and group IDs in the node.
+- Making it possible to run workloads that need "dangerous" capabilities such as `CAP_SYS_ADMIN` without impacting the host.
+- Benefit from the security hardening that user namespaces are expected to provide against some of the future unknown runtime vulnerabilities
 
 ### Non-Goals
 
From 2c77cd427ddb380fd7e0e76d0630304d6ae67816 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mauricio=20V=C3=A1squez?=
Date: Mon, 28 Sep 2020 08:57:40 -0500
Subject: [PATCH 11/89] add non-goals

---
 keps/sig-node/127-usernamespaces-support/README.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md
index c0ec92bd55f..2289dbc2d4a 100644
--- a/keps/sig-node/127-usernamespaces-support/README.md
+++ b/keps/sig-node/127-usernamespaces-support/README.md
@@ -195,7 +195,15 @@ user namespaces and it is expected that using user namespaces would mitigate aga
 
 ### Non-Goals
 
-TODO(Mauricio)
+- Provide a way to run the Kubelet process or container runtimes as an
+  unprivileged process. Although initiatives like
+  [usernetes](https://github.com/rootless-containers/usernetes) and this KEP
+  both make use of user namespaces, it is a different implementation for a
+  different purpose.
+- Mounting volumes in pods with a user ID mapping. Although the authors of this
+  KEP would like to have this feature in the future, this is out of scope of
+  this KEP. The complexity of this would probably require to write a separate
+  KEP.
 ## Proposal
 
From afa14a10aba35b187edf000abded49a57b11fda4 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mauricio=20V=C3=A1squez?=
Date: Mon, 28 Sep 2020 08:58:12 -0500
Subject: [PATCH 12/89] Update keps/sig-node/127-usernamespaces-support/README.md

Co-authored-by: Alban Crequy
---
 keps/sig-node/127-usernamespaces-support/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md
index 2289dbc2d4a..50afca2bc35 100644
--- a/keps/sig-node/127-usernamespaces-support/README.md
+++ b/keps/sig-node/127-usernamespaces-support/README.md
@@ -245,7 +245,7 @@ capability to give any extra privilege on the host.
 #### Story 2
 
 As a cluster admin, I want to allow some pods to share the host user namespace
-if they need a feature only available in such user namespace.
+if they need a feature only available in such user namespace, such as loading a kernel module with `CAP_SYS_MODULE`.
 
 ### Notes/Constraints/Caveats
 
From 95b79624fc150ef98f022a87e475eaac425f4d69 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mauricio=20V=C3=A1squez?=
Date: Mon, 28 Sep 2020 09:01:34 -0500
Subject: [PATCH 13/89] Update keps/sig-node/127-usernamespaces-support/README.md

Co-authored-by: Alban Crequy
---
 keps/sig-node/127-usernamespaces-support/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md
index 50afca2bc35..0851d7b864d 100644
--- a/keps/sig-node/127-usernamespaces-support/README.md
+++ b/keps/sig-node/127-usernamespaces-support/README.md
@@ -227,7 +227,7 @@ specification with a new `userNamespaceMode` field. This field can have 3 values
 
 - **Pod**:
   Each pod is placed in a different user namespace and has a different and
-  non-overlapping ID mappings. This mode is intended for stateless pods, i.e.
+  non-overlapping ID mapping. This mode is intended for stateless pods, i.e.
   pods using only ephemeral volumes like `configMap,` `downwardAPI`, `secret`,
   `projected` and `emptyDir`. This mode not only provides host-to-pod isolation
   but also pod-to-pod isolation as each pod has a different range of effective
From c2b1dc71c730e4d07388bf5c16f1bef40f75ff68 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mauricio=20V=C3=A1squez?=
Date: Mon, 28 Sep 2020 09:29:36 -0500
Subject: [PATCH 14/89] Update keps/sig-node/127-usernamespaces-support/README.md

Co-authored-by: Alban Crequy
---
 keps/sig-node/127-usernamespaces-support/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md
index 0851d7b864d..1f68b245523 100644
--- a/keps/sig-node/127-usernamespaces-support/README.md
+++ b/keps/sig-node/127-usernamespaces-support/README.md
@@ -271,7 +271,7 @@ In regard to this proposal, volumes can be divided in ephemeral and non-ephemera
 Ephemeral volumes are associated to a **single** pod and their lifecyle is
 dependent on that pod. These are `configMap`, `secret`, `emptyDir`,
 `downwardAPI`, etc. These kind of volumes are easy to handle as they are not
-shared by different pods and hence all the process accessig those volumes have
+shared by different pods and hence all the processes accessing those volumes have
 the same effective user and group IDs. Kubelet creates the files for those
 volumes and it can easily set the file ownership too.
From 81285d2d689a2e4bfd277a05a5c33c19c6815ec2 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mauricio=20V=C3=A1squez?=
Date: Mon, 28 Sep 2020 09:30:29 -0500
Subject: [PATCH 15/89] Update keps/sig-node/127-usernamespaces-support/README.md

Co-authored-by: Alban Crequy
---
 keps/sig-node/127-usernamespaces-support/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md
index 1f68b245523..2159bcd170c 100644
--- a/keps/sig-node/127-usernamespaces-support/README.md
+++ b/keps/sig-node/127-usernamespaces-support/README.md
@@ -282,7 +282,7 @@ different strategies:
   don't have access permissions for `others` because the effective user and
   group IDs on the host are the same for all the pods.
 - The semantics of semantics of `fsGroup` are respected, if it's specified it's
-  assumed to be the correct GID in the host and an 1-to-1 mapping entry for the
+  assumed to be the correct GID in the host and a 1-to-1 mapping entry for the
   `fsGroup` is added to the GID mappings for the pod.
 
 There are some cases that require special attention from the user.
For instance, From bec362eb278ddf123742dfa905f7cb721af9542d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 09:51:57 -0500 Subject: [PATCH 16/89] add some notes about special capabilities in user namespaces --- keps/sig-node/127-usernamespaces-support/README.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 2159bcd170c..cf2b13f7f8c 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -331,6 +331,17 @@ Some features that could not work when the host user namespace is not shared are capabilities that are only available in the root (host) user namespace such as `CAP_SYS_TIME`, `CAP_SYS_MODULE` & `CAP_MKNOD`. + If a pod is given one of those capabilities it will still be deployed, but the + capability will be ineffective and processes using those capabilities will + fail. This is not impacting the implementation in Kubernetes. If users need + the capability to be effective, they should use `userNamespaceMode=Host`. + + The list of such capabilities is likely to change from one Linux version to + another. For example, Linux now has [time + namespaces](https://man7.org/linux/man-pages/man7/time_namespaces.7.html) and + there are ways to make `CAP_SYS_TIME` work inside a user namespace. There are + also discussions to make `CAP_MKNOD` work in user namespaces. 
+ - **Sharing Host Namespaces**: There are some limitations in the Linux kernel and in the runtimes that prevents sharing other host namespaces when the host user namespace is not From 33e1bf415aa36e8953821abc6019bff7960de888 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 09:57:21 -0500 Subject: [PATCH 17/89] add details about sharing host namespaces --- keps/sig-node/127-usernamespaces-support/README.md | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index cf2b13f7f8c..89421cd7de3 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -344,9 +344,17 @@ Some features that could not work when the host user namespace is not shared are - **Sharing Host Namespaces**: There are some limitations in the Linux kernel and in the runtimes that - prevents sharing other host namespaces when the host user namespace is not + prevent sharing other host namespaces when the host user namespace is not shared. - TODO(Mauricio): Put links to those limitations? + - Mounting `mqueue` (`/dev/mqueue`) is not allowed from a process in a user + namespace that does not own the IPC namespace. Pods with `hostIPC=true` and + `userNamespaceMode=Pod|Cluster` can fail. + - Mounting `procfs` (`/proc`) is not allowed from a process in a user namespace + that does not own the PID namespace. Pods with `hostPID=true` and + `userNamespaceMode=Pod|Cluster` can fail. + - Mounting `sysfs` (`/sys`) is not allowed from a process in a user namespace + that does not own the network namespace. Impact: pods with + `hostNetwork=true` and `userNamespaceMode=Pod|Cluster` can fail. In order to avoid breaking existing workloads `Host` is the default value of `userNamespaceMode`. 
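The host-namespace caveats added in patch 17 can be read as a simple rule: sharing the host IPC, PID, or network namespace is risky whenever the pod is not also in the host user namespace. The Go sketch below expresses that rule as a warning function. It is an illustration under stated assumptions: the `UserNamespaceMode` type, the `podNamespaces` struct, and `hostNamespaceWarnings` are hypothetical names, not the PodSpec changes or validation code proposed by this KEP. An empty mode is treated as `Host` because the KEP makes `Host` the default.

```go
package main

import "fmt"

// UserNamespaceMode mirrors the three values proposed for the
// `userNamespaceMode` field. The Go type itself is hypothetical.
type UserNamespaceMode string

const (
	ModeHost    UserNamespaceMode = "Host"
	ModeCluster UserNamespaceMode = "Cluster"
	ModePod     UserNamespaceMode = "Pod"
)

// podNamespaces is a hypothetical, trimmed-down view of the pod fields
// involved in the caveats listed in the patch.
type podNamespaces struct {
	UserNamespaceMode UserNamespaceMode
	HostIPC           bool
	HostPID           bool
	HostNetwork       bool
}

// hostNamespaceWarnings returns one warning per host namespace that is shared
// while the pod runs outside the host user namespace.
func hostNamespaceWarnings(p podNamespaces) []string {
	// An empty mode is treated as Host, the default value.
	if p.UserNamespaceMode == ModeHost || p.UserNamespaceMode == "" {
		return nil
	}
	var warnings []string
	if p.HostIPC {
		warnings = append(warnings, "hostIPC=true: mounting mqueue (/dev/mqueue) can fail")
	}
	if p.HostPID {
		warnings = append(warnings, "hostPID=true: mounting procfs (/proc) can fail")
	}
	if p.HostNetwork {
		warnings = append(warnings, "hostNetwork=true: mounting sysfs (/sys) can fail")
	}
	return warnings
}

func main() {
	p := podNamespaces{UserNamespaceMode: ModePod, HostPID: true, HostNetwork: true}
	for _, w := range hostNamespaceWarnings(p) {
		fmt.Println(w)
	}
}
```

The sketch only warns instead of rejecting the pod because the patch says such pods "can fail", not that they always do.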
From 5fb56c995ff1ebb7f7a44203b4e924bf9f965f85 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 10:08:43 -0500 Subject: [PATCH 18/89] add note about image duplication --- keps/sig-node/127-usernamespaces-support/README.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 89421cd7de3..264f9e156d8 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -371,6 +371,10 @@ to the Linux kernel community and [they replied](https://lists.linuxfoundation.org/pipermail/containers/2020-September/042230.html) they are working on a solution for it. +If the Linux kernel provides a solution for this problem, that would be +something that container runtimes should use. It does not impact the kubelet nor +the CRI gRPC spec. + Another risk is exausting the disk space on the nodes if pods are repeativily started and stopped while using `Pod` mode. Since `Pod` mode is planned for phase 2 we haven't considered a mitigation for this case. 
From 6a120947ec091d233e6cc1da8106c9402dcca65c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 10:08:43 -0500 Subject: [PATCH 19/89] add note about image duplication --- keps/sig-node/127-usernamespaces-support/README.md | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 89421cd7de3..54f34a2a98e 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -361,15 +361,21 @@ In order to avoid breaking existing workloads `Host` is the default value of `us #### Duplicated Snapshots of Container Images The owners of the files of a container image have to been set accordingly to the -ID mappings used for that container. The runtimes perform a `chown` operation -over the image snapshot when it's pulled. This presents a risk as it potentially -increases the time and the storage needed to handle the container images. +ID mappings used for that container. For example, if the user 0 in the container +is mapped to the host user 100000, then the `/root` directory has to be owned by +user ID 100000 in the host, so it appears to belong to root in the container. +The current implementation in container runtimes is to recursively perform a +`chown` operation over the image snapshot when it's pulled. This presents a risk +as it potentially increases the time and the storage needed to handle the +container images. There is not immediate mitigation for it, [we talked](https://lists.linuxfoundation.org/pipermail/containers/2020-September/042230.html) to the Linux kernel community and [they replied](https://lists.linuxfoundation.org/pipermail/containers/2020-September/042230.html) -they are working on a solution for it. +they are working on a solution for it. 
If the Linux kernel provides a solution +for this problem, that would be something that container runtimes should use. It +does not impact the kubelet nor the CRI gRPC spec. Another risk is exausting the disk space on the nodes if pods are repeativily started and stopped while using `Pod` mode. Since `Pod` mode is planned for From 19d29349c1d439c67a49d9137dc6a900a3f4a3af Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 10:54:16 -0500 Subject: [PATCH 20/89] future phases: add some of the open points for those phases. --- .../127-usernamespaces-support/README.md | 47 ++++++++++++------- 1 file changed, 30 insertions(+), 17 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 6c67cfd8ee5..0b0f695b4e3 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -421,27 +421,40 @@ the `Pod` mode is out of scope in this phase because it requires a non negligible amount of work and we could risk losing the focus failing to deliver this feature. -### Phase 2 +### Future Phases -This phase aims to implement the `Pod` mode. After this phase is completed the +These phase aims to implement the `Pod` mode. After this phase is completed the full advantanges of user namespaces could be used in some cases (stateless workloads). -### Phase 2+ - -The default value of `userNamespaceMode` should be set to `Pod` so pods that -don't set this field can also get the security benefits of user namespaces. It's -not clear yet what should be the process to make this happen as this is a -potentially non backwards compatible change. It's specially relevant for -workloads not compatible with user namespaces. - -A [host defaulting mechanishm](#host-defaulting-mechanishm) could help to make -this transiction smoother, but this proposal doesn't go into details as it -mainly focuses in phase 1. 
- -TODO(Mauricio): -- Should we describe that once the default is Pod we should implement a control for it on PSP / OPA? -- Should we mention that it's possible that users in future would be able to set the mappings? +There are some things that have to be studied with more detail for these +phase(s) but are not needed for phase 1, hence they are not discussed in detail: + +- **Pod Default Mode**: + It's not clear yet what should be the process to make this happen as this is a + potentially non backwards compatible change. It's specially relevant for + workloads not compatible with user namespaces. A [host defaulting + mechanishm](#host-defaulting-mechanishm) could help to make this transiction + smoother. +- **Duplicated Snapshots of Container Images**: + It's not clear when and how this support will land in the Linux Kernel. +- **ID Mappings Allocation Algorithm** + The `Pod` mode requires to have each pod in different and non-overlapping ID mapping. It requires to implement an algorithm that performs that allocation. There are some open questions about it: + - What should be the length of the mapping assigned to each Pod? + - How to get the ID mapping range of a running Pod when kubelet crashes? + - Can the user specify the ID mappings for a pod? +- **High IDs in Container Images**: + The IDs present on the image are not available as image metadata. The runtimes + would have to perform an image check, like analysing the `/etc/password` file, + to discover what those IDs are. The kubelet and the CRI will require some + changes to make this information available to the ID mappings allocator + algorithm. It would have to be sure that the allocated mappings include those + IDs and should have some logic to protect against special crafted images to + perform a kind of DOS allocating too many IDs for a given container. +- **Security Considerations**: + Once `Pod` is the default mode, it is needed to control who can use `Host` and + `Cluster` modes. 
This could be done through Pod Security Policies if they are + available at that time. ## Design Details From 11f90bc25ef8d5d1a74393a4d85ad96d5be237ce Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 10:55:51 -0500 Subject: [PATCH 21/89] spelling --- keps/sig-node/127-usernamespaces-support/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 0b0f695b4e3..6d9ca49d874 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -388,11 +388,11 @@ phase 2 we haven't considered a mitigation for this case. #### Container Images with High IDs There are container images designed to run with high user and group IDs. It's -possible that the IDs range assigned to the pod is not big enough to accomodate +possible that the IDs range assigned to the pod is not big enough to accommodate these IDs, in this case they will be mapped to the `nobody` user in the host. It's not a big problem in the `Cluster` case, the users have to be sure that -they provide a range accomodating these IDs. It's more difficult to handle in +they provide a range accommodating these IDs. It's more difficult to handle in the `Pod` case as the logic to allocate the ranges for each pod has to take this information in consideration. 
It's likely that this requires some changes to the CRI and kubelet so the runtimes can inform the kubelet what are the IDs present From a574784d404db9a84b2af232159c95e29d0264d9 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 10:56:03 -0500 Subject: [PATCH 22/89] update toc --- keps/sig-node/127-usernamespaces-support/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 6d9ca49d874..19960b52549 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -76,10 +76,10 @@ SIG Architecture for cross-cutting KEPs). - [Risks and Mitigations](#risks-and-mitigations) - [Breaking Existing Workloads](#breaking-existing-workloads) - [Duplicated Snapshots of Container Images](#duplicated-snapshots-of-container-images) + - [Container Images with High IDs](#container-images-with-high-ids) - [Implementation Phases](#implementation-phases) - [Phase 1](#phase-1) - - [Phase 2](#phase-2) - - [Phase 2+](#phase-2-1) + - [Future Phases](#future-phases) - [Design Details](#design-details) - [Summary of the Proposed Changes](#summary-of-the-proposed-changes) - [PodSpec Changes](#podspec-changes) From 55b87162dc5ae55fbae1711422e47ec4a88dbdd8 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 11:16:46 -0500 Subject: [PATCH 23/89] Update keps/sig-node/127-usernamespaces-support/README.md Co-authored-by: rata --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 19960b52549..49a23d4aca1 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -296,7 +296,7 @@ that updates 
permissions and ownership of the files to be accesible by the #### Container Runtime Support - **Docker**: - It only supports a [single ID +Docker only supports a [single ID mappings](https://docs.docker.com/engine/security/userns-remap/) shared by all containers running in the host. There is not support for [multiple ID mappings](https://github.com/moby/moby/issues/28593) yet. Dockershim runtime From acd89eb238c2f361beb7388eb15c27a5227b0134 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 11:17:32 -0500 Subject: [PATCH 24/89] Update keps/sig-node/127-usernamespaces-support/README.md Co-authored-by: rata --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 49a23d4aca1..15cf9c00c0b 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -312,7 +312,7 @@ Docker only supports a [single ID [this](https://github.com/kinvolk/containerd-cri/commits/mauricio/userns_poc) PoC. - **cri-o**: - It recently [added](https://github.com/cri-o/cri-o/pull/3944) support for +CRI-O recently [added](https://github.com/cri-o/cri-o/pull/3944) support for user namespaces through a pod annotation. The extensions to make it work with the current CRI changes are small. - TODO(Mauricio): gVisor, katacontainers? 
From 53c42eac18c7fc5e5fc909572842c46ac0de1260 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mauricio=20V=C3=A1squez?=
Date: Mon, 28 Sep 2020 11:18:18 -0500
Subject: [PATCH 25/89] Update keps/sig-node/127-usernamespaces-support/README.md

Co-authored-by: rata
---
 keps/sig-node/127-usernamespaces-support/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md
index 15cf9c00c0b..865a2b49ce6 100644
--- a/keps/sig-node/127-usernamespaces-support/README.md
+++ b/keps/sig-node/127-usernamespaces-support/README.md
@@ -253,7 +253,7 @@ if they need a feature only available in such user namespace, such as loading a
 
 The Linux kernel uses the effective user and group IDs (the ones the host) to
 check the file access permissions. Since with user namespaces IDs are mapped to
-a different value on the host, this could cause issues accessing volumes if the
+a different value on the host, this causes issues accessing volumes if the
 pod is run with a different mapping, i.e. the effective user and group IDs on
 the host change.
From 0f57423b5f379bdb80d9e3a23c05efde65ad07b5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 11:21:42 -0500 Subject: [PATCH 26/89] Update keps/sig-node/127-usernamespaces-support/README.md Co-authored-by: rata --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 865a2b49ce6..0833edcc60b 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -495,7 +495,7 @@ type PodSpec struct { ### CRI API Changes -The CRI has to be extended to allow kubelet to specify the user namespace mode +The CRI is extended to (optionally) specify the user namespace mode and the ID mappings for a pod. [`NamespaceOption`](https://github.com/kubernetes/cri-api/blob/1eae59a7c4dee45a900f54ea2502edff7e57fd68/pkg/apis/runtime/v1alpha2/api.proto#L228) is extended with two new fields: From 36d9084f1a66c0e98d0baecc8a8e640906fb093e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 11:30:56 -0500 Subject: [PATCH 27/89] add template comments back --- .../127-usernamespaces-support/README.md | 137 ++++++++++++++++-- 1 file changed, 126 insertions(+), 11 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 0833edcc60b..dcc7a2525bc 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -60,6 +60,22 @@ SIG Architecture for cross-cutting KEPs). 
--> # KEP-127: Support User Namespaces + + + + - [Release Signoff Checklist](#release-signoff-checklist) - [Summary](#summary) @@ -146,6 +162,25 @@ Items marked with (R) are required *prior to targeting to a milestone / release* ## Summary + + Container security consists of many different kernel features that work together to make containers secure. User namespaces isolate user and group IDs by allowing processes to run with different IDs in the container and in the host. @@ -160,6 +195,15 @@ proposal. ## Motivation + + From [user_namespaces(7)](https://man7.org/linux/man-pages/man7/user_namespaces.7.html): > User namespaces isolate security-related identifiers and attributes, in particular, user IDs and group IDs, the root directory, keys, and capabilities. @@ -187,6 +231,11 @@ user namespaces and it is expected that using user namespaces would mitigate aga ### Goals + + - Increase node to pod isolation in Kubernetes by mapping user and group IDs inside the container to different IDs in the host. In particular, mapping root inside the container to unprivileged user and group IDs in the node. @@ -195,6 +244,11 @@ user namespaces and it is expected that using user namespaces would mitigate aga ### Non-Goals + + - Provide a way to run the Kubelet process or container runtimes as an unprivileged process. Although initiatives like [usernetes](https://github.com/rootless-containers/usernetes) and this KEP @@ -207,6 +261,14 @@ user namespaces and it is expected that using user namespaces would mitigate aga ## Proposal + + This proposal aims to support user namespaces in Kubernetes by extending the pod specification with a new `userNamespaceMode` field. This field can have 3 values: @@ -235,6 +297,13 @@ specification with a new `userNamespaceMode` field. 
This field can have 3 values ### User Stories + + #### Story 1 As a cluster admin, I want run some pods with privileged capabilities because @@ -249,6 +318,13 @@ if they need a feature only available in such user namespace, such as loading a ### Notes/Constraints/Caveats + + #### Volumes Support The Linux kernel uses the effective user and group IDs (the ones the host) to @@ -321,6 +397,18 @@ containerd and cri-o will provide support for the 3 possible values of `userName ### Risks and Mitigations + + #### Breaking Existing Workloads Some features that could not work when the host user namespace is not shared are: @@ -458,6 +546,14 @@ phase(s) but are not needed for phase 1, hence they are not discussed in detail: ## Design Details + + + This section only focuses on phase 1 as specified above. ### Summary of the Proposed Changes @@ -537,8 +633,6 @@ message NamespaceOption { ### Test Plan -TBD - -### Graduation Criteria - TBD -Mauricio: Should we require Pod mode to be implemented to switch to Beta? -#### Alpha -> Beta - -- Future Complete: - - `Pod` mode implemented - -#### Beta -> GA +### Graduation Criteria +TBD +Mauricio: Should we require Pod mode to be implemented to switch to Beta? + +#### Alpha -> Beta + +- Future Complete: + - `Pod` mode implemented + +#### Beta -> GA + ### Upgrade / Downgrade Strategy + The container runtime will have to be updated in the nodes to support this feature. 
The new `user` field in the `NamespaceOption` will be ignored by an old runtime @@ -796,6 +905,12 @@ Some ideas ## Alternatives + + ### Differences with Previous Proposal Even if this KEP is heavily based on the previous [Support Node-Level User Namespaces From e08023ce8794fa2e545d19a5d6aa41b538bad846 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 11:58:25 -0500 Subject: [PATCH 28/89] clarify behaviour on volumes and different user namespaces modes --- .../127-usernamespaces-support/README.md | 21 ++++++++++++------- 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index dcc7a2525bc..a7c7bc22b97 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -346,14 +346,15 @@ In regard to this proposal, volumes can be divided in ephemeral and non-ephemera Ephemeral volumes are associated to a **single** pod and their lifecyle is dependent on that pod. These are `configMap`, `secret`, `emptyDir`, -`downwardAPI`, etc. These kind of volumes are easy to handle as they are not -shared by different pods and hence all the processes accessing those volumes have -the same effective user and group IDs. Kubelet creates the files for those -volumes and it can easily set the file ownership too. - -Non-ephemeral volumes more difficult to support since they can be persistent and -also can be shared by multiple pods. This proposal supports volumes with two -different strategies: +`downwardAPI`, etc. These kind of volumes can work with any of the three +different modes of `userNamespaceMode` as they are not shared by different pods +and hence all the processes accessing those volumes have the same effective user +and group IDs. Kubelet creates the files for those volumes and it can easily set +the file ownership too. 
+ Non-ephemeral volumes are more difficult to support since they can be persistent +and shared by multiple pods. This proposal supports volumes with two different +strategies: - The `Cluster` makes it easier for pods to share files using volumes when those don't have access permissions for `others` because the effective user and group IDs on the host are the same for all the pods. @@ -361,6 +362,10 @@ different strategies: assumed to be the correct GID in the host and a 1-to-1 mapping entry for the `fsGroup` is added to the GID mappings for the pod. +This KEP doesn't impose any restriction on the different volumes and +`userNamespaceMode` combinations and leaves it to users to choose the correct +combinations based on their specific needs. + There are some cases that require special attention from the user. For instance, a process inside a pod will not be able to access files with mode `0700` and owned by a user different than the effective user of that process in a volume From 5ea31b17aaa5b1acc8d5c328c3f2eb3615c28d5b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 12:05:59 -0500 Subject: [PATCH 29/89] grammager: could -> can in some places --- .../127-usernamespaces-support/README.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index a7c7bc22b97..1c332a2a665 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -335,7 +335,7 @@ the host change. This proposal supports volume without changing the user and group IDs and leaves that problem to the user to manage. Future Linux kernel features such as shiftfs -could allow to different pods to see a volume with its own IDs but it is out of +could allow different pods to see a volume with its own IDs but it is out of scope of this proposal.
Among the possible future kernel solutions, we can list: - [shiftfs: uid/gid shifting filesystem](https://lwn.net/Articles/757650/) @@ -517,7 +517,7 @@ this feature. ### Future Phases These phase aims to implement the `Pod` mode. After this phase is completed the -full advantanges of user namespaces could be used in some cases (stateless +full advantanges of user namespaces can be used in some cases (stateless workloads). There are some things that have to be studied with more detail for these @@ -527,7 +527,7 @@ phase(s) but are not needed for phase 1, hence they are not discussed in detail: It's not clear yet what should be the process to make this happen as this is a potentially non backwards compatible change. It's specially relevant for workloads not compatible with user namespaces. A [host defaulting - mechanishm](#host-defaulting-mechanishm) could help to make this transiction + mechanishm](#host-defaulting-mechanishm) can help to make this transiction smoother. - **Duplicated Snapshots of Container Images**: It's not clear when and how this support will land in the Linux Kernel. @@ -546,7 +546,7 @@ phase(s) but are not needed for phase 1, hence they are not discussed in detail: perform a kind of DOS allocating too many IDs for a given container. - **Security Considerations**: Once `Pod` is the default mode, it is needed to control who can use `Host` and - `Cluster` modes. This could be done through Pod Security Policies if they are + `Cluster` modes. This can be done through Pod Security Policies if they are available at that time. ## Design Details @@ -761,8 +761,8 @@ without user namespaces support. The container will be placed in the host user namespace. It's a responsibility of the user to guarante that a runtime supporting user namespaces is used. -If an old version of kubelet without user namespaces support could cause some -issues too. 
In this case the runtime could wrongly infer that the `user` field +An old version of kubelet without user namespaces support can cause some +issues too. In this case the runtime can wrongly infer that the `user` field is set to `POD` in the `NamespaceOption` message. To avoid this problem the runtime should check if the `mappings` field contains any mappings, an error should be raised otherwise. @@ -808,12 +808,12 @@ you need any help or guidance. Yes, by disabling the `UserNamespacesSupport` feature gate. The effective user and group IDs of the process in the host would be different before and after disabling the feature for pods running in `Cluster` and `Pod` - modes. This could cause access issues to pods accessing files saved in + modes. This can cause access issues to pods accessing files saved in volumes. * **What happens if we reenable the feature if it was previously rolled back?** The situation is very similar to the described above. The pod will be able to - access the files written when the feature was enabled but could have issues to + access the files written when the feature was enabled but can have issues to access those files written while the feature was disabled. * **Are there any tests for feature enablement/disablement?** From cf4fe58e253e0bde94d36cd43f6c8e47c78ffe79 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 12:07:13 -0500 Subject: [PATCH 30/89] remove duplicated sentence --- keps/sig-node/127-usernamespaces-support/README.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 1c332a2a665..b117160654a 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -470,10 +470,6 @@ they are working on a solution for it. 
If the Linux kernel provides a solution for this problem, that would be something that container runtimes should use. It does not impact the kubelet nor the CRI gRPC spec. -If the Linux kernel provides a solution for this problem, that would be -something that container runtimes should use. It does not impact the kubelet nor -the CRI gRPC spec. - Another risk is exausting the disk space on the nodes if pods are repeativily started and stopped while using `Pod` mode. Since `Pod` mode is planned for phase 2 we haven't considered a mitigation for this case. From ca679dd9816683b3198023970971f74454c4c87a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 12:14:09 -0500 Subject: [PATCH 31/89] add clarification about shared host namespace --- keps/sig-node/127-usernamespaces-support/README.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index b117160654a..5e6c58a7c79 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -449,6 +449,12 @@ Some features that could not work when the host user namespace is not shared are that does not own the network namespace. Impact: pods with `hostNetwork=true` and `userNamespaceMode=Pod|Cluster` can fail. + If users specify `userNamespaceMode=Pod|Cluster` and one of these + `host{IPC,PID,Network}=true` options, runc will currently fail to start the + container. The kubelet does **not** try to prevent that combination of options, + in case runc or the kernel make it possible in the future to use that + combination. + In order to avoid breaking existing workloads `Host` is the default value of `userNamespaceMode`. #### Duplicated Snapshots of Container Images @@ -512,7 +518,7 @@ this feature. ### Future Phases -These phase aims to implement the `Pod` mode. 
After this phase is completed the +These phases aim to implement the `Pod` mode. After this phase is completed the full advantanges of user namespaces can be used in some cases (stateless workloads). From 143f822cf17da61921869f64b061b41512eb62fb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 12:15:01 -0500 Subject: [PATCH 32/89] Update keps/sig-node/127-usernamespaces-support/README.md Co-authored-by: Alban Crequy --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 5e6c58a7c79..90f41af6f4d 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -760,7 +760,7 @@ The container runtime will have to be updated in the nodes to support this featu The new `user` field in the `NamespaceOption` will be ignored by an old runtime without user namespaces support. The container will be placed in the host user -namespace. It's a responsibility of the user to guarante that a runtime +namespace. It's a responsibility of the user to guarantee that a runtime supporting user namespaces is used. An old version of kubelet without user namespaces support can cause some From 515460d9e396892da38eb68577fca458fa5279b9 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 13:09:02 -0500 Subject: [PATCH 33/89] plural vs singular --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 5e6c58a7c79..902b585c307 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -518,7 +518,7 @@ this feature. 
### Future Phases -These phases aim to implement the `Pod` mode. After this phase is completed the +These phases aim to implement the `Pod` mode. After these phases are completed the full advantanges of user namespaces can be used in some cases (stateless workloads). From 517b8178c993c214080e6043c3332df54e2bb119 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 13:15:17 -0500 Subject: [PATCH 34/89] Update keps/sig-node/127-usernamespaces-support/README.md Co-authored-by: rata --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index d170a98983b..dcd73ff4fac 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -416,7 +416,7 @@ Consider including folks who also work outside the SIG or subproject. #### Breaking Existing Workloads -Some features that could not work when the host user namespace is not shared are: +Some features that don't work when the host user namespace is not shared are: - **Some Capabilities**: The Linux kernel takes into consideration the user namespace a process is From 742abd2e6134a1042506ed5cf53d6e8a8df478ef Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 13:57:46 -0500 Subject: [PATCH 35/89] Improve summary --- keps/sig-node/127-usernamespaces-support/README.md | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index d170a98983b..e5de8a5d299 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -186,12 +186,18 @@ to make containers secure. 
User namespaces isolate user and group IDs by allowing processes to run with different IDs in the container and in the host. Specially, a process running as privileged in a container can be unprivileged in the host. This makes it possible to give more capabilities to the containers and -protects the host from malicious or compromised containers. - -This KEP is a continuation of the work initiated in the [Support Node-Level User +protects the host and other containers from malicious or compromised containers. + +This KEP adds a new `userNamespaceMode` field to `pod.Spec`. It allows users to +place pods in different user namespaces increasing the pod-to-pod and +pod-to-host isolation. This extra isolation increases the cluster security as it +protects the host and other pods from malicious or compromised processes inside +containers that are able to break into the host. This KEP proposes three +different modes: `Host` keeps the current behaviour, `Cluster` uses the same +ID mapping for all the pods (very similar to the previous [Support Node-Level User Namespaces Remapping](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/node-usernamespace-remapping.md) -proposal. +proposal) and `Pod` increases pod-to-pod isolation. 
## Motivation From b0fa8468bb363656aab28a9788b5e5204a5c6cc8 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 14:00:43 -0500 Subject: [PATCH 36/89] Update keps/sig-node/127-usernamespaces-support/README.md Co-authored-by: rata --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 6a11b1c5783..ea0590117d6 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -226,7 +226,7 @@ Specifically, a privileged process in the pod runs as an unprivileged process in host. If such a process is able to break into the host, it'll have limited impact as it'll be running as an unprivileged user there. -There have been some security vulnerabilities that could have been mitigated by +The following security vulnerabilities are mitigated with user namespaces and it is expected that using user namespaces would mitigate against some of the future vulnerabilities. Some examples are: - CVE-2016-8867: Privilege escalation inside containers - https://github.com/moby/moby/issues/27590 From d1d275369270ce5eb797b93d61c52cb5f713941a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 14:02:34 -0500 Subject: [PATCH 37/89] small reword --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index ea0590117d6..f22ce286f1e 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -227,7 +227,7 @@ host. If such a process is able to break into the host, it'll have limited impact as it'll be running as an unprivileged user there. 
The following security vulnerabilities are mitigated with -user namespaces and it is expected that using user namespaces would mitigate against some of the future vulnerabilities. Some examples are: +user namespaces and it is expected that using them would mitigate against some of the future vulnerabilities. - CVE-2016-8867: Privilege escalation inside containers - https://github.com/moby/moby/issues/27590 - CVE-2018-15664: TOCTOU race attack that allows to read/write files in the host From a638e1518ef1956d5662505ed5f64961fddbe62f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 14:35:56 -0500 Subject: [PATCH 38/89] making -> make --- keps/sig-node/127-usernamespaces-support/README.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index f22ce286f1e..e7ebaf581e3 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -245,8 +245,10 @@ know that this has succeeded? - Increase node to pod isolation in Kubernetes by mapping user and group IDs inside the container to different IDs in the host. In particular, mapping root inside the container to unprivileged user and group IDs in the node. -- Making it possible to run workloads that need "dangerous" capabilities such as `CAP_SYS_ADMIN` without impacting the host. -- Benefit from the security hardening that user namespaces are expected to provide against some of the future unknown runtime vulnerabilities +- Make it possible to run workloads that need "dangerous" capabilities such as + `CAP_SYS_ADMIN` without impacting the host. 
+- Benefit from the security hardening that user namespaces are expected to + provide against some of the future unknown runtime vulnerabilities ### Non-Goals From 7910c4eab8b3e124b1ea64ef33bebe5095af8d64 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 14:45:26 -0500 Subject: [PATCH 39/89] Kubelet -> kubelet --- keps/sig-node/127-usernamespaces-support/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index e7ebaf581e3..dcb16b0068a 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -257,7 +257,7 @@ What is out of scope for this KEP? Listing non-goals helps to focus discussion and make progress. --> -- Provide a way to run the Kubelet process or container runtimes as an +- Provide a way to run the kubelet process or container runtimes as an unprivileged process. Although initiatives like [usernetes](https://github.com/rootless-containers/usernetes) and this KEP both make use of user namespaces, it is a different implementation for a @@ -391,9 +391,9 @@ Docker only supports a [single ID mappings](https://github.com/moby/moby/issues/28593) yet. Dockershim runtime is only compatible with pods running in `Host` and `Cluster` modes. The user has to guarantee that the ID mappings configured in Docker through the - `userns-remap` parameter and the cluster-wide range configured in the Kubelet + `userns-remap` parameter and the cluster-wide range configured in the kubelet are the same. The dockershim implementation includes a check to verify that - the IDs mapping received from the Kubelet are equal to the ones configured in + the IDs mapping received from the kubelet are equal to the ones configured in Docker, returning an error otherwise. 
- **containerd**: It's quite straigtforward to implement the CRI changes proposed below in From bab57f5651a4cf2aad9f85ce255bb000199ffecc Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 28 Sep 2020 15:01:26 -0500 Subject: [PATCH 40/89] typos and small rewording --- keps/sig-node/127-usernamespaces-support/README.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index dcb16b0068a..fd717de31e5 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -248,7 +248,7 @@ know that this has succeeded? - Make it possible to run workloads that need "dangerous" capabilities such as `CAP_SYS_ADMIN` without impacting the host. - Benefit from the security hardening that user namespaces are expected to - provide against some of the future unknown runtime vulnerabilities + provide against some of the future unknown runtime vulnerabilities. ### Non-Goals @@ -503,7 +503,7 @@ on a specific container image. ## Implementation Phases -The implemenation of this KEP in a single phase is complicated as there are many +The implementation of this KEP in a single phase is complicated as there are many discussions to be done. We learned from previous attempts to bring this support in that it should be done in small steps to avoid losing the focus on the discussion. It's also true that a full plan should be agreed at the beginning to @@ -543,8 +543,8 @@ phase(s) but are not needed for phase 1, hence they are not discussed in detail: It's not clear when and how this support will land in the Linux Kernel. - **ID Mappings Allocation Algorithm** The `Pod` mode requires to have each pod in different and non-overlapping ID mapping. It requires to implement an algorithm that performs that allocation. 
There are some open questions about it: - - What should be the length of the mapping assigned to each Pod? - - How to get the ID mapping range of a running Pod when kubelet crashes? + - What should be the length of the mapping assigned to each pod? + - How to get the ID mapping range of a running pod when kubelet crashes? - Can the user specify the ID mappings for a pod? - **High IDs in Container Images**: The IDs present on the image are not available as image metadata. The runtimes @@ -576,7 +576,7 @@ This section only focuses on phase 1 as specified above. - Extend the CRI to have a user namespace mode and the user and group ID mappings. - Add a `userNamespaceMode` field to the pod spec. - Add the cluster-wide ID mappings to the kubelet configuration file. -- Add a `UserNamespacesSupport` feature flag to enable / disable the user. +- Add a `UserNamespacesSupport` feature flag to enable / disable the user namespaces support. - Update owner of ephemeral volumes populated by the kubelet. @@ -769,7 +769,7 @@ The container runtime will have to be updated in the nodes to support this featu The new `user` field in the `NamespaceOption` will be ignored by an old runtime without user namespaces support. The container will be placed in the host user namespace. It's a responsibility of the user to guarantee that a runtime -supporting user namespaces is used. +supporting user namespaces is used when this feature is enabled. An old version of kubelet without user namespaces support can cause some issues too. In this case the runtime can wrongly infer that the `user` field @@ -960,7 +960,6 @@ mechanism to default to the host user namespace when the pod specification inclu features that could be not compatible with user namespaces (similar to [Default host user namespace via experimental flag](https://github.com/kubernetes/kubernetes/pull/31169)). 
- This proposal doesn't require a similar mechanishm given that the default mode is `Host` that works with all current existing workloads. From c7cff4ff42e0ebe6079482f1547910472414bf01 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Tue, 29 Sep 2020 08:06:01 -0500 Subject: [PATCH 41/89] add id mapping sentence to pod mode --- keps/sig-node/127-usernamespaces-support/README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index fd717de31e5..294f5b0b0d6 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -197,7 +197,8 @@ different modes: `Host` keeps the current behaviour, `Cluster` uses the same ID mapping for all the pods (very similar to the previous [Support Node-Level User Namespaces Remapping](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/node-usernamespace-remapping.md) -proposal) and `Pod` increases pod-to-pod isolation. +proposal) and `Pod` increases pod-to-pod isolation by giving each pod a +different and non-overlapping ID mapping. ## Motivation From c9ab05e0d8047b699a09282bb87d23b3d61ef625 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Tue, 29 Sep 2020 08:09:30 -0500 Subject: [PATCH 42/89] out of the container to the host --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 294f5b0b0d6..42924e0258c 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -224,7 +224,7 @@ namespace. The goal of supporting user namespaces in Kubernetes is to be able to run processes in pods with a different user and group IDs than in the host. 
Specifically, a privileged process in the pod runs as an unprivileged process in the -host. If such a process is able to break into the host, it'll have limited +host. If such a process is able to break out of the container to the host, it'll have limited impact as it'll be running as an unprivileged user there. The following security vulnerabilities are mitigated with From 70090673584c38e4c69bc294f4509b9301cd11a1 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Tue, 29 Sep 2020 08:10:15 -0500 Subject: [PATCH 43/89] highly privileged --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 42924e0258c..9c3b2f8151a 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -246,7 +246,7 @@ know that this has succeeded? - Increase node to pod isolation in Kubernetes by mapping user and group IDs inside the container to different IDs in the host. In particular, mapping root inside the container to unprivileged user and group IDs in the node. -- Make it possible to run workloads that need "dangerous" capabilities such as +- Make it possible to run workloads that need highly privileged capabilities such as `CAP_SYS_ADMIN` without impacting the host. - Benefit from the security hardening that user namespaces are expected to provide against some of the future unknown runtime vulnerabilities. 
From 561838aa3572865f950c4e323ef3163e21e362e0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Tue, 29 Sep 2020 08:10:41 -0500 Subject: [PATCH 44/89] runtime and kernel --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 9c3b2f8151a..8bc806b226b 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -249,7 +249,7 @@ know that this has succeeded? - Make it possible to run workloads that need highly privileged capabilities such as `CAP_SYS_ADMIN` without impacting the host. - Benefit from the security hardening that user namespaces are expected to - provide against some of the future unknown runtime vulnerabilities. + provide against some of the future unknown runtime and kernel vulnerabilities. ### Non-Goals From 1825e647568ae297c5cb0d5d85ee5541c7d57e43 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Tue, 29 Sep 2020 08:22:38 -0500 Subject: [PATCH 45/89] clarify non-goal --- keps/sig-node/127-usernamespaces-support/README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 8bc806b226b..fd2cf3e2cad 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -263,10 +263,10 @@ and make progress. [usernetes](https://github.com/rootless-containers/usernetes) and this KEP both make use of user namespaces, it is a different implementation for a different purpose. -- Mounting volumes in pods with a user ID mapping. Although the authors of this - KEP would like to have this feature in the future, this is out of scope of - this KEP. 
The complexity of this would probably require to write a separate - KEP. +- Supporting a shiftfs or similar solution once it's available in the kernel. + Although the authors of this KEP would like to support this feature once it's + available, this is out of scope of this KEP. The complexity of this would + probably require writing a separate KEP. ## Proposal From fc3eb24303d2af380c359edb4e94df8c63f4ab84 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Tue, 29 Sep 2020 08:27:41 -0500 Subject: [PATCH 46/89] give -> grant me --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index fd2cf3e2cad..6b071a2a01f 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -318,7 +318,7 @@ bogged down. As a cluster admin, I want run some pods with privileged capabilities because the applications in the pods require it (e.g. `CAP_SYS_ADMIN` to mount a FUSE filesystem or `CAP_NET_ADMIN` to setup a VPN) but I don't want this extra -capability to give any extra privilege on the host. +capability to grant me any extra privilege on the host.
#### Story 2 From eebe2416080cf5a732927250a40b386c282630ee Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Tue, 29 Sep 2020 08:28:25 -0500 Subject: [PATCH 47/89] These -> Such --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 6b071a2a01f..7d20c9429af 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -381,7 +381,7 @@ owned by a user different than the effective user of that process in a volume that doesn't support the semantics of `fsGroup` (doesn't support [`SetVolumeOwnership`](https://github.com/kubernetes/kubernetes/blob/00da04ba23d755d02a78d5021996489ace96aa4d/pkg/volume/volume_linux.go#L42) that updates permissions and ownership of the files to be accesible by the -`fsGroup` group ID). These pods should be run in `Host` mode. +`fsGroup` group ID). Such pods should be run in `Host` mode. #### Container Runtime Support From 994fbda1c87846c2c0dd9e367d507bcf9ecc23dd Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Tue, 29 Sep 2020 08:58:10 -0500 Subject: [PATCH 48/89] improve SLI / SLO question --- keps/sig-node/127-usernamespaces-support/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 7d20c9429af..1a8da2f8a35 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -876,9 +876,9 @@ the existing API objects?** Yes. The PodSpec will be increased. TODO(Mauricio): * **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs]?** - Yes. 
The runtime has to set correct ownership for the container image - before starting it. - TODO(Mauricio): check what are those SLIs/SLOs and if this case actually applies. + Yes. The startup latency of both stateless and stateful pods is increased as + the runtime has to set correct ownership for the container image before + starting them. * **Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?**: No. From ed58add071370ac8568d145ed35f0c13e6c0bd22 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Fri, 2 Oct 2020 08:46:49 -0500 Subject: [PATCH 49/89] reword goal --- keps/sig-node/127-usernamespaces-support/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 1a8da2f8a35..358c8a2ea94 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -246,8 +246,8 @@ know that this has succeeded? - Increase node to pod isolation in Kubernetes by mapping user and group IDs inside the container to different IDs in the host. In particular, mapping root inside the container to unprivileged user and group IDs in the node. -- Make it possible to run workloads that need highly privileged capabilities such as - `CAP_SYS_ADMIN` without impacting the host. +- Make it safer to run workloads that need highly privileged capabilities such as + `CAP_SYS_ADMIN`, reducing the risk of impacting the host. - Benefit from the security hardening that user namespaces are expected to provide against some of the future unknown runtime and kernel vulnerabilities.
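Note on the goal reworded in the patch above: mapping root inside the container to unprivileged user and group IDs on the node can be illustrated with a minimal Go sketch. The `IDMapping` type and the `ToHostID` helper are hypothetical names chosen for this illustration and are not part of the KEP; the three fields follow the (ID inside the namespace, ID outside the namespace, length) layout of `/proc/<pid>/uid_map` described in user_namespaces(7). The range starting at host ID 100000 is an arbitrary example value.

```go
package main

import "fmt"

// IDMapping is a hypothetical type for this sketch. It mirrors one line of
// /proc/<pid>/uid_map: IDs [ContainerID, ContainerID+Length) inside the user
// namespace map to [HostID, HostID+Length) on the host.
type IDMapping struct {
	ContainerID uint32
	HostID      uint32
	Length      uint32
}

// ToHostID translates an ID seen inside the user namespace to the effective
// ID on the host. The boolean is false when no mapping entry covers the ID.
func ToHostID(mappings []IDMapping, id uint32) (uint32, bool) {
	for _, m := range mappings {
		if id >= m.ContainerID && id-m.ContainerID < m.Length {
			return m.HostID + (id - m.ContainerID), true
		}
	}
	return 0, false
}

func main() {
	// Example value only: 65536 IDs starting at host ID 100000.
	mappings := []IDMapping{{ContainerID: 0, HostID: 100000, Length: 65536}}

	hostID, ok := ToHostID(mappings, 0) // root inside the container
	fmt.Println(hostID, ok)             // prints "100000 true"

	hostID, ok = ToHostID(mappings, 70000) // outside the mapped range
	fmt.Println(hostID, ok)                // prints "0 false"
}
```

With such a mapping, a process that escapes the container as UID 0 would act on the host as UID 100000, which is the isolation property the goals describe.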
From 5eea25ea1b72daa928d9fef04e83b9207bc80212 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Fri, 2 Oct 2020 08:53:30 -0500 Subject: [PATCH 50/89] small changes - typos - some clarifications --- keps/sig-node/127-usernamespaces-support/README.md | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 358c8a2ea94..a6de5ede4fd 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -538,7 +538,7 @@ phase(s) but are not needed for phase 1, hence they are not discussed in detail: It's not clear yet what should be the process to make this happen as this is a potentially non backwards compatible change. It's specially relevant for workloads not compatible with user namespaces. A [host defaulting - mechanishm](#host-defaulting-mechanishm) can help to make this transiction + mechanishm](#host-defaulting-mechanishm) can help to make this transition smoother. - **Duplicated Snapshots of Container Images**: It's not clear when and how this support will land in the Linux Kernel. @@ -558,7 +558,7 @@ phase(s) but are not needed for phase 1, hence they are not discussed in detail: - **Security Considerations**: Once `Pod` is the default mode, it is needed to control who can use `Host` and `Cluster` modes. This can be done through Pod Security Policies if they are - available at that time. + available at the time of implementing this phase. ## Design Details @@ -589,7 +589,6 @@ This section only focuses on phase 1 as specified above. const ( UserNamespaceModeHost PodUserNamespaceMode = "Host" UserNamespaceModeCluster PodUserNamespaceMode = "Cluster" - UserNamespaceModePod PodUserNamespaceMode = "Pod" ) type PodSpec struct { @@ -598,7 +597,6 @@ type PodSpec struct { // Three modes are supported: // "Host": The pod shares the host user namespace.
(default value). // "Cluster": The pod uses a cluster-wide configured ID mappings. - // "Pod": The pod gets a non-overlapping ID mappings range. // +k8s:conversion-gen=false // +optional UserNamespaceMode PodUserNamespaceMode `json:"userNamespaceMode,omitempty" protobuf:"bytes,36,opt name=userNamespaceMode"` @@ -949,8 +947,8 @@ proposal there are some big differences: This proposal intends to have `Host` instead of `Pod` as default value for the user namespace mode. The rationale behind this decision is that it avoids breaking existing workloads that don't work with user namespaces. We are aware -that this decision has the drawback that pods that have the `userNamespaceMode` -set will not have the security advantages of user namespaces but we consider +that this decision has the drawback that pods that don't set the `userNamespaceMode` +will not have the security advantages of user namespaces, however we consider it's more important to keep compatibility with previous workloads. ### Host Defaulting Mechanishm @@ -962,7 +960,7 @@ features that could be not compatible with user namespaces (similar to [Default namespace via experimental flag](https://github.com/kubernetes/kubernetes/pull/31169)). This proposal doesn't require a similar mechanishm given that the default mode -is `Host` that works with all current existing workloads. +is `Host`. 
## References From b6f5d7759378a0f4c4945b6cc82c1c114a9d1196 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Fri, 2 Oct 2020 09:09:37 -0500 Subject: [PATCH 51/89] swap CRI and PodSpec changes --- .../127-usernamespaces-support/README.md | 44 +++++++++---------- 1 file changed, 22 insertions(+), 22 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index a6de5ede4fd..7bcde5a396c 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -581,28 +581,6 @@ This section only focuses on phase 1 as specified above. namespaces support. - Update owner of ephemeral volumes populated by the kubelet. -### PodSpec Changes - -`v1.PodSpec` is extended with a new `UserNamesapceMode` field: - -``` -const ( - UserNamespaceModeHost PodUserNamespaceMode = "Host" - UserNamespaceModeCluster PodUserNamespaceMode = "Cluster" -) - -type PodSpec struct { -... - // UserNamespaceMode controls how user namespaces are used for this Pod. - // Three modes are supported: - // "Host": The pod shares the host user namespace. (default value). - // "Cluster": The pod uses a cluster-wide configured ID mappings. - // +k8s:conversion-gen=false - // +optional - UserNamespaceMode PodUserNamespaceMode `json:"userNamespaceMode,omitempty" protobuf:"bytes,36,opt name=userNamespaceMode"` -... -``` - ### CRI API Changes The CRI is extended to (optionally) specify the user namespace mode @@ -645,6 +623,28 @@ message NamespaceOption { } ``` +### PodSpec Changes + +`v1.PodSpec` is extended with a new `UserNamesapceMode` field: + +``` +const ( + UserNamespaceModeHost PodUserNamespaceMode = "Host" + UserNamespaceModeCluster PodUserNamespaceMode = "Cluster" +) + +type PodSpec struct { +... + // UserNamespaceMode controls how user namespaces are used for this Pod. + // Three modes are supported: + // "Host": The pod shares the host user namespace. 
(default value). + // "Cluster": The pod uses a cluster-wide configured ID mappings. + // +k8s:conversion-gen=false + // +optional + UserNamespaceMode PodUserNamespaceMode `json:"userNamespaceMode,omitempty" protobuf:"bytes,36,opt name=userNamespaceMode"` +... +``` + ### Test Plan -TBD -Mauricio: Should we require Pod mode to be implemented to switch to Beta? - -#### Alpha -> Beta - -- Future Complete: - - `Pod` mode implemented - -#### Beta -> GA +Will be added when targeting a release. ### Upgrade / Downgrade Strategy @@ -965,7 +957,7 @@ of this feature?** provider?** No. * **Will enabling / using this feature result in increasing size or count of -the existing API objects?** Yes. The PodSpec will be increased. TODO(Mauricio): what is the increased size? +the existing API objects?** Yes. The PodSpec will be increased. * **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs]?** From b820dd5da7ab752149fabffd33f00cc047373be1 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Mon, 5 Oct 2020 08:10:10 -0500 Subject: [PATCH 61/89] remove drawbacks ideas They are only needed if the KEP is rejected --- keps/sig-node/127-usernamespaces-support/README.md | 7 ------- 1 file changed, 7 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 493057f1f64..4f12fd627ee 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -997,13 +997,6 @@ Major milestones might include: Why should this KEP _not_ be implemented? --> -TBD: -Some ideas -- another configuration knob is added -- user namespaces could make troubleshooting difficult -- volumes are really trickly to handle -- any performance issues? - ## Alternatives -Will be added when targeting a release. +#### Alpha + +- [ ] Support for `Cluster` and `Host` modes implemented. 
+- [ ] Support implemented in CRI-O. +- [ ] Support implemented in containerd. +- [ ] Unit test coverage. +- [ ] Support for `Pod` mode discussed and implemented. + +#### Beta + +- [ ] Feedback from alpha is addressed. +- [ ] E2E test coverage. +- [ ] There are well-documented use cases of this feature. + +#### GA + +TBD ### Upgrade / Downgrade Strategy From ee4773065d479757d85e9928a3d7b1e492cb3c09 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Tue, 6 Oct 2020 13:16:04 -0500 Subject: [PATCH 73/89] update toc --- keps/sig-node/127-usernamespaces-support/README.md | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index b6fb6347914..a443166deed 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -99,13 +99,19 @@ tags, and then generate with `hack/update-toc.sh`. - [Design Details](#design-details) - [Summary of the Proposed Changes](#summary-of-the-proposed-changes) - [CRI API Changes](#cri-api-changes) - - [PodSpec Changes](#podspec-changes) + - [Add userNamespaceMode Field](#add-usernamespacemode-field) + - [Option 1: PodSpec](#option-1-podspec) + - [Option 2: PodSecurityContext](#option-2-podsecuritycontext) - [Configuring the Cluster ID Mappings](#configuring-the-cluster-id-mappings) + - [Option 1: Configure in Kubelet Configuration File](#option-1-configure-in-kubelet-configuration-file) + - [Option 2: Configure as a Cluster Parameter in kube-apiserver](#option-2-configure-as-a-cluster-parameter-in-kube-apiserver) + - [1-to-1 Mapping for fsGroup](#1-to-1-mapping-for-fsgroup) - [Updating Ownership of Ephemeral Volumes](#updating-ownership-of-ephemeral-volumes) - [Test Plan](#test-plan) - [Graduation Criteria](#graduation-criteria) - - [Alpha -> Beta](#alpha---beta) - - [Beta -> GA](#beta---ga) + - [Alpha](#alpha) + - [Beta](#beta) + -
[GA](#ga) - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) - [Version Skew Strategy](#version-skew-strategy) - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) From 21c20be33f314a7fe128e88d8fc2324acb024904 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Tue, 6 Oct 2020 13:18:43 -0500 Subject: [PATCH 74/89] add unresolved tags --- keps/sig-node/127-usernamespaces-support/README.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index a443166deed..44f199c002e 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -640,6 +640,8 @@ message NamespaceOption { The `userNamespaceMode` field can be added in two different places. This proposal presents the two possibilities to discuss with the community. +<<[UNRESOLVED where to put the userNamespaceMode field ]>> + #### Option 1: PodSpec Add it to `v1.PodSpec` following the rationale that other fields (`host{Network, @@ -684,12 +686,15 @@ type PodSecurityContext struct { UserNamespaceMode PodUserNamespaceMode `json:"userNamespaceMode,omitempty" protobuf:"bytes,11,opt name=userNamespaceMode"` ... ``` +<<[/UNRESOLVED]>> ### Configuring the Cluster ID Mappings This proposal considers two different ways to configure the ID mappings used for the `Cluster` mode. This is for discussion with the community and only one will be considered. +<<[UNRESOLVED where to configure the cluster wide ID mappings ]>> + #### Option 1: Configure in Kubelet Configuration File The ID mappings used for pods in `Cluster` mode are set in the kubelet @@ -738,6 +743,7 @@ This option considers setting this parameter on the kube-apiserver. - It's difficult to expose this parameter to the kubelet. - The parameter could not be available for the kubelet if the kube-apiserver is down. 
+<<[/UNRESOLVED]>> ### 1-to-1 Mapping for fsGroup From 2cc1ee38a82b95c01fa36e2ab9db9cbe6941d57e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Thu, 8 Oct 2020 09:24:33 -0500 Subject: [PATCH 75/89] clarify cluster mode uses different user namespaces --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 44f199c002e..ff8fcf207e1 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -201,7 +201,7 @@ place pods in different user namespaces increasing the pod-to-pod and pod-to-host isolation. This extra isolation increases the cluster security as it protects the host and other pods from malicious or compromised processes inside containers that are able to break into the host. This KEP proposes three -different modes: `Host` uses the host user namespace like the current behaviour, `Cluster` uses the same +different modes: `Host` uses the host user namespace like the current behaviour, `Cluster` uses a unique user namespace per pod but the same ID mapping for all the pods (very similar to the previous [Support Node-Level User Namespaces Remapping](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/node-usernamespace-remapping.md) From 2a4a6485c811a178115dff00b0984a2b55614522 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Thu, 8 Oct 2020 09:26:41 -0500 Subject: [PATCH 76/89] Update keps/sig-node/127-usernamespaces-support/README.md Co-authored-by: rata --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index ff8fcf207e1..64030880500 100644 --- 
a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -296,7 +296,7 @@ specification with a new `userNamespaceMode` field. This field can have 3 values set. - **Cluster**: - All the pods in the cluster are placed in a different user namespace but they + All the pods in the cluster are placed in a _unique_ user namespace but they use the same ID mappings. This mode doesn't provide full pod-to-pod isolation as all the pods with `Cluster` mode have the same effective IDs on the host. It provides pod-to-host isolation as the IDs are different inside the From b34fe36169ce6c57e25136ed42c5b49e3dd32e56 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Thu, 8 Oct 2020 09:26:53 -0500 Subject: [PATCH 77/89] Update keps/sig-node/127-usernamespaces-support/README.md Co-authored-by: rata --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 64030880500..34e5adb97d0 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -309,7 +309,7 @@ specification with a new `userNamespaceMode` field. This field can have 3 values non-overlapping ID mapping. This mode is intended for stateless pods, i.e. pods using only ephemeral volumes like `configMap,` `downwardAPI`, `secret`, `projected` and `emptyDir`. This mode not only provides host-to-pod isolation - but also pod-to-pod isolation as each pod has a different range of effective + but also pod-to-pod isolation as each pod has a different range of effective IDs in the host. 
### User Stories From 8a42548e7cf977bd4c23397dec1f5c98516481ac Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Thu, 8 Oct 2020 09:29:16 -0500 Subject: [PATCH 78/89] change verb --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 34e5adb97d0..c4947349db4 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -783,7 +783,7 @@ Ephemeral volumes use [`AtomicWriter`](https://github.com/kinvolk/kubernetes/blob/master/pkg/volume/util/atomic_writer.go) to create the files that are mounted to the containers. This component [has some logic](https://github.com/kinvolk/kubernetes/blob/c94242a7b1d238cc27aea9b6d45ccb9963e814bb/pkg/volume/util/atomic_writer.go#L403) -to update the ownership of those files in some cases. It can be extended to take +to update the ownership of those files in some cases. It will be extended to take the ID mappings into consideration when the pod runs in `Cluster` mode. ### Test Plan From 2392a05f9e67ac4532d2d89762885ebcafe42476 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Thu, 8 Oct 2020 09:29:37 -0500 Subject: [PATCH 79/89] clarify version skew --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index c4947349db4..26b8a565f2e 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -919,7 +919,7 @@ without user namespaces support. The container will be placed in the host user namespace. It's a responsibility of the user to guarantee that a runtime supporting user namespaces is used when this feature is enabled. 
-An old version of kubelet without user namespaces support can cause some +An old version of kubelet (without user namespaces support) used with a new container runtime (with user namespaces support) can cause some issues too. In this case the runtime can wrongly infer that the `user` field is set to `POD` in the `NamespaceOption` message. To avoid this problem the runtime should check if the `mappings` field contains any mappings, an error From fd1a6da107ef54a4ed28bac47f5012f192482dd0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Thu, 8 Oct 2020 09:43:49 -0500 Subject: [PATCH 80/89] make it clear pods have to be recreated --- keps/sig-node/127-usernamespaces-support/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 26b8a565f2e..32de2917a75 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -963,8 +963,8 @@ you need any help or guidance. * **Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?** - Yes, by disabling the `UserNamespacesSupport` feature gate. - The effective user and group IDs of the process in the host would be different + Yes, the `UserNamespacesSupport` feature gate has to be disabled and pods running in `Cluster` and `Pod` mode have to be recreated. + The effective user and group IDs of the processes would be different before and after disabling the feature for pods running in `Cluster` and `Pod` modes. This can cause access issues to pods accessing files saved in volumes. 
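Note on the version-skew text clarified in patch 79 above: the check it describes (a runtime that sees the `user` field as `POD` should verify that the `mappings` field actually contains mappings) could look roughly like the Go sketch below. All type and function names here are hypothetical and do not come from the CRI; the sketch also assumes that `POD` is the zero value an unset field decodes to, and that the runtime rejects the request when the mappings are missing, which the KEP text only hints at.

```go
package main

import (
	"errors"
	"fmt"
)

// NamespaceMode is a hypothetical stand-in for the CRI namespace mode enum.
type NamespaceMode int32

const (
	// ModePod is assumed to be the zero value, so a request from an old
	// kubelet that never sets the field decodes to it.
	ModePod NamespaceMode = iota
	ModeNode
)

// IDMapping is a hypothetical (container ID, host ID, length) triple.
type IDMapping struct{ ContainerID, HostID, Length uint32 }

// UserNamespaceOption is a hypothetical subset of the NamespaceOption message.
type UserNamespaceOption struct {
	Mode     NamespaceMode
	Mappings []IDMapping
}

// ErrMissingMappings signals a request that looks like POD mode but carries
// no ID mappings, as an old kubelet without user namespace support would send.
var ErrMissingMappings = errors.New("user namespace mode is POD but no ID mappings were provided")

// ValidateUserNamespace sketches the runtime-side check: POD mode is only
// accepted when at least one mapping is present.
func ValidateUserNamespace(opt UserNamespaceOption) error {
	if opt.Mode == ModePod && len(opt.Mappings) == 0 {
		return ErrMissingMappings
	}
	return nil
}

func main() {
	// Old kubelet: neither field is set, so the request is rejected.
	fmt.Println(ValidateUserNamespace(UserNamespaceOption{}))

	// New kubelet: POD mode with an explicit mapping is accepted.
	fmt.Println(ValidateUserNamespace(UserNamespaceOption{
		Mode:     ModePod,
		Mappings: []IDMapping{{0, 100000, 65536}},
	}))
}
```

Whether the runtime should fail the request or fall back to the host user namespace in the first case is a design decision for the KEP; the sketch only shows where the check would sit.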
From c368bb923e2bda4158833b6359189106908dc0fd Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Thu, 8 Oct 2020 09:59:30 -0500 Subject: [PATCH 81/89] rewrap text --- .../127-usernamespaces-support/README.md | 180 +++++++++--------- 1 file changed, 95 insertions(+), 85 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 32de2917a75..29388981ad0 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -192,8 +192,8 @@ updates. Container security consists of many different kernel features that work together to make containers secure. User namespaces isolate user and group IDs by allowing processes to run with different IDs in the container and in the host. -Specially, a process running as privileged in a container can be -unprivileged in the host. This makes it possible to give more capabilities to the containers and +Specially, a process running as privileged in a container can be unprivileged in +the host. This makes it possible to give more capabilities to the containers and protects the host and other containers from malicious or compromised containers. This KEP adds a new `userNamespaceMode` field to `pod.Spec`. It allows users to @@ -201,9 +201,9 @@ place pods in different user namespaces increasing the pod-to-pod and pod-to-host isolation. This extra isolation increases the cluster security as it protects the host and other pods from malicious or compromised processes inside containers that are able to break into the host. 
This KEP proposes three -different modes: `Host` uses the host user namespace like the current behaviour, `Cluster` uses a unique user namespace per pod but the same -ID mapping for all the pods (very similar to the previous [Support Node-Level User -Namespaces +different modes: `Host` uses the host user namespace like the current behaviour, +`Cluster` uses a unique user namespace per pod but the same ID mapping for all +the pods (very similar to the previous [Support Node-Level User Namespaces Remapping](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/node-usernamespace-remapping.md) proposal) and `Pod` increases pod-to-pod isolation by giving each pod a different and non-overlapping ID mapping. @@ -219,24 +219,25 @@ demonstrate the interest in a KEP within the wider Kubernetes community. [experience reports]: https://github.com/golang/go/wiki/ExperienceReports --> -From [user_namespaces(7)](https://man7.org/linux/man-pages/man7/user_namespaces.7.html): +From +[user_namespaces(7)](https://man7.org/linux/man-pages/man7/user_namespaces.7.html): > User namespaces isolate security-related identifiers and attributes, in particular, user IDs and group IDs, the root directory, keys, and capabilities. A process's user and group IDs can be different inside and outside a user namespace. In particular, a process can have a normal unprivileged user ID -outside a user namespace while at the same time having a user ID of 0 inside -the namespace; in other words, the process has full privileges for operations -inside the user namespace, but is unprivileged for operations outside the -namespace. +outside a user namespace while at the same time having a user ID of 0 inside the +namespace; in other words, the process has full privileges for operations inside +the user namespace, but is unprivileged for operations outside the namespace. 
The goal of supporting user namespaces in Kubernetes is to be able to run processes in pods with a different user and group IDs than in the host. -Specifically, a privileged process in the pod runs as an unprivileged process in the -host. If such a process is able to break out of the container to the host, it'll have limited -impact as it'll be running as an unprivileged user there. +Specifically, a privileged process in the pod runs as an unprivileged process in +the host. If such a process is able to break out of the container to the host, +it'll have limited impact as it'll be running as an unprivileged user there. -The following security vulnerabilities were mitigated with -user namespaces and it is expected that using them would mitigate against some of the future vulnerabilities. +The following security vulnerabilities were mitigated with user namespaces and +it is expected that using them would mitigate against some of the future +vulnerabilities. - CVE-2016-8867: Privilege escalation inside containers - https://github.com/moby/moby/issues/27590 - CVE-2018-15664: TOCTOU race attack that allows to read/write files in the host @@ -254,10 +255,10 @@ know that this has succeeded? - Increase node to pod isolation in Kubernetes by mapping user and group IDs inside the container to different IDs in the host. In particular, mapping root inside the container to unprivileged user and group IDs in the node. -- Make it safer to run workloads that need highly privileged capabilities such as - `CAP_SYS_ADMIN`, reducing the risk of impacting the host. -- Benefit from the security hardening that user namespaces - provide against some of the future unknown runtime and kernel vulnerabilities. +- Make it safer to run workloads that need highly privileged capabilities such + as `CAP_SYS_ADMIN`, reducing the risk of impacting the host. +- Benefit from the security hardening that user namespaces provide against some + of the future unknown runtime and kernel vulnerabilities. 
### Non-Goals @@ -287,7 +288,8 @@ nitty-gritty. --> This proposal aims to support user namespaces in Kubernetes by extending the pod -specification with a new `userNamespaceMode` field. This field can have 3 values: +specification with a new `userNamespaceMode` field. This field can have 3 +values: - **Host**: The pods are placed in the host user namespace, this is the current Kubernetes @@ -331,7 +333,8 @@ capability to grant me any extra privilege on the host. #### Story 2 As a cluster admin, I want to allow some pods to run in the host user namespace -if they need a feature only available in such user namespace, such as loading a kernel module with `CAP_SYS_MODULE`. +if they need a feature only available in such user namespace, such as loading a +kernel module with `CAP_SYS_MODULE`. ### Notes/Constraints/Caveats @@ -346,20 +349,22 @@ This might be a good place to talk about core concepts and how they relate. The Linux kernel uses the effective user and group IDs (the ones the host) to check the file access permissions. Since with user namespaces IDs are mapped to -a different value on the host, this causes issues accessing volumes if the -pod is run with a different mapping, i.e. the effective user and group IDs on -the host change. +a different value on the host, this causes issues accessing volumes if the pod +is run with a different mapping, i.e. the effective user and group IDs on the +host change. -This proposal supports volumes without changing the user and group IDs and leaves -that problem to the user to manage. Future Linux kernel features such as shiftfs -could allow different pods to see a volume with its own IDs but it is out of -scope of this proposal. Among the possible future kernel solutions, we can list: +This proposal supports volumes without changing the user and group IDs and +leaves that problem to the user to manage. 
Future Linux kernel features such as +shiftfs could allow different pods to see a volume with its own IDs but it is +out of scope of this proposal. Among the possible future kernel solutions, we +can list: - [shiftfs: uid/gid shifting filesystem](https://lwn.net/Articles/757650/) - [A new API for mounting filesystems](https://lwn.net/Articles/753473/) - [user_namespace: introduce fsid mappings](https://lwn.net/Articles/812221/) -In regard to this proposal, volumes can be divided in ephemeral and non-ephemeral. +In regard to this proposal, volumes can be divided in ephemeral and +non-ephemeral. Ephemeral volumes are associated to a **single** pod and their lifecyle is dependent on that pod. These are `configMap`, `secret`, `emptyDir`, @@ -372,21 +377,21 @@ the file ownership too. Non-ephemeral volumes are more difficult to support since they can be persistent and shared by multiple pods. This proposal supports volumes with two different strategies: -- The `Cluster` mode makes it easier for pods to share files using volumes when those - don't have access permissions for `others` because the effective user and - group IDs on the host are the same for all the pods. +- The `Cluster` mode makes it easier for pods to share files using volumes when + those don't have access permissions for `others` because the effective user + and group IDs on the host are the same for all the pods. - The semantics of semantics of `fsGroup` are respected, if it's specified it's assumed to be the correct GID in the host and a 1-to-1 mapping entry for the `fsGroup` is added to the GID mappings for the pod. This KEP doesn't impose any restriction on the different volumes and `userNamespaceMode` combinations and leaves it to users to chose the correct -combinations based on their specific needs. -For instance, if a pod access a shared volume containing files and folders with -permissions for `others`, it can run in `Pod` mode. 
On the other hand, a process -inside a pod will not be able to access files with mode `0700` and -owned by a user different than the effective user of that process in a volume -that doesn't support the semantics of `fsGroup` (doesn't support +combinations based on their specific needs. For instance, if a pod access a +shared volume containing files and folders with permissions for `others`, it can +run in `Pod` mode. On the other hand, a process inside a pod will not be able to +access files with mode `0700` and owned by a user different than the effective +user of that process in a volume that doesn't support the semantics of `fsGroup` +(doesn't support [`SetVolumeOwnership`](https://github.com/kubernetes/kubernetes/blob/00da04ba23d755d02a78d5021996489ace96aa4d/pkg/volume/volume_linux.go#L42) that updates permissions and ownership of the files to be accesible by the `fsGroup` group ID). Such pods should be run in `Host` mode. @@ -407,7 +412,7 @@ that updates permissions and ownership of the files to be accesible by the [this](https://github.com/kinvolk/containerd-cri/commits/mauricio/userns_poc) PoC. - **cri-o**: -CRI-O recently [added](https://github.com/cri-o/cri-o/pull/3944) support for + CRI-O recently [added](https://github.com/cri-o/cri-o/pull/3944) support for user namespaces through a pod annotation. The extensions to make it work with the CRI changes proposed here are small. - gVisor, katacontainers: It's still to investigate. @@ -435,8 +440,8 @@ Some features that don't work when the host user namespace is not shared are: - **Some Capabilities**: The Linux kernel takes into consideration the user namespace a process is running in while performing the capabilities check. There are some - capabilities that are only available in the initial (host) user namespace such as - `CAP_SYS_TIME`, `CAP_SYS_MODULE` & `CAP_MKNOD`. + capabilities that are only available in the initial (host) user namespace such + as `CAP_SYS_TIME`, `CAP_SYS_MODULE` & `CAP_MKNOD`. 
If a pod is given one of those capabilities it will still be deployed, but the capability will be ineffective and processes using those capabilities will @@ -456,8 +461,8 @@ Some features that don't work when the host user namespace is not shared are: - Mounting `mqueue` (`/dev/mqueue`) is not allowed from a process in a user namespace that does not own the IPC namespace. Pods with `hostIPC=true` and `userNamespaceMode=Pod|Cluster` can fail. - - Mounting `procfs` (`/proc`) is not allowed from a process in a user namespace - that does not own the PID namespace. Pods with `hostPID=true` and + - Mounting `procfs` (`/proc`) is not allowed from a process in a user + namespace that does not own the PID namespace. Pods with `hostPID=true` and `userNamespaceMode=Pod|Cluster` can fail. - Mounting `sysfs` (`/sys`) is not allowed from a process in a user namespace that does not own the network namespace. Impact: pods with @@ -465,11 +470,12 @@ Some features that don't work when the host user namespace is not shared are: If users specify `userNamespaceMode=Pod|Cluster` and one of these `host{IPC,PID,Network}=true` options, runc will currently fail to start the - container. The kubelet does **not** try to prevent that combination of options, - in case runc or the kernel make it possible in the future to use that + container. The kubelet does **not** try to prevent that combination of + options, in case runc or the kernel make it possible in the future to use that combination. -In order to avoid breaking existing workloads `Host` is the default value of `userNamespaceMode`. +In order to avoid breaking existing workloads `Host` is the default value of +`userNamespaceMode`. #### Duplicated Snapshots of Container Images @@ -509,11 +515,11 @@ on a specific container image. ## Implementation Phases -The implementation of this KEP in a single phase is complicated as there are many -discussions to be done. 
We learned from previous attempts to bring this support in -that it should be done in small steps to avoid losing the focus on the -discussion. It's also true that a full plan should be agreed at the beginning to -avoid changing the implementation drastically in further phases. +The implementation of this KEP in a single phase is complicated as there are +many discussions to be done. We learned from previous attempts to bring this +support in that it should be done in small steps to avoid losing the focus on +the discussion. It's also true that a full plan should be agreed at the +beginning to avoid changing the implementation drastically in further phases. This proposal implementation aims to be divided in the following phases: @@ -532,8 +538,8 @@ this feature. ### Future Phases -These phases aim to implement the `Pod` mode. After these phases are completed the -full advantanges of user namespaces can be used in some cases (stateless +These phases aim to implement the `Pod` mode. After these phases are completed +the full advantanges of user namespaces can be used in some cases (stateless workloads). There are some things that have to be studied with more detail for these @@ -554,7 +560,9 @@ phase(s) but are not needed for phase 1, hence they are not discussed in detail: algorithm has to be changed as image snapshots shold be deleted as soon as the container finihes. - **ID Mappings Allocation Algorithm** - The `Pod` mode requires to have each pod in different and non-overlapping ID mapping. It requires to implement an algorithm that performs that allocation. There are some open questions about it: + The `Pod` mode requires to have each pod in different and non-overlapping ID + mapping. It requires to implement an algorithm that performs that allocation. + There are some open questions about it: - What should be the length of the mapping assigned to each pod? - How to get the ID mapping range of a running pod when kubelet crashes? 
- Can the user specify the ID mappings for a pod? @@ -580,12 +588,12 @@ required) or even code snippets. If there's any ambiguity about HOW your proposal will be implemented, this is the place to discuss them. --> - This section only focuses on phase 1 as specified above. ### Summary of the Proposed Changes -- Extend the CRI to have a user namespace mode and the user and group ID mappings. +- Extend the CRI to have a user namespace mode and the user and group ID + mappings. - Add a `userNamespaceMode` field to the pod spec. - Add the cluster-wide ID mappings to the kubelet configuration file. - Add a `UserNamespacesSupport` feature flag to enable / disable the user @@ -595,13 +603,12 @@ This section only focuses on phase 1 as specified above. ### CRI API Changes -The CRI is extended to (optionally) specify the user namespace mode -and the ID mappings for a pod. +The CRI is extended to (optionally) specify the user namespace mode and the ID +mappings for a pod. [`NamespaceOption`](https://github.com/kubernetes/cri-api/blob/1eae59a7c4dee45a900f54ea2502edff7e57fd68/pkg/apis/runtime/v1alpha2/api.proto#L228) is extended with two new fields: - A `user` `NamespaceMode` that defines if the pod should run in an independent - user namespace (`POD`) or if it should share the host user namespace - (`NODE`). + user namespace (`POD`) or if it should share the host user namespace (`NODE`). - The ID mappings to be used if the user namespace mode is `POD`. ``` @@ -690,8 +697,9 @@ type PodSecurityContext struct { ### Configuring the Cluster ID Mappings -This proposal considers two different ways to configure the ID mappings used for the `Cluster` mode. -This is for discussion with the community and only one will be considered. +This proposal considers two different ways to configure the ID mappings used for +the `Cluster` mode. This is for discussion with the community and only one will +be considered. 
<<[UNRESOLVED where to configure the cluster wide ID mappings ]>> @@ -741,7 +749,8 @@ This option considers setting this parameter on the kube-apiserver. **Cons**: - It's difficult to expose this parameter to the kubelet. - - The parameter could not be available for the kubelet if the kube-apiserver is down. + - The parameter could not be available for the kubelet if the kube-apiserver is + down. <<[/UNRESOLVED]>> @@ -783,8 +792,8 @@ Ephemeral volumes use [`AtomicWriter`](https://github.com/kinvolk/kubernetes/blob/master/pkg/volume/util/atomic_writer.go) to create the files that are mounted to the containers. This component [has some logic](https://github.com/kinvolk/kubernetes/blob/c94242a7b1d238cc27aea9b6d45ccb9963e814bb/pkg/volume/util/atomic_writer.go#L403) -to update the ownership of those files in some cases. It will be extended to take -the ID mappings into consideration when the pod runs in `Cluster` mode. +to update the ownership of those files in some cases. It will be extended to +take the ID mappings into consideration when the pod runs in `Cluster` mode. ### Test Plan @@ -912,18 +921,19 @@ enhancement: CRI or CNI may require updating that component before the kubelet. --> -The container runtime will have to be updated in the nodes to support this feature. +The container runtime will have to be updated in the nodes to support this +feature. The new `user` field in the `NamespaceOption` will be ignored by an old runtime without user namespaces support. The container will be placed in the host user namespace. It's a responsibility of the user to guarantee that a runtime supporting user namespaces is used when this feature is enabled. -An old version of kubelet (without user namespaces support) used with a new container runtime (with user namespaces support) can cause some -issues too. In this case the runtime can wrongly infer that the `user` field -is set to `POD` in the `NamespaceOption` message. 
To avoid this problem the -runtime should check if the `mappings` field contains any mappings, an error -should be raised otherwise. +An old version of kubelet (without user namespaces support) used with a new +container runtime (with user namespaces support) can cause some issues too. In +this case the runtime can wrongly infer that the `user` field is set to `POD` in +the `NamespaceOption` message. To avoid this problem the runtime should check if +the `mappings` field contains any mappings, an error should be raised otherwise. ## Production Readiness Review Questionnaire @@ -963,11 +973,11 @@ you need any help or guidance. * **Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?** - Yes, the `UserNamespacesSupport` feature gate has to be disabled and pods running in `Cluster` and `Pod` mode have to be recreated. - The effective user and group IDs of the processes would be different - before and after disabling the feature for pods running in `Cluster` and `Pod` - modes. This can cause access issues to pods accessing files saved in - volumes. + Yes, the `UserNamespacesSupport` feature gate has to be disabled and pods + running in `Cluster` and `Pod` mode have to be recreated. The effective user + and group IDs of the processes would be different before and after disabling + the feature for pods running in `Cluster` and `Pod` modes. This can cause + access issues to pods accessing files saved in volumes. * **What happens if we reenable the feature if it was previously rolled back?** The situation is very similar to the described above. The pod will be able to @@ -1089,20 +1099,20 @@ proposal there are some big differences: This proposal intends to have `Host` instead of `Pod` as default value for the user namespace mode. The rationale behind this decision is that it avoids breaking existing workloads that don't work with user namespaces. 
We are aware -that this decision has the drawback that pods that don't set the `userNamespaceMode` -will not have the security advantages of user namespaces, however we consider -it's more important to keep compatibility with previous workloads. +that this decision has the drawback that pods that don't set the +`userNamespaceMode` will not have the security advantages of user namespaces, +however we consider it's more important to keep compatibility with previous +workloads. ### Host Defaulting Mechanishm Previous proposals like [Node-Level UserNamespace implementation](https://github.com/kubernetes/kubernetes/pull/64005) had a -mechanism to default to the host user namespace when the pod specification includes -features that could be not compatible with user namespaces (similar to [Default host user -namespace via experimental -flag](https://github.com/kubernetes/kubernetes/pull/31169)). -This proposal doesn't require a similar mechanishm given that the default mode -is `Host`. +mechanism to default to the host user namespace when the pod specification +includes features that could be not compatible with user namespaces (similar to +[Default host user namespace via experimental +flag](https://github.com/kubernetes/kubernetes/pull/31169)). This proposal +doesn't require a similar mechanishm given that the default mode is `Host`. 
## References From 957489795ecd8bd69c084e217779a657d5af8b8d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Thu, 8 Oct 2020 14:46:57 -0500 Subject: [PATCH 82/89] extend metacopy=on with specific crio info --- .../127-usernamespaces-support/README.md | 24 ++++++++++--------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 29388981ad0..9ef563efa52 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -488,17 +488,19 @@ The current implementation in container runtimes is to recursively perform a as it potentially increases the time and the storage needed to handle the container images. -Some runtimes, like cri-o, mitigate these problems by using the `metacopy` -option of overlayfs. This option avoids copying the whole file content when an -operation updating only the metadata, like `chown` or `chmod`, is performed. -This solution could be adopted by other runtimes until a more sophisticated -approach is implemented in the kernel. [We -talked](https://lists.linuxfoundation.org/pipermail/containers/2020-September/042230.html) -to the Linux kernel community and [they -replied](https://lists.linuxfoundation.org/pipermail/containers/2020-September/042230.html) -they are working on a solution for it. If the Linux kernel provides a solution -for this problem, that would be something that container runtimes should use. It -does not impact the kubelet nor the CRI gRPC spec. +[containers/storage](https://github.com/containers/storage/) used by CRI-O +mounts an overlay file system with the +[`metacopy=on`](https://www.kernel.org/doc/html/latest/filesystems/overlayfs.html#metadata-only-copy-up) +flag set, it then chowns all of the lower files in the image to match the user +namespace to which the container will run. 
This operation is very quick compared +to standard chowning, since none of the files data has to be copied up. If a +second container runs on the same image with the same user namespace, then the +chowned image is shared, eliminating the need to chown again. + +More sophisticated approaches to this problem are being +[discussed](https://lists.linuxfoundation.org/pipermail/containers/2020-September/042230.html) +in the kernel community. This is something that container runtimes should use +once it's available and it does not impact the kubelet nor the CRI gRPC spec. #### Container Images with High IDs From d32d36516c8daf400c5d143a283382e9272dfa98 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Fri, 16 Oct 2020 07:30:12 -0500 Subject: [PATCH 83/89] to -> for --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 9ef563efa52..b7cbd5e5854 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -354,7 +354,7 @@ is run with a different mapping, i.e. the effective user and group IDs on the host change. This proposal supports volumes without changing the user and group IDs and -leaves that problem to the user to manage. Future Linux kernel features such as +leaves that problem for the user to manage. Future Linux kernel features such as shiftfs could allow different pods to see a volume with its own IDs but it is out of scope of this proposal. 
Among the possible future kernel solutions, we can list: From 94471d5f2317464133fda9143b4ba86aa531133d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Fri, 16 Oct 2020 07:33:35 -0500 Subject: [PATCH 84/89] reword around shitfts --- keps/sig-node/127-usernamespaces-support/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index b7cbd5e5854..3677f9b6956 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -354,9 +354,9 @@ is run with a different mapping, i.e. the effective user and group IDs on the host change. This proposal supports volumes without changing the user and group IDs and -leaves that problem for the user to manage. Future Linux kernel features such as +leaves that problem for the user to manage. Developing Linux kernel features such as shiftfs could allow different pods to see a volume with its own IDs but it is -out of scope of this proposal. Among the possible future kernel solutions, we +not available yet. 
Among the possible future kernel solutions, we can list: - [shiftfs: uid/gid shifting filesystem](https://lwn.net/Articles/757650/) From 066ab176cd3ad45ffea716bf504bff12a94ae8c4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Fri, 16 Oct 2020 07:35:05 -0500 Subject: [PATCH 85/89] straigtforward -> straightforward --- keps/sig-node/127-usernamespaces-support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 3677f9b6956..b4e4f5cd180 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -407,7 +407,7 @@ that updates permissions and ownership of the files to be accesible by the deprecated]((https://github.com/kubernetes/enhancements/pull/1985/)) and it offers a very limited support for user namespaces. - **containerd**: - It's quite straigtforward to implement the CRI changes proposed below in + It's quite straightforward to implement the CRI changes proposed below in containerd/cri, we did it in [this](https://github.com/kinvolk/containerd-cri/commits/mauricio/userns_poc) PoC. 
From 7dfec3326288197acfe142e71744c36f80c0a359 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Fri, 16 Oct 2020 07:36:47 -0500 Subject: [PATCH 86/89] fix some small problems around --- keps/sig-node/127-usernamespaces-support/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index b4e4f5cd180..dec7936435a 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -415,7 +415,7 @@ that updates permissions and ownership of the files to be accesible by the CRI-O recently [added](https://github.com/cri-o/cri-o/pull/3944) support for user namespaces through a pod annotation. The extensions to make it work with the CRI changes proposed here are small. -- gVisor, katacontainers: It's still to investigate. +- gVisor, katacontainers: Yet to be investigated. containerd and cri-o will provide support for the 3 possible values of `userNamespaceMode`. @@ -558,7 +558,7 @@ phase(s) but are not needed for phase 1, hence they are not discussed in detail: snapshots](#duplicated-snapshots-of-container-images) issue as it's possible that a pod uses a unique ID mapping each time it's scheduled. The different runtimes will have to use solutions like `metacopy` option of overlayfs or new - kernel features to overcome it. It's also likely that the garbage collection + kernel features to overcome it. It's also likely that the kubelet image garbage collection algorithm has to be changed as image snapshots shold be deleted as soon as the container finihes. - **ID Mappings Allocation Algorithm** @@ -1036,7 +1036,7 @@ the existing API objects?** Yes. The PodSpec will be increased. * **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs]?** Yes. 
The startup latency of both stateless and stateful pods is increased as - the rhe runtime has to set correct ownership for the container image before + the runtime has to set correct ownership for the container image before starting them. * **Will enabling / using this feature result in non-negligible increase of From d282c793b294f6aeb3f9e776845635fba46c0ceb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Fri, 16 Oct 2020 07:51:28 -0500 Subject: [PATCH 87/89] remove shiftfs non-goal --- keps/sig-node/127-usernamespaces-support/README.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index dec7936435a..cadbc52d8c3 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -272,10 +272,6 @@ and make progress. [usernetes](https://github.com/rootless-containers/usernetes) and this KEP both make use of user namespaces, it is a different implementation for a different purpose. -- Supporting shiftfs or a similar solution once it's available in the kernel. - Although the authors of this KEP would like to support this feature once it's - available, this is out of scope of this KEP. The complexity of this would - probably require to write a separate KEP. 
## Proposal From 2cf4d8badd320887ca021d1db18afffb6fdcd4f6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Fri, 16 Oct 2020 08:07:00 -0500 Subject: [PATCH 88/89] remove block comments --- .../127-usernamespaces-support/README.md | 306 +----------------- 1 file changed, 5 insertions(+), 301 deletions(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index cadbc52d8c3..2019eb07205 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -1,81 +1,5 @@ - # KEP-127: Support User Namespaces - - - - - [Release Signoff Checklist](#release-signoff-checklist) - [Summary](#summary) @@ -132,20 +56,6 @@ tags, and then generate with `hack/update-toc.sh`. ## Release Signoff Checklist - - Items marked with (R) are required *prior to targeting to a milestone / release*. - [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) @@ -159,10 +69,6 @@ Items marked with (R) are required *prior to targeting to a milestone / release* - [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] - [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes - - [kubernetes.io]: https://kubernetes.io/ [kubernetes/enhancements]: https://git.k8s.io/enhancements [kubernetes/kubernetes]: https://git.k8s.io/kubernetes @@ -170,25 +76,6 @@ Items marked with (R) are required *prior to targeting to a milestone / release* ## Summary - - Container security consists of many different kernel features that work together to make containers secure. User namespaces isolate user and group IDs by allowing processes to run with different IDs in the container and in the host. @@ -210,15 +97,6 @@ different and non-overlapping ID mapping. 
## Motivation - - From [user_namespaces(7)](https://man7.org/linux/man-pages/man7/user_namespaces.7.html): > User namespaces isolate security-related identifiers and attributes, in @@ -247,11 +125,6 @@ vulnerabilities. ### Goals - - - Increase node to pod isolation in Kubernetes by mapping user and group IDs inside the container to different IDs in the host. In particular, mapping root inside the container to unprivileged user and group IDs in the node. @@ -262,11 +135,6 @@ know that this has succeeded? ### Non-Goals - - - Provide a way to run the kubelet process or container runtimes as an unprivileged process. Although initiatives like [usernetes](https://github.com/rootless-containers/usernetes) and this KEP @@ -275,14 +143,6 @@ and make progress. ## Proposal - - This proposal aims to support user namespaces in Kubernetes by extending the pod specification with a new `userNamespaceMode` field. This field can have 3 values: @@ -312,13 +172,6 @@ values: ### User Stories - - #### Story 1 As a cluster admin, I want run some pods with privileged capabilities because @@ -334,13 +187,6 @@ kernel module with `CAP_SYS_MODULE`. ### Notes/Constraints/Caveats - - #### Volumes Support The Linux kernel uses the effective user and group IDs (the ones the host) to @@ -350,10 +196,10 @@ is run with a different mapping, i.e. the effective user and group IDs on the host change. This proposal supports volumes without changing the user and group IDs and -leaves that problem for the user to manage. Developing Linux kernel features such as -shiftfs could allow different pods to see a volume with its own IDs but it is -not available yet. Among the possible future kernel solutions, we -can list: +leaves that problem for the user to manage. Developing Linux kernel features +such as shiftfs could allow different pods to see a volume with its own IDs but +it is not available yet. 
Among the possible future kernel solutions, we can +list: - [shiftfs: uid/gid shifting filesystem](https://lwn.net/Articles/757650/) - [A new API for mounting filesystems](https://lwn.net/Articles/753473/) @@ -417,18 +263,6 @@ containerd and cri-o will provide support for the 3 possible values of `userName ### Risks and Mitigations - - #### Breaking Existing Workloads Some features that don't work when the host user namespace is not shared are: @@ -579,13 +413,6 @@ phase(s) but are not needed for phase 1, hence they are not discussed in detail: ## Design Details - - This section only focuses on phase 1 as specified above. ### Summary of the Proposed Changes @@ -817,61 +644,6 @@ TBD ### Graduation Criteria - - #### Alpha - [ ] Support for `Cluster` and `Host` modes implemented. @@ -892,33 +664,8 @@ TDB ### Upgrade / Downgrade Strategy - - ### Version Skew Strategy - - The container runtime will have to be updated in the nodes to support this feature. @@ -935,30 +682,6 @@ the `mappings` field contains any mappings, an error should be raised otherwise. ## Production Readiness Review Questionnaire - - ### Feature Enablement and Rollback * **How can this feature be enabled / disabled in a live cluster?** @@ -1050,32 +773,13 @@ Will be added before transition to beta. 
## Implementation History - ## Drawbacks - - ## Alternatives - - ### Differences with Previous Proposal + Even if this KEP is heavily based on the previous [Support Node-Level User Namespaces Remapping](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/node-usernamespace-remapping.md) From 83c5674c05b88e662e73f1078a582c00aacc5410 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mauricio=20V=C3=A1squez?= Date: Fri, 16 Oct 2020 08:09:27 -0500 Subject: [PATCH 89/89] update creation dates --- keps/sig-node/127-usernamespaces-support/README.md | 1 + keps/sig-node/127-usernamespaces-support/kep.yaml | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/keps/sig-node/127-usernamespaces-support/README.md b/keps/sig-node/127-usernamespaces-support/README.md index 2019eb07205..6010d42d149 100644 --- a/keps/sig-node/127-usernamespaces-support/README.md +++ b/keps/sig-node/127-usernamespaces-support/README.md @@ -773,6 +773,7 @@ Will be added before transition to beta. ## Implementation History +- 2020-10-16: Initial proposal submitted. ## Drawbacks diff --git a/keps/sig-node/127-usernamespaces-support/kep.yaml b/keps/sig-node/127-usernamespaces-support/kep.yaml index 111dc604659..4290f9dc70a 100644 --- a/keps/sig-node/127-usernamespaces-support/kep.yaml +++ b/keps/sig-node/127-usernamespaces-support/kep.yaml @@ -7,7 +7,7 @@ authors: owning-sig: sig-node participating-sigs: [] status: provisional -creation-date: 2020-07-21 +creation-date: 2020-10-16 reviewers: - "@mrunalp" approvers: