nsenter: correctly handle pidns orphaning #976

cyphar · 2016-08-08T10:52:04Z

The main change here is to make sure that new processes that join a container are correctly reparented. Read this LWN article for more information about what the issue is and how the solution needs to be structured.

TODO:

- Double-fork inside stage 2 to get reparented to the container's init (if we're an exec process).
  - This is broken because we currently send the PID of the wrong process in stage 1.
- `runc exec` doesn't exit after the subprocess exits because of this feature. We'll need to think up a new way of handling this...

Fixes #971

Signed-off-by: Aleksa Sarai <asarai@suse.de>

Note: This is based on ~~#950~~, #977 and #975.

This prevents us from running into cases where libcontainer thinks that a particular namespace file is a different type, and makes it a fatal error rather than causing broken functionality.

Signed-off-by: Aleksa Sarai <asarai@suse.de>

Depending on your SELinux setup, the order in which you join namespaces can be important. In general, user namespaces should *always* be joined and unshared first because then the other namespaces are correctly pinned and you have the right privileges within them. This is also very useful for rootless containers, as well as older kernels that had essentially broken unshare(2) and clone(2) implementations.

This also includes huge refactorings in how we spawn processes for complicated reasons that I don't want to get into because it will make me spiral into a cloud of rage. The reasoning is in the giant comment in clone_parent. Have fun.

In addition, because we now create multiple children with CLONE_PARENT, we cannot wait for them to SIGCHLD us in the case of a death. Thus, we have to resort to having a child kindly send us their exit code before they die. Hopefully this all works okay, but at this point there's not much more that we can do.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
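To illustrate the ordering rule from the commit message above, here is a minimal Go sketch that moves user namespaces to the front of a join list while leaving the relative order of the others untouched. The `nsEntry` type, the `orderForJoin` function and the example paths are hypothetical and are not taken from libcontainer; the actual joining in this PR happens in C inside nsexec.c.

```go
package main

import (
	"fmt"
	"sort"
)

// nsEntry is a hypothetical description of one namespace to join.
type nsEntry struct {
	Type string // e.g. "user", "net", "pid"
	Path string // e.g. "/proc/<pid>/ns/net"
}

// orderForJoin returns a copy of entries with every user namespace first.
// The sort is stable, so all other namespaces keep their original order.
func orderForJoin(entries []nsEntry) []nsEntry {
	out := append([]nsEntry(nil), entries...)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Type == "user" && out[j].Type != "user"
	})
	return out
}

func main() {
	// Example paths only; PID 1234 is made up for illustration.
	ordered := orderForJoin([]nsEntry{
		{Type: "net", Path: "/proc/1234/ns/net"},
		{Type: "user", Path: "/proc/1234/ns/user"},
		{Type: "pid", Path: "/proc/1234/ns/pid"},
	})
	for _, ns := range ordered {
		fmt.Println(ns.Type, ns.Path)
	}
}
```

The caller would then call setns(2) on each entry in the returned order.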

In general, it is a bad idea to be unmapped inside a user namespace at any point (especially when euid=[kuid 0]) as it can lead to security vulnerabilities. Also, in certain SELinux setups you must be mapped in your user namespace when unsharing other namespaces.

Deal with all of this by parsing the {uid,gid}maps and then setresuid(2) to the right user before and after the critical unshare(CLONE_NEWUSER) (as well as dealing with setns(2) by changing user to the owner of the namespace file we're joining).

Fixes: CVE-2015-8709
Reported-by: Andrey Vagin <avagin@virtuozzo.com>
Reported-by: Mrunal Patel <mpatel@redhat.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
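As a rough illustration of the map-parsing step mentioned above, the following Go sketch parses text in the format of /proc/<pid>/uid_map (three columns: ID inside the namespace, ID outside, range length) and translates an inside ID to an outside ID. The `idMap` type and both function names are hypothetical; the PR itself does this work in C, and this sketch does not call setresuid(2).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// idMap is one line of a uid_map or gid_map file.
type idMap struct {
	InsideID  int64 // first ID inside the user namespace
	OutsideID int64 // first ID outside the user namespace
	Length    int64 // number of consecutive IDs covered
}

// parseIDMap parses the full contents of a uid_map or gid_map file.
func parseIDMap(data string) ([]idMap, error) {
	var maps []idMap
	for _, line := range strings.Split(data, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) != 3 {
			return nil, fmt.Errorf("malformed map line %q", line)
		}
		var vals [3]int64
		for i, f := range fields {
			v, err := strconv.ParseInt(f, 10, 64)
			if err != nil {
				return nil, fmt.Errorf("malformed map line %q: %v", line, err)
			}
			vals[i] = v
		}
		maps = append(maps, idMap{vals[0], vals[1], vals[2]})
	}
	return maps, nil
}

// toOutside translates an inside ID. The boolean is false when the ID is
// unmapped, which is the state the commit message warns about.
func toOutside(maps []idMap, id int64) (int64, bool) {
	for _, m := range maps {
		if id >= m.InsideID && id < m.InsideID+m.Length {
			return m.OutsideID + (id - m.InsideID), true
		}
	}
	return 0, false
}

func main() {
	maps, err := parseIDMap("0 1000 1\n1 100000 65536\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(toOutside(maps, 0))
	fmt.Println(toOutside(maps, 70000))
}
```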

cyphar · 2016-10-04T13:55:26Z

@opencontainers/runc-maintainers This sort of works. However, because of the new feature (that exec processes are no longer children of the runC process) there are a lot of issues with waitProcess in the libcontainer integration tests that I can't quite crack. I've managed to sort-of fix the issues with runC (but it requires some not very nice hacks where I call os.Exit from a goroutine).

Is there a nicer way of handling the exit of a child exec'd process other than just getting an EIO from the stdio of the process and then exiting blindly?

Similar to the already existing proc functions, these just get different fields from the relevant place in /proc/<pid>/stat. This is part of the nsenter rewrite patchset (namely the pidns-orphaning part).

A nice extension to this would be to use inotify and make libcontainer/system/proc.go entirely chan based.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
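For readers unfamiliar with the layout of /proc/<pid>/stat, here is a small Go sketch of extracting the state (field 3) and the parent PID (field 4). It is not the code from this PR: the `procStat` type and `parseStat` function are hypothetical names. Unlike a plain split on spaces, it splits after the last `)` because the comm field (field 2) may itself contain spaces or parentheses.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// procStat holds the two fields discussed here.
type procStat struct {
	State string // field 3, e.g. "S", "R" or "Z"
	PPid  int    // field 4, PID of the parent process
}

// parseStat parses the contents of /proc/<pid>/stat.
func parseStat(data string) (procStat, error) {
	// Everything up to the last ')' is the PID and the comm field.
	end := strings.LastIndex(data, ")")
	if end < 0 {
		return procStat{}, fmt.Errorf("malformed stat line: %q", data)
	}
	// rest[0] is field 3 (state), rest[1] is field 4 (ppid).
	rest := strings.Fields(data[end+1:])
	if len(rest) < 2 {
		return procStat{}, fmt.Errorf("stat line too short: %q", data)
	}
	ppid, err := strconv.Atoi(rest[1])
	if err != nil {
		return procStat{}, fmt.Errorf("invalid ppid %q: %v", rest[1], err)
	}
	return procStat{State: rest[0], PPid: ppid}, nil
}

func main() {
	// A made-up stat line whose comm field contains spaces and parentheses.
	st, err := parseStat("4242 (my (odd) prog) S 1 4242 4242 0 -1 4194560")
	if err != nil {
		panic(err)
	}
	fmt.Println("state:", st.State, "ppid:", st.PPid)
}
```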

Due to how parent processes are treated in relation to PID namespaces and zombie reaping[1], it is necessary for attaching (setns) processes to double-fork inside the PID namespace to orphan themselves. In addition, this also changes the PID passing code to use the PID namespace translation features of SCM_CREDENTIALS.

[1]: https://lwn.net/Articles/532748/

Signed-off-by: Aleksa Sarai <asarai@suse.de>
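The SCM_CREDENTIALS mechanism mentioned above can be sketched in Go as follows. This is an illustration only: the PR implements sendpid and recvpid in C in nsexec.c, and the Go function names here (`sendPid`, `recvPid`, `roundTrip`) are hypothetical. The sketch sends credentials over a socketpair within a single process, so both ends share a PID namespace and the received PID equals the sender's own; when the two ends are in different PID namespaces, the kernel translates the PID for the receiver.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// sendPid sends this process's credentials as an SCM_CREDENTIALS message.
func sendPid(fd int) error {
	cred := &syscall.Ucred{
		Pid: int32(os.Getpid()),
		Uid: uint32(os.Getuid()),
		Gid: uint32(os.Getgid()),
	}
	return syscall.Sendmsg(fd, []byte{0}, syscall.UnixCredentials(cred), nil, 0)
}

// recvPid reads one message and returns the sender's PID as seen by the
// receiver. SO_PASSCRED must already be enabled on fd.
func recvPid(fd int) (int, error) {
	buf := make([]byte, 1)
	oob := make([]byte, syscall.CmsgSpace(syscall.SizeofUcred))
	_, oobn, _, _, err := syscall.Recvmsg(fd, buf, oob, 0)
	if err != nil {
		return 0, err
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return 0, err
	}
	if len(msgs) != 1 {
		return 0, fmt.Errorf("expected 1 control message, got %d", len(msgs))
	}
	cred, err := syscall.ParseUnixCredentials(&msgs[0])
	if err != nil {
		return 0, err
	}
	return int(cred.Pid), nil
}

// roundTrip sends our PID to ourselves over a socketpair.
func roundTrip() (sent, received int, err error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		return 0, 0, err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])
	// Ask the kernel to attach sender credentials on the receiving end.
	if err := syscall.SetsockoptInt(fds[1], syscall.SOL_SOCKET, syscall.SO_PASSCRED, 1); err != nil {
		return 0, 0, err
	}
	if err := sendPid(fds[0]); err != nil {
		return 0, 0, err
	}
	received, err = recvPid(fds[1])
	return os.Getpid(), received, err
}

func main() {
	sent, received, err := roundTrip()
	if err != nil {
		panic(err)
	}
	fmt.Println("sent:", sent, "received:", received)
}
```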

Because we are no longer the parents of exec'd processes, we need to handle exiting and such things quite differently. There are two main cases we need to deal with:

1. Now that the exec'd process is no longer a child of runC, we cannot wait4 the process. This means that we lose some information (the exit code, and signals when it dies), which means that we have to explicitly handle the death by polling /proc/<pid>/stat.

2. In addition, because of how the above solution works, we also have to include a wait4 goroutine to clean up all of the nsenter processes which would be cleaned up by a simple process.Wait() but are no longer that simple in the new setup.

All in all, I'm very sorry. I hope that people debugging this in the future will forgive our hubris.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
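The polling approach from the first case above can be sketched in Go roughly like this. It is not the code from commit 247d355; `pollExit` and `isGone` are hypothetical names, and the interval and timeout are arbitrary. The sketch treats a missing stat file, or a state of `Z` (zombie) or `X` (dead), as "the process has exited". As the commit message notes, this cannot recover the exit code, and it is also exposed to PID reuse between polls.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
	"syscall"
	"time"
)

// isGone reports whether a /proc/<pid>/stat line describes a process that
// has already died, i.e. its state field is "Z" or "X".
func isGone(stat string) bool {
	// Skip the comm field, which may contain spaces.
	end := strings.LastIndex(stat, ")")
	if end < 0 {
		return false
	}
	fields := strings.Fields(stat[end+1:])
	return len(fields) > 0 && (fields[0] == "Z" || fields[0] == "X")
}

// pollExit returns nil once the process is gone, or an error if it is still
// alive after timeout or if /proc cannot be read.
func pollExit(pid int, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	path := fmt.Sprintf("/proc/%d/stat", pid)
	for {
		data, err := os.ReadFile(path)
		switch {
		case errors.Is(err, os.ErrNotExist), errors.Is(err, syscall.ESRCH):
			return nil // no such process any more
		case err != nil:
			return err
		case isGone(string(data)):
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("pid %d still alive after %v", pid, timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Polling ourselves can only time out, which shows the "still alive" path.
	err := pollExit(os.Getpid(), 10*time.Millisecond, 50*time.Millisecond)
	fmt.Println(err)
}
```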

This ensures that we don't regress on the current setup, where we are correctly setting up processes using `runc exec` such that the process' parent is correctly reset to PID 1 inside the container.

Signed-off-by: Aleksa Sarai <asarai@suse.de>

cyphar · 2016-10-12T10:51:54Z

Alright, I've fixed all of the issues through some fairly nasty code in 247d355 ([wip] libcontainer: handle exit of exec'd processes). Unfortunately, I don't think there's a clean way of fixing it -- we won't ever get a SIGCHLD so we just have to parse everything and poll.

EDIT: Scratch the "all the issues" part. Looks like some other things are broken now because of my wait4 hack...

datawolf · 2016-10-13T08:51:33Z

libcontainer/system/proc.go

+	}
+
+	parts := strings.Split(string(data), " ")
+	// the state field is located in pos 4


The parent PID field is located in pos 4, so s/state/parent pid/.

cyphar · 2016-10-13

Thanks, but note that this PR is in quite a lot of flux (it doesn't really work at the moment and I'm trying to figure out why the hell processes are so hard on Linux). So I'd recommend holding off on spending a lot of time reviewing this while I'm still writing this code. :D

datawolf · 2016-10-13T08:53:06Z

libcontainer/system/proc.go

+	return parts[3-1], nil // starts at 1
+}
+
+// GetProcessParent reads reads /proc/<pid>/stat to determine the parent of a


The word "reads" appears twice; one should be deleted.

datawolf · 2016-10-13T09:13:00Z

libcontainer/nsenter/nsexec.c

@@ -548,8 +584,13 @@ void nsexec(void)
 		bail("missing cloneflags");

 	/* Pipe so we can tell the child when we've finished setting up. */
-	if (socketpair(AF_LOCAL, SOCK_STREAM, 0, syncpipe) < 0)
+	if (socketpair(AF_LOCAL, SOCK_STREAM, 0, parentpipe) < 0)


The comments for the sendpid and recvpid functions use AF_UNIX. Although AF_LOCAL and AF_UNIX are equivalent, I think this should be consistent.

sandyskies · 2017-01-23T09:02:56Z

@cyphar I merged your fix into the runc from Docker 1.12.5, but it doesn't work. An error was reported:
docker run -ti 427deb629ee5 bash
docker: Error response from daemon: containerd: container not started.

cyphar · 2017-01-23T09:19:07Z

I'll be honest, I'm not sure that this code can ever be completely sane. I'll need to think about this some more, and I've been very busy with other things recently. The big issue is that the polling system for /proc/[pid]/stat is just not a very good solution for tracking the death of a process, and it's just not fit-for-purpose if we want to get exit codes...

dqminh · 2017-08-01T11:31:34Z

@cyphar is this still relevant?

cyphar · 2017-08-01T11:33:43Z

@dqminh I believe it's still relevant (we still have pidns orphaning problems) but I think that the exit-code problem is not really resolvable (and polling /proc/$pid/stat is a horrible way of not completely solving the problem as well).

cyphar · 2017-10-10T13:57:07Z

This might be doable with cn_proc but this requires more kernel work to be possible as an unprivileged user. http://netsplit.com/the-proc-connector-and-socket-filters

cyphar · 2018-11-14T11:51:04Z

We need more kernel work for this.

AkihiroSuda · 2020-03-04T00:50:51Z

What's the current status?

GordonTheTurtle added the status/0-triage label Aug 8, 2016

cyphar mentioned this pull request Aug 8, 2016

nsenter: major cleanups #950

Merged

2 tasks

This was referenced Aug 19, 2016

Joining existing pid ns not reparenting process #971

Open

Disable the subreaper on exec #993

Merged

nsenter: guarantee correct user namespace ordering #977

Merged

cyphar mentioned this pull request Sep 16, 2016

--pid=container:<id> does not reparent zombies to pid 1 moby/moby#25348

Open

cyphar added 3 commits October 4, 2016 16:17

nsenter: specify namespace type in setns()

ed053a7

This prevents us from running into cases where libcontainer thinks that a particular namespace file is a different type, and makes it a fatal error rather than causing broken functionality.

Signed-off-by: Aleksa Sarai <asarai@suse.de>

cyphar added this to the 1.0.0 milestone Oct 4, 2016

cyphar added 4 commits October 12, 2016 21:51

datawolf reviewed Oct 13, 2016

dims mentioned this pull request Jan 24, 2017

[WIP] Trying PID Namespace sharing from Docker 1.12 release candidates kubernetes/kubernetes#29677

Closed

sandyskies mentioned this pull request Jul 14, 2017

Docker exec and Orphan Process problem moby/moby#29700

Closed

dims mentioned this pull request Jul 19, 2017

Shared PID and UTS namespaces kubernetes/kubernetes#1615

Closed

sandyskies mentioned this pull request Aug 30, 2017

fixed zombie process in exec process exiting. containerd/containerd#1430

Closed

cyphar mentioned this pull request Oct 10, 2017

psnotify: add exit code to ProcEventExit cloudfoundry/gosigar#40

Closed

cyphar removed this from the 1.0.0 milestone Nov 14, 2018

cyphar closed this May 20, 2021

cyphar deleted the nsenter-pidns-orphaning branch May 20, 2021 04:38
