-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nsenter: correctly handle pidns orphaning #976
Conversation
This avoids us from running into cases where libcontainer thinks that a particular namespace file is a different type, and makes it a fatal error rather than causing broken functionality. Signed-off-by: Aleksa Sarai <asarai@suse.de>
Depending on your SELinux setup, the order in which you join namespaces can be important. In general, user namespaces should *always* be joined and unshared first because then the other namespaces are correctly pinned and you have the right priviliges within them. This also is very useful for rootless containers, as well as older kernels that had essentially broken unshare(2) and clone(2) implementations. This also includes huge refactorings in how we spawn processes for complicated reasons that I don't want to get into because it will make me spiral into a cloud of rage. The reasoning is in the giant comment in clone_parent. Have fun. In addition, because we now create multiple children with CLONE_PARENT, we cannot wait for them to SIGCHLD us in the case of a death. Thus, we have to resort to having a child kindly send us their exit code before they die. Hopefully this all works okay, but at this point there's not much more than we can do. Signed-off-by: Aleksa Sarai <asarai@suse.de>
In general, it is a bad idea to be unmapped inside a user namespace at any point (especially when euid=[kuid 0]) as it can lead to security vulnerabilities. Also, in certain SELinux setups you must also be mapped in your user namespace when unsharing other namespaces. Deal with all of this by parsing the {uid,gid}maps and then setresuid(2) to the right user before and after the critical unshare(CLONE_NEWUSER) (as well as dealing with setns(2) by changing user to the owner of the namespace file we're joining). Fixes: CVE-2015-8709 Reported-by: Andrey Vagin <avagin@virtuozzo.com> Reported-by: Mrunal Patel <mpatel@redhat.com> Signed-off-by: Aleksa Sarai <asarai@suse.de>
@opencontainers/runc-maintainers This sort of works. However, because of the new feature (that Is there a nicer way of handling the exit of a child exec'd process other than just getting an |
Similar to the already existing proc functions, these just get different fields from the relevant place in /proc/<pid>/stat. This is part of the nsenter rewrite patchset (namely the pidns-orphaning part). A nice extension to this would be to use inotify and make libcontainer/system/proc.go entirely chan based. Signed-off-by: Aleksa Sarai <asarai@suse.de>
Due to how parent processes are treated in relation to PID namespaces and zombie reaping[1], it is necessary for attaching (setns) processes to double-fork inside the PID namespace to orphan themselves. In addition, this also changes the PID passing code to use the PID namespace translation features of SCM_CREDENTIALS. [1]: https://lwn.net/Articles/532748/ Signed-off-by: Aleksa Sarai <asarai@suse.de>
Because we are no longer the parents of exec'd processes, we need to handle exiting and such things quite differently. There are two main cases we need to deal with: 1. Now that the exec'd process is no longer a child of runC, we cannot wait4 the process. This means that we lose some information (the exit code, and signals when it dies), which means that we have to explicitly handle the death by polling /proc/<pid>/stat. 2. In addition, because of how the above solution works, we also have to include a wait4 goroutine to clean up all of the nsenter processes which would be cleaned up by a simple process.Wait() but are no longer that simple in the new setup. All in all, I'm very sorry. I hope that people debugging this in the future will forgive our hubris. Signed-off-by: Aleksa Sarai <asarai@suse.de>
This ensures that we don't regress on the current setup, where we are correctly setting up processes using `runc exec` such that the process' parent is correctly reset to PID 1 inside the container. Signed-off-by: Aleksa Sarai <asarai@suse.de>
Alright, I've fixed all of the issues through some fairly nasty code in 247d355 ( EDIT: Scract the "all the issues" part. Looks like some other things are broken now because of my |
} | ||
|
||
parts := strings.Split(string(data), " ") | ||
// the state field is located in pos 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PID of the parent filed is located in pos4. so s/state/parent pid/
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, but note that this PR is in quite a lot of flux (it doesn't really work at the moment and I'm trying to figure out why the hell processes are so hard on Linux). So I'd recommend holding off on spending a lot of time reviewing this while I'm still writing this code. :D
return parts[3-1], nil // starts at 1 | ||
} | ||
|
||
// GetProcessParent reads reads /proc/<pid>/stat to determine the parent of a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two reads
words, should delete one.
@@ -548,8 +584,13 @@ void nsexec(void) | |||
bail("missing cloneflags"); | |||
|
|||
/* Pipe so we can tell the child when we've finished setting up. */ | |||
if (socketpair(AF_LOCAL, SOCK_STREAM, 0, syncpipe) < 0) | |||
if (socketpair(AF_LOCAL, SOCK_STREAM, 0, parentpipe) < 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment for function sendpid
and recvpid
using AF_UNIX
. Although AF_LOCAL
and AF_UNIX
are equal, I think it should be consistent.
@cyphar I merge ur fixed with runc of docker 1.12.5 but it does't work . An error was reported: |
I'll be honest, I'm not sure that this code can ever be completely sane. I'll need to think about this some more, and I've been very busy with other things recently. The big issue is that the polling system for |
@cyphar is this still relevant ? |
@dqminh I believe it's still relevant (we still have pidns orphaning problems) but I think that the exit-code problem is not really resolvable (and polling |
This might be doable with |
We need more kernel work for this. |
What's current status? |
The main change here is to make sure that new processes that join a container are correctly reparented. Read this LWN article for more information about what the issue is and how the solution needs to be structured.
TODO
:runc exec
doesn't exit after the subprocess exits because of this feature. We'll need to think up a new way of handling this...Fixes #971
Signed-off-by: Aleksa Sarai asarai@suse.de
Note: This is based on
#950, #977 and #975.