-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Linux Network Devices #4538
base: main
Are you sure you want to change the base?
Conversation
bdb31c1
to
0b771ca
Compare
07d3b0b
to
3833056
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. We are also interested in this use case for our accelerator devices.
67f12e0
to
d114afe
Compare
ec90a02
to
f4f5d02
Compare
735f9d5
to
ce1f612
Compare
6262c5e
to
c530772
Compare
4380e86
to
f53d263
Compare
@aojea friendly reminder that this should be ready soon, we are cutting 1.3.0-rc.1 soon (maybe this week). The spec part is still not sure when it will be merged? |
@aojea I don't think it's much different from k8s or most projects. I think you need to drive alignment to merge the runtime-spec PR. But we can't merge this in runc if it depends on a spec change. |
that is why I thought, thanks for confirming |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aojea I think the bug we were talking on slack is a bug in the validator, see the comments for more info.
if config.Namespaces.IsPrivate(configs.NEWNET) { | ||
return errors.New("unable to move network devices without a NET namespace") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be wrong. This checks if it has a Path
set. But this should work for any network namespace.
When you remove this, the rest of the code might fail if you need the netns fd. In that case, I think you should get the ns fs from /proc/<pid>/ns/net
(you open
that path and the fd you get is a netns fd), or even better if you can use the pidfd as that doesn't have races and those pains.
IMHO, we should implement both: use pidfd whenever possible and fallback to /proc/<pid>/ns/net
when that is not available, and handle the races properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be great if you can add a test that also uses a user namespace. See some examples here: tests/integration/userns.bats
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with user namespaces but the topic was also brought up during the review and it seems was decided to not included them, is not the network different on those environments? https://github.com/rootless-containers/rootlesskit/blob/master/docs/network.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aojea where was it brought up, do you have links? I'd like to see why it wasn't included.
The link you mention is about rootless containers, that is something different. That is to NOT run the runtime as root on the host. Rootless containers uses userns too, that is why it is confusing.
I meant the container created has a user namespace, the same way it has a network, pid, etc. namespace. It doesn't change at all how the runtime runs, it is a namespace use in the container created. The container runtime runs as root, etc. It's a completely different thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahhhh ok, then I will add those tests too
f74c032
to
17c7056
Compare
libcontainer/state_linux.go
Outdated
if nsPath == "" { | ||
if c.initProcess.pid() > 0 { | ||
nsPath = fmt.Sprintf("/proc/%d/ns/net", c.initProcess.pid()) | ||
} else { | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rata I need help here, I can not find the best place to restore the network devices when the namespace is created by runc, it seems it has to be at one point where the init process is still running
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aojea what do you mean "the init process is still running"? Which process?
state_linux.go is probably to know the state of running containers, you should check to standard_init_linux.go, for example
9f4b286
to
20f65c5
Compare
@@ -48,6 +49,14 @@ func destroy(c *Container) error { | |||
// Likely to fail when c.config.RootlessCgroups is true | |||
_ = signalAllProcesses(c.cgroupManager, unix.SIGKILL) | |||
} | |||
|
|||
// Likely to fail if the init process no longer exist. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only found this place to hook for restoring these devices, but since the start process may not exist , when the network namespace is not provided it is not able to find the network namespace ... please advice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See this: #4538 (comment)
state_linux.go is not the right place, for sure. What I say in the comment seems to have a better chance. But I don't know the details of moving the network interfaces to know exactly where is the right place to hook it up, sorry :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any point where we can get the network namespace and the initial process has not started yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you refer to with "the initial process" ? The container process? Or the runc init processes?
All runc code runs before the container process, once we exec the container runc is out of the picture.
Assuming you mean the container process, the namespaces are created in nsexec.c, there are a lot of things done after that in go too. Like the mountToRootfs(), etc.
I have not look in detail to suggest where to do it, but take a look and just add it somewhere. People will review and it can be moved in the worst case :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I think about it, the spec has two subcommands that are relevant here: create
and start
.
creates creates all the ns, etc. and start starts the container process. I don't know when CNI is being called by upper container runtimes (like containerd or crio), but I guess we should move the interface at that point (whatever it is, create or start).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aojea ^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know when CNI is being called by upper container runtimes (like containerd or crio), but I guess we should move the interface at that point (whatever it is, create or start)
in containerd the network namespace is created and the CNI is called to provide IPs and the network interface, once that is set the containers starts to be created
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the setup seems to be in the correct place now, before the Prestart hook
Change-Id: Id54c22cb5652c77b4a3c3bd4e46e96388be1cc11 Signed-off-by: Antonio Ojea <aojea@google.com>
Implement support for passing Linux Network Devices to the container network namespace. The network device is passed during the creation of the container, before the process is started. It implements the logic defined in the OCI runtime specification. Change-Id: I190d5c444f03e65bbbfe5b4bc90809c0ad0a2017 Signed-off-by: Antonio Ojea <aojea@google.com>
Draft implementation of opencontainers/runtime-spec#1271
It implements the new proposal to the OCI spec to be able to specify Network Devices that get attached/detached from the containers