Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Linux Network Devices #4538

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

aojea
Copy link

@aojea aojea commented Nov 21, 2024

Draft implementation of opencontainers/runtime-spec#1271

It implements the new proposal to the OCI spec to be able to specify Network Devices that get attached/detached from the containers

@aojea aojea force-pushed the netdevices branch 2 times, most recently from 07d3b0b to 3833056 Compare December 2, 2024 15:40
Copy link

@kad kad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. We are also interested in this use case for our accelerator devices.

@aojea aojea force-pushed the netdevices branch 2 times, most recently from 67f12e0 to d114afe Compare December 12, 2024 07:15
@aojea aojea force-pushed the netdevices branch 2 times, most recently from ec90a02 to f4f5d02 Compare December 20, 2024 12:11
@aojea aojea force-pushed the netdevices branch 2 times, most recently from 735f9d5 to ce1f612 Compare January 14, 2025 09:47
@aojea aojea force-pushed the netdevices branch 4 times, most recently from 6262c5e to c530772 Compare February 6, 2025 09:55
@aojea aojea force-pushed the netdevices branch 5 times, most recently from 4380e86 to f53d263 Compare February 10, 2025 21:27
@rata
Copy link
Member

rata commented Feb 25, 2025

@aojea friendly reminder that this should be ready soon, we are cutting 1.3.0-rc.1 soon (maybe this week). The spec part is still not sure when it will be merged?

@aojea
Copy link
Author

aojea commented Feb 28, 2025

@aojea friendly reminder that this should be ready soon, we are cutting 1.3.0-rc.1 soon (maybe this week). The spec part is still not sure when it will be merged?

@rata I'm not familiar with the processes in this ecosystem, should I do something? I'm waiting for the spec to merge

@rata
Copy link
Member

rata commented Mar 3, 2025

@aojea I don't think it's much different from k8s or most projects. I think you need to drive alignment to merge the runtime-spec PR. But we can't merge this in runc if it depends on a spec change.

@aojea
Copy link
Author

aojea commented Mar 3, 2025

@aojea I don't think it's much different from k8s or most projects. I think you need to drive alignment to merge the runtime-spec PR. But we can't merge this in runc if it depends on a spec change.

that is why I thought, thanks for confirming

Copy link
Member

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aojea I think the bug we were talking on slack is a bug in the validator, see the comments for more info.

Comment on lines 95 to 97
if config.Namespaces.IsPrivate(configs.NEWNET) {
return errors.New("unable to move network devices without a NET namespace")
}
Copy link
Member

@rata rata Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be wrong. This checks if it has a Path set. But this should work for any network namespace.

When you remove this, the rest of the code might fail if you need the netns fd. In that case, I think you should get the ns fs from /proc/<pid>/ns/net (you open that path and the fd you get is a netns fd), or even better if you can use the pidfd as that doesn't have races and those pains.

IMHO, we should implement both: use pidfd whenever possible and fallback to /proc/<pid>/ns/net when that is not available, and handle the races properly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be great if you can add a test that also uses a user namespace. See some examples here: tests/integration/userns.bats

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with user namespaces but the topic was also brought up during the review and it seems was decided to not included them, is not the network different on those environments? https://github.com/rootless-containers/rootlesskit/blob/master/docs/network.md

Copy link
Member

@rata rata Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aojea where was it brought up, do you have links? I'd like to see why it wasn't included.

The link you mention is about rootless containers, that is something different. That is to NOT run the runtime as root on the host. Rootless containers uses userns too, that is why it is confusing.

I meant the container created has a user namespace, the same way it has a network, pid, etc. namespace. It doesn't change at all how the runtime runs, it is a namespace use in the container created. The container runtime runs as root, etc. It's a completely different thing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahhhh ok, then I will add those tests too

@aojea aojea force-pushed the netdevices branch 2 times, most recently from f74c032 to 17c7056 Compare March 3, 2025 22:09
Comment on lines 80 to 89
if nsPath == "" {
if c.initProcess.pid() > 0 {
nsPath = fmt.Sprintf("/proc/%d/ns/net", c.initProcess.pid())
} else {
return nil
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rata I need help here, I can not find the best place to restore the network devices when the namespace is created by runc, it seems it has to be at one point where the init process is still running

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aojea what do you mean "the init process is still running"? Which process?

state_linux.go is probably to know the state of running containers, you should check to standard_init_linux.go, for example

@aojea aojea force-pushed the netdevices branch 2 times, most recently from 9f4b286 to 20f65c5 Compare March 3, 2025 22:37
@@ -48,6 +49,14 @@ func destroy(c *Container) error {
// Likely to fail when c.config.RootlessCgroups is true
_ = signalAllProcesses(c.cgroupManager, unix.SIGKILL)
}

// Likely to fail if the init process no longer exist.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only found this place to hook for restoring these devices, but since the start process may not exist , when the network namespace is not provided it is not able to find the network namespace ... please advice

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See this: #4538 (comment)

state_linux.go is not the right place, for sure. What I say in the comment seems to have a better chance. But I don't know the details of moving the network interfaces to know exactly where is the right place to hook it up, sorry :(

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any point where we can get the network namespace and the initial process has not started yet

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you refer to with "the initial process" ? The container process? Or the runc init processes?

All runc code runs before the container process, once we exec the container runc is out of the picture.

Assuming you mean the container process, the namespaces are created in nsexec.c, there are a lot of things done after that in go too. Like the mountToRootfs(), etc.

I have not look in detail to suggest where to do it, but take a look and just add it somewhere. People will review and it can be moved in the worst case :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think about it, the spec has two subcommands that are relevant here: create and start.

creates creates all the ns, etc. and start starts the container process. I don't know when CNI is being called by upper container runtimes (like containerd or crio), but I guess we should move the interface at that point (whatever it is, create or start).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aojea ^

Copy link
Author

@aojea aojea Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know when CNI is being called by upper container runtimes (like containerd or crio), but I guess we should move the interface at that point (whatever it is, create or start)

in containerd the network namespace is created and the CNI is called to provide IPs and the network interface, once that is set the containers starts to be created

image

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the setup seems to be in the correct place now, before the Prestart hook

aojea added 2 commits March 7, 2025 14:44
Change-Id: Id54c22cb5652c77b4a3c3bd4e46e96388be1cc11
Signed-off-by: Antonio Ojea <aojea@google.com>
Implement support for passing Linux Network Devices to the container
network namespace.

The network device is passed during the creation of the container,
before the process is started.

It implements the logic defined in the OCI runtime specification.

Change-Id: I190d5c444f03e65bbbfe5b4bc90809c0ad0a2017
Signed-off-by: Antonio Ojea <aojea@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants