Define Linux Network Devices #1271

Open
wants to merge 1 commit into
base: main

40 changes: 40 additions & 0 deletions config-linux.md
@@ -189,6 +189,44 @@ In addition to any devices configured with this setting, the runtime MUST also s
* [`/dev/ptmx`][pts.4].
A [bind-mount or symlink of the container's `/dev/pts/ptmx`][devpts].

## <a name="configLinuxNetworkDevices" />Network Devices

Linux network devices are entities that send and receive data packets.
Unlike block devices, they are not represented as files in the `/dev` directory; instead, they are represented by the [`net_device`][net_device] data structure in the Linux kernel.
A network device can belong to only one network namespace at a time and uses a set of operations distinct from regular file operations. Examples of network devices include Ethernet cards, loopback devices, and virtual devices such as bridges, VLANs, and MACVLANs.

This schema focuses solely on moving existing network devices identified by name from the host network namespace into the container network namespace. It does not cover the complexities of network device creation or network configuration, such as IP address assignment, routing, and DNS setup.

**`netDevices`** (object, OPTIONAL) set of network devices that MUST be made available in the container. The runtime is responsible for providing these devices; the underlying mechanism is implementation-defined.
Member

I think this should be an array of objects, where each object has a string field named `name`.

You want an array at some point, for the net devices, but here you say object and later you say string. I think here you want to say array of objects, like here: https://github.com/opencontainers/runtime-spec/pull/1271/files#diff-048d23d864e15683f516d2c1768965d546e87f8a59b2606cf2f2d52500ba5a32R127

OHH, you want two names, the host network interface and the name to assign in the container, right? Maybe you want an array of objects where each object has two fields: the name of the interface on the host and the name to assign to the container interface?

Member

I mean, you want an array and none of the types you list (not the free-text, but the name of the field with the type in parenthesis) is an array. I think some array is missing :)

Author

If you look at the previous proposal, it had multiple attributes. Leaving this as an object allows the API to be extended in a backwards-compatible way:

    "linux": {
        "netDevices": {
            "eth0": {
                "name": "container_eth0",
                "new_field_here": "asdasd"
            },
            "ens4": {},
            "ens5": {}
        }
    }

I still think an object representing a dictionary is better than an array, because it avoids duplicate names in the configuration that can cause runtime errors #1271 (comment)

Member

Oh, sorry, I missed this was a "set" ("set of network devices ...")


The runtime MUST check that it is possible to move the network interface to the container namespace and MUST [generate an error](runtime.md#errors) if the check fails.

The runtime MUST preserve the existing network interface attributes, like MTU, MAC and IP addresses, enabling users to preconfigure the interfaces.

The runtime MUST set the network device state to "up" after moving it to the network namespace to allow the container to send and receive network traffic through that device.

For proper container termination, the runtime must first set the device's state to "down" and then move it out of the namespace before the namespace is deleted. This ensures the device is inactive and avoids conflicts. If the container terminates abnormally and the runtime does not participate in the termination process, these steps might be skipped and the kernel handles the cleanup as described in [network_namespaces(7)][net_namespaces.7]: "When a network namespace is freed (i.e., when the last process in the namespace terminates), its physical network devices are moved back to the initial network namespace". Note that after a network namespace is deleted, all its migratable network devices are moved to the default network namespace, but virtual devices (veth, macvlan, ...) are destroyed.
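
The ordering in the paragraphs above (move, then "up" on start; "down", then move out on termination) can be sketched as follows. This is only an illustration: `linkOps`, `recorder`, `attach`, and `detach` are hypothetical names that are not part of this PR or of any real library, the namespace labels are placeholders, and moving the device back to the host namespace on detach is an assumption. The fake only records the calls it receives so the order can be inspected.

```go
package main

import "fmt"

// linkOps is a hypothetical abstraction over the operations a runtime would
// issue (for example via netlink). It is an assumption made for illustration.
type linkOps interface {
	MoveToNamespace(dev, ns string) error
	SetState(dev, ns string, up bool) error
}

// recorder is a fake linkOps that only logs the calls it receives.
type recorder struct{ calls []string }

func (r *recorder) MoveToNamespace(dev, ns string) error {
	r.calls = append(r.calls, fmt.Sprintf("move %s -> %s", dev, ns))
	return nil
}

func (r *recorder) SetState(dev, ns string, up bool) error {
	state := "down"
	if up {
		state = "up"
	}
	r.calls = append(r.calls, fmt.Sprintf("set %s %s in %s", dev, state, ns))
	return nil
}

// attach moves dev into the container namespace, then brings it up.
func attach(ops linkOps, dev string) error {
	if err := ops.MoveToNamespace(dev, "container"); err != nil {
		return fmt.Errorf("moving %s: %w", dev, err)
	}
	return ops.SetState(dev, "container", true)
}

// detach brings dev down first, then moves it out of the container namespace
// (assumed here to mean back to the host namespace).
func detach(ops linkOps, dev string) error {
	if err := ops.SetState(dev, "container", false); err != nil {
		return fmt.Errorf("setting %s down: %w", dev, err)
	}
	return ops.MoveToNamespace(dev, "host")
}

func main() {
	r := &recorder{}
	if err := attach(r, "eth0"); err != nil {
		panic(err)
	}
	if err := detach(r, "eth0"); err != nil {
		panic(err)
	}
	for _, c := range r.calls {
		fmt.Println(c)
	}
}
```

A real runtime would replace `recorder` with an implementation backed by the kernel; the sketch only fixes the order of operations.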

The name of the network device is the entry key.
Entry values are objects with the following properties:

* **`name`** *(string, OPTIONAL)* - the name of the network device inside the container namespace. If not specified, the device's name on the host is used. Network device names are unique per network namespace; if a network device with the same name already exists, the rename operation will fail. The runtime MAY check that the name is unique before the rename operation.
Member

Just curious, as I'm not very familiar with NRI and I don't know if this concern makes sense, please let me know. How can NRI plugins using this decide which container interface name to use? I mean, how do they choose one that won't clash with names potentially set by other plugins? Can they see what has been done so far by previous plugins? Or is this not an issue at all (in that case, can you explain briefly why? I'm curious :))

Author

In Kubernetes, with both main runtimes (containerd and CRI-O), the name of the interface inside the container is always eth0, so for 95% of the cases in Kubernetes the problem is easy to solve.
There are cases where people add additional interfaces with out-of-band mechanisms, as in #1271 (comment). In that case, there are several options:

  • use a randomly generated name with enough entropy
  • inspect the network namespace and check for duplicates
  • fail with a name collision error

Member

Exactly, but you can't inspect the netns because it hasn't been created yet. So how can those tools, before choosing a name for the interface inside the container, check which names were used by others? E.g., if NRI has several plugins and more than one adds an interface, how can the second plugin know eth1 was already added and avoid using that name?

A randomly generated name would be an option, but it would be nice to understand whether that is needed or whether people can just choose names that avoid collisions.

Author

In Kubernetes the network namespace is created by the runtime and there will be only an eth0 interface.
If there are more interfaces, it is because some component is adding them via an out-of-band process, which will have exactly the same problem. This works today because cluster administrators only set up one component to add additional interfaces.

This reinforces my point in #1271 (comment): using a well-defined specification will help multiple implementations synchronize, and we need this primitive to standardize these behaviors and to build higher-level APIs ... we are already doing it for Network Status kubernetes/enhancements#4817, and we need to do it for configuration based on this.

Member, @rata, Mar 4, 2025

I feel we are talking about different things. Let's assume this PR is in, implemented, etc. How does an NRI plugin choose a network interface name without colliding with a network interface added by another plugin?

Author

I'm not into the internal details of CDI or NRI, but I think those modify the OCI spec, so any plugin will be able to check in the OCI spec the transformations made by all the other plugins, including the interface names.

The runtime, when participating in container termination, must restore the original name to guarantee that the operations are idempotent, so a container that moves an interface and renames it can be created and destroyed multiple times with the same result.
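
One way to picture the naming rules above (default to the host name, fail on a duplicate, restore the original name on termination) is the sketch below. It rests on stated assumptions: `netDevice`, `containerNames`, and `restoreNames` are hypothetical names, and the collision check only compares the configured entries with each other; it does not inspect the container's network namespace, which the text leaves to the runtime.

```go
package main

import (
	"fmt"
	"sort"
)

// netDevice mirrors the shape of the proposed LinuxNetDevice type:
// a single optional name. The lowercase type is illustrative only.
type netDevice struct {
	Name string
}

// containerNames maps each host device name to the name it would have inside
// the container, defaulting to the host name, and rejects configurations in
// which two entries resolve to the same container name.
func containerNames(devs map[string]netDevice) (map[string]string, error) {
	hosts := make([]string, 0, len(devs))
	for host := range devs {
		hosts = append(hosts, host)
	}
	sort.Strings(hosts) // deterministic order, so errors are reproducible

	names := make(map[string]string, len(devs))
	owner := make(map[string]string, len(devs)) // container name -> host name
	for _, host := range hosts {
		name := devs[host].Name
		if name == "" {
			name = host
		}
		if prev, ok := owner[name]; ok {
			return nil, fmt.Errorf("devices %q and %q both map to %q", prev, host, name)
		}
		owner[name] = host
		names[host] = name
	}
	return names, nil
}

// restoreNames inverts the mapping (container name -> original host name),
// which is what a runtime would need in order to undo the rename.
func restoreNames(names map[string]string) map[string]string {
	orig := make(map[string]string, len(names))
	for host, name := range names {
		orig[name] = host
	}
	return orig
}

func main() {
	names, err := containerNames(map[string]netDevice{
		"eth0": {Name: "container_eth0"},
		"ens4": {},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(names["eth0"], names["ens4"])
	fmt.Println(restoreNames(names)["container_eth0"])
}
```

Because the configuration is keyed by host device name, the inverse mapping is unambiguous once the collision check has passed.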

### Example

#### Moving a device with a renamed interface inside the container:

```json
"netDevices": {
    "eth0": {
        "name": "container_eth0"
    }
}
```

This configuration will move the device named "eth0" from the host into the container's network namespace. Inside the container, the device will be named "container_eth0".
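
As a rough illustration of how a consumer might read this configuration, the sketch below decodes it with Go's `encoding/json` into types shaped like the ones this PR adds to `specs-go`. The lowercase type names, the `parse` helper, and the surrounding braces added to make the fragment a complete JSON document are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// linuxNetDevice mirrors the shape of the proposed LinuxNetDevice type.
type linuxNetDevice struct {
	Name string `json:"name,omitempty"`
}

// linuxConfig holds only the field relevant to this example.
type linuxConfig struct {
	NetDevices map[string]linuxNetDevice `json:"netDevices,omitempty"`
}

// example wraps the fragment above in braces so that it is valid JSON.
const example = `{"netDevices": {"eth0": {"name": "container_eth0"}}}`

// parse decodes a JSON document into linuxConfig.
func parse(data string) (linuxConfig, error) {
	var cfg linuxConfig
	err := json.Unmarshal([]byte(data), &cfg)
	return cfg, err
}

func main() {
	cfg, err := parse(example)
	if err != nil {
		panic(err)
	}
	// The map key is the host device name; Name is the in-container name.
	for host, dev := range cfg.NetDevices {
		fmt.Printf("%s -> %s\n", host, dev.Name)
	}
}
```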

## <a name="configLinuxControlGroups" />Control groups

Also known as cgroups, they are used to restrict resource usage for a container and handle device access.
@@ -975,6 +1013,8 @@ subset of the available options.
[mknod.1]: https://man7.org/linux/man-pages/man1/mknod.1.html
[mknod.2]: https://man7.org/linux/man-pages/man2/mknod.2.html
[namespaces.7_2]: https://man7.org/linux/man-pages/man7/namespaces.7.html
[net_device]: https://docs.kernel.org/networking/netdevices.html
[net_namespaces.7]: https://man7.org/linux/man-pages/man7/network_namespaces.7.html
[null.4]: https://man7.org/linux/man-pages/man4/null.4.html
[personality.2]: https://man7.org/linux/man-pages/man2/personality.2.html
[pts.4]: https://man7.org/linux/man-pages/man4/pts.4.html
14 changes: 14 additions & 0 deletions features-linux.md
@@ -228,3 +228,17 @@ Irrelevant to the availability of Intel RDT on the host operating system.
}
}
```

## <a name="linuxFeaturesNetDevices" />NetDevices

**`netDevices`** (object, OPTIONAL) represents the runtime's implementation status of Linux network devices.

* **`enabled`** (bool, OPTIONAL) represents whether the runtime supports the capability to move Linux network devices into the container's network namespace.

### Example

```json
"netDevices": {
    "enabled": true
}
```
6 changes: 6 additions & 0 deletions schema/config-linux.json
@@ -9,6 +9,12 @@
            "$ref": "defs-linux.json#/definitions/Device"
        }
    },
    "netDevices": {
        "type": "object",
        "additionalProperties": {
            "$ref": "defs-linux.json#/definitions/NetDevice"
        }
    },
    "uidMappings": {
        "type": "array",
        "items": {
8 changes: 8 additions & 0 deletions schema/defs-linux.json
@@ -189,6 +189,14 @@
            }
        }
    },
    "NetDevice": {
        "type": "object",
        "properties": {
            "name": {
                "type": "string"
            }
        }
    },
    "weight": {
        "$ref": "defs.json#/definitions/uint16"
    },
8 changes: 8 additions & 0 deletions schema/features-linux.json
@@ -110,6 +110,14 @@
                    }
                }
            }
        },
        "netDevices": {
            "type": "object",
            "properties": {
                "enabled": {
                    "type": "boolean"
                }
            }
        }
    }
}
13 changes: 13 additions & 0 deletions schema/test/config/bad/linux-netdevice.json
@@ -0,0 +1,13 @@
{
    "ociVersion": "1.0.0",
    "root": {
        "path": "rootfs"
    },
    "linux": {
        "netDevices": {
            "eth0": {
                "name": 23
Contributor

what is 23 here?

Author

an integer to cause a bad config since it is expecting a string

            }
        }
    }
}
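
The test case above is checked against the JSON schema, but the same mismatch can be illustrated with Go's `encoding/json`: decoding a number into a string field fails with a type error. This is only an analogue of the schema check, and `linuxNetDevice`, `decodeDevice`, and `isTypeError` are hypothetical names used for the sketch.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// linuxNetDevice mirrors the shape of the proposed LinuxNetDevice type.
type linuxNetDevice struct {
	Name string `json:"name,omitempty"`
}

// decodeDevice decodes a single netDevices entry value.
func decodeDevice(data string) error {
	var dev linuxNetDevice
	return json.Unmarshal([]byte(data), &dev)
}

// isTypeError reports whether err is a JSON type mismatch.
func isTypeError(err error) bool {
	var typeErr *json.UnmarshalTypeError
	return errors.As(err, &typeErr)
}

func main() {
	// An integer where a string is expected is rejected.
	fmt.Println(isTypeError(decodeDevice(`{"name": 23}`)))
	// A string value, or an empty object, decodes without error.
	fmt.Println(decodeDevice(`{"name": "container_eth0"}`) == nil)
	fmt.Println(decodeDevice(`{}`) == nil)
}
```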
15 changes: 15 additions & 0 deletions schema/test/config/good/linux-netdevice.json
@@ -0,0 +1,15 @@
{
    "ociVersion": "1.0.0",
    "root": {
        "path": "rootfs"
    },
    "linux": {
        "netDevices": {
            "eth0": {
                "name": "container_eth0"
            },
            "ens4": {},
            "ens5": {}
        }
    }
}
3 changes: 3 additions & 0 deletions schema/test/features/good/runc.json
@@ -182,6 +182,9 @@
        },
        "selinux": {
            "enabled": true
        },
        "netDevices": {
            "enabled": true
        }
    },
    "annotations": {
8 changes: 8 additions & 0 deletions specs-go/config.go
@@ -236,6 +236,8 @@ type Linux struct {
	Namespaces []LinuxNamespace `json:"namespaces,omitempty"`
	// Devices are a list of device nodes that are created for the container
	Devices []LinuxDevice `json:"devices,omitempty"`
	// NetDevices are key-value pairs, keyed by network device name on the host, moved to the container's network namespace.
	NetDevices map[string]LinuxNetDevice `json:"netDevices,omitempty"`
	// Seccomp specifies the seccomp security settings for the container.
	Seccomp *LinuxSeccomp `json:"seccomp,omitempty"`
	// RootfsPropagation is the rootfs mount propagation mode for the container.
@@ -491,6 +493,12 @@ type LinuxDevice struct {
	GID *uint32 `json:"gid,omitempty"`
}

// LinuxNetDevice represents a single network device to be added to the container's network namespace
type LinuxNetDevice struct {
	// Name of the device in the container namespace
	Name string `json:"name,omitempty"`
}

// LinuxDeviceCgroup represents a device rule for the devices specified to
// the device controller
type LinuxDeviceCgroup struct {
8 changes: 8 additions & 0 deletions specs-go/features/features.go
@@ -48,6 +48,7 @@ type Linux struct {
	Selinux         *Selinux         `json:"selinux,omitempty"`
	IntelRdt        *IntelRdt        `json:"intelRdt,omitempty"`
	MountExtensions *MountExtensions `json:"mountExtensions,omitempty"`
	NetDevices      *NetDevices      `json:"netDevices,omitempty"`
}

// Cgroup represents the "cgroup" field.
@@ -143,3 +144,10 @@ type IDMap struct {
	// Nil value means "unknown", not "false".
	Enabled *bool `json:"enabled,omitempty"`
}

// NetDevices represents the "netDevices" field.
type NetDevices struct {
	// Enabled is true if network devices support is compiled in.
	// Nil value means "unknown", not "false".
	Enabled *bool `json:"enabled,omitempty"`
}
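
Since a nil `Enabled` means "unknown" rather than "false", a consumer of the features document has three cases to handle. A minimal sketch follows, assuming a hypothetical `describe` helper and a local copy of the struct; treating a missing `netDevices` object as "unknown" as well is an assumption of this example, not something the PR states.

```go
package main

import "fmt"

// netDevicesFeature mirrors the shape of the NetDevices features type.
type netDevicesFeature struct {
	Enabled *bool
}

// describe turns the tri-state value into a label.
func describe(f *netDevicesFeature) string {
	switch {
	case f == nil || f.Enabled == nil:
		// Absent object (assumed) or nil pointer: the runtime did not say.
		return "unknown"
	case *f.Enabled:
		return "supported"
	default:
		return "unsupported"
	}
}

func main() {
	yes, no := true, false
	fmt.Println(describe(nil))
	fmt.Println(describe(&netDevicesFeature{}))
	fmt.Println(describe(&netDevicesFeature{Enabled: &yes}))
	fmt.Println(describe(&netDevicesFeature{Enabled: &no}))
}
```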