netlink: Add wrapper for functions that fail with ErrDumpInterrupted
According to the kernel docs[^1], the kernel can return incomplete
results for netlink state dumps if the state changes while we are
dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The
`vishvananda/netlink` library returned `EINTR` since v1.2.1, but more
recent versions have changed it such that it returns
`netlink.ErrDumpInterrupted` instead[^2].

These interruptions seem common in high-churn environments. If the error
occurs, it is in most cases best to just try again. Therefore, this
commit adds a wrapper for all `netlink` functions marked to return
`ErrDumpInterrupted` that retries the function up to 30 times until it
either succeeds or returns a different error.
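
The retry behavior described above can be sketched as follows. This is a
minimal illustration, not the code added by this commit: the helper name
`withRetryResult`, the constant `maxAttempts`, and the sentinel
`errDumpInterrupted` (a stand-in for `netlink.ErrDumpInterrupted`, so the
sketch compiles without the library) are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// errDumpInterrupted is a hypothetical stand-in for
// netlink.ErrDumpInterrupted, defined here so the sketch has no
// dependency on vishvananda/netlink.
var errDumpInterrupted = errors.New("dump interrupted")

// maxAttempts mirrors the limit of 30 tries named in the commit message.
const maxAttempts = 30

// withRetryResult calls fn until it returns anything other than
// errDumpInterrupted (nil or a different error), or until maxAttempts
// calls have been made. The result of the last call is returned.
func withRetryResult[T any](fn func() (T, error)) (out T, err error) {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		out, err = fn()
		if !errors.Is(err, errDumpInterrupted) {
			return out, err
		}
	}
	return out, err
}

func main() {
	calls := 0
	// fakeLinkList is a hypothetical dump function that is interrupted
	// twice before it succeeds.
	fakeLinkList := func() ([]string, error) {
		calls++
		if calls < 3 {
			return nil, errDumpInterrupted
		}
		return []string{"lo", "eth0"}, nil
	}

	links, err := withRetryResult(fakeLinkList)
	fmt.Println(links, err, calls)
}
```

A generic helper of this shape lets each wrapped dump function keep its
original return type, and any error other than the interruption is passed
through to the caller unchanged.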

While many call sites do have their own high-level retry mechanism (see
e.g. cilium#32099), the logged error message can still cause CI
to fail (e.g. cilium#35259). Long high-level retry intervals can
also become problematic: For example, if the routing setup fails due to
`NLM_F_DUMP_INTR` during a CNI ADD invocation, the retry adds
several seconds of additional delay to an already overloaded system,
instead of resolving the issue quickly.

A subsequent commit will add an additional linter that nudges developers
to use this new `safenetlink` package for function calls that can be
interrupted. This ensures that we don't have to add retries in all
subsystems individually.

[^1]: https://docs.kernel.org/userspace-api/netlink/intro.html
[^2]: vishvananda/netlink#1018

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
gandro committed Oct 29, 2024
1 parent 8afaf2b commit a9344f1
Showing 4 changed files with 587 additions and 0 deletions.
1 change: 1 addition & 0 deletions CODEOWNERS
@@ -594,6 +594,7 @@ Makefile* @cilium/build
/pkg/resiliency @cilium/sig-agent
/pkg/revert/ @cilium/sig-agent
/pkg/safeio @cilium/sig-agent
/pkg/safenetlink @cilium/sig-datapath
/pkg/safetime/ @cilium/sig-agent
/pkg/service @cilium/sig-lb
/pkg/shortener @cilium/sig-foundations @cilium/sig-k8s