Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
netlink: Add wrapper for functions that fail with
ErrDumpInterrupted
According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. cilium#32099), the logged error message can still cause CI to fail (e.g. cilium#35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
- Loading branch information