Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prog: avoid verifier loop of death #1693

Merged
merged 3 commits into from
Mar 13, 2025
Merged

Conversation

ti-mo
Copy link
Collaborator

@ti-mo ti-mo commented Feb 20, 2025

Supersedes #1679.

prog: factor retry decision into separate testable function

This commit special-cases LogDisabled into a clause without a retry loop since
it doesn't need a buffer allocated. It also simplifies the logic somewhat.

A follow-up commit will add another separate clause for a new LogSizeOverride
CollectionOption to factor out even more complexity related to buffer sizing.

The main purpose of this commit is to extract the decision-making logic around
retries. Instead of directly manipulating loop control flow, a function is
introduced that issues a retry verdict, and it can be tested indendently of
the program load action.
prog: give up when max verifier log size is reached

Previously, loading a program would loop forever if the log didn't fit
in the kernel's max log size.
prog: prevent prog_load loop of death

To limit the damage of any future regressions, limit the attempts we'll take
to load a program.

@lmb I eventually decided against deprecating LogSizeStart because it has utility beyond just break-glass cases, for example to reduce the amount of attempts the Cilium verifier tests take to obtain the full buffer, which is often quite large. We don't want to hardcode a buffer size there because the size varies significantly across programs and the environment is somewhat resource-constrained. It also ended up being more code; as of this proposal it's quite concise and testable, and I'm happy with it.

Copy link
Contributor

@brycekahle brycekahle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested and did not receive the same infinite loop we were seeing before.

@lmb
Copy link
Collaborator

lmb commented Feb 24, 2025

I eventually decided against deprecating LogSizeStart because it has utility beyond just break-glass cases, for example to reduce the amount of attempts the Cilium verifier tests take to obtain the full buffer, which is often quite large.

That tells me that the current approach just doesn't work very well and we should try to do better. I have a firm belief that making log size at all configurable was a mistake from the get go.

So: how large is "often quite large"?

lmb
lmb previously requested changes Feb 24, 2025

// Ensure the size doesn't overflow.
const factor = 2
if attr.LogSize >= maxVerifierLogSize/factor {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this whole factor thing not very intuitive. This is to avoid overflow?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an alternative suggestion? The intention is for this to keep working if/when maxVerifierLogSize gets bumped to MaxUint32. Currently, we can't rely on a u32 overflow happening, detecting it, and clamping the value to maxVerifierLogSize. Either way, that's more code and more obtuse than this solution. I'll go ahead with the current version for now.

// ENOSPC means the log was enabled on the previous iteration, so we only
// need to grow the buffer.
if errors.Is(err, unix.ENOSPC) {
if attr.LogTrueSize != 0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't preserve the behaviour that we don't retry if LogSize >= LogTrueSize. I'm also not sure why this only happens on ENOSPC? LogTrueSize should always be populated?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LogTrueSize is not available on older kernels

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the kernel returns ENOSPC, LogSize is smaller than LogTrueSize by definition, and we just need to allocate whatever LogTrueSize specified and retry. I don't think what I wrote here will result in a retry if LogSize > LogTrueSize. What am I missing?

@brycekahle
Copy link
Contributor

Where do we stand on this? Anything you need me to do? We'd like to update to get some other fixes/improvements, but not having this is a blocker.

@ti-mo
Copy link
Collaborator Author

ti-mo commented Mar 12, 2025

Just got back from PTO, I'll try to get this in tomorrow.

ti-mo added 3 commits March 13, 2025 09:54
This commit special-cases LogDisabled into a clause without a retry loop since
it doesn't need a buffer allocated. It also simplifies the logic somewhat.

A follow-up commit will add another separate clause for a new LogSizeOverride
CollectionOption to factor out even more complexity related to buffer sizing.

The main purpose of this commit is to extract the decision-making logic around
retries. Instead of directly manipulating loop control flow, a function is
introduced that issues a retry verdict, and it can be tested indendently of
the program load action.

Signed-off-by: Timo Beckers <timo@isovalent.com>
Previously, loading a program would loop forever if the log didn't fit
in the kernel's max log size.

Signed-off-by: Timo Beckers <timo@isovalent.com>
To limit the damage of any future regressions, limit the attempts we'll take
to load a program.

Signed-off-by: Timo Beckers <timo@isovalent.com>
@ti-mo ti-mo force-pushed the tb/loop-of-death branch from fbe9df5 to 3a89f29 Compare March 13, 2025 09:21
@ti-mo
Copy link
Collaborator Author

ti-mo commented Mar 13, 2025

That tells me that the current approach just doesn't work very well and we should try to do better. I have a firm belief that making log size at all configurable was a mistake from the get go.

I think at the time, you just picked whatever libbpf did. Over the years, this piece of code grew more and more complex. We added (and then removed) VerifierError.Truncated. Then LogTrueSize came and made things worse, although improving things in the long term. I remember a conversation years ago when I migrated the Cilium agent over to the lib, where we concluded to kick the can down the road and let the agent deal with automatic buffer resizing. Then years later you suddenly changed your mind and added resizing to the lib anyway.

Meanwhile, we only made half-hearted attempts at reining in the complexity, so we kept shipping regressions. I don't think it's helpful blaming this on a single decision made half a decade ago when the landscape was very different. :) In the current proposal, the starting size of the buffer is plumbed through in a way that makes it easy to remove if we really want to later on. If anything, blame it on the kernel API. 😉

So: how large is "often quite large"?

Large enough to matter. 😄 It doesn't really matter what the actual size is, this will vary per use case. The bigger the delta from minVerifierLogSize, the more retries have to be made. Think of the poor CPUs. 😢

@ti-mo ti-mo dismissed lmb’s stale review March 13, 2025 09:39

Made a few small tweaks based on suggestions. Going for merge now, to at least stop the bleeding. We can make more changes later if you really feel strongly about it.

@ti-mo ti-mo merged commit 9124d6f into cilium:main Mar 13, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants