Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

link: wrap errors for Raw{At,De}tachProgram #320

Merged
merged 1 commit into from
Jun 9, 2021
Merged

Conversation

cyphar
Copy link
Contributor

@cyphar cyphar commented Jun 8, 2021

This allows callers to detect the underlying syscall error, which is
necessary for being able to implement safe fallbacks based on the error
during program loading and unloading. runc in particular needs this to
be able to implement BPF_F_REPLACE fallbacks correctly.

Ref: opencontainers/runc#3008
Signed-off-by: Aleksa Sarai cyphar@cyphar.com

Copy link

@AkihiroSuda AkihiroSuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to open the same PR but my draft have the following changes 😛

diff --git a/internal/elf.go b/internal/elf.go
index 54a4313..88326bb 100644
--- a/internal/elf.go
+++ b/internal/elf.go
@@ -24,7 +24,7 @@ func NewSafeELFFile(r io.ReaderAt) (safe *SafeELFFile, err error) {
                }
 
                safe = nil
-               err = fmt.Errorf("reading ELF file panicked: %s", r)
+               err = fmt.Errorf("reading ELF file panicked: %w", r)
        }()
 
        file, err := elf.NewFile(r)
@@ -44,7 +44,7 @@ func (se *SafeELFFile) Symbols() (syms []elf.Symbol, err error) {
                }
 
                syms = nil
-               err = fmt.Errorf("reading ELF symbols panicked: %s", r)
+               err = fmt.Errorf("reading ELF symbols panicked: %w", r)
        }()
 
        syms, err = se.File.Symbols()
@@ -60,7 +60,7 @@ func (se *SafeELFFile) DynamicSymbols() (syms []elf.Symbol, err error) {
                }
 
                syms = nil
-               err = fmt.Errorf("reading ELF dynamic symbols panicked: %s", r)
+               err = fmt.Errorf("reading ELF dynamic symbols panicked: %w", r)
        }()
 
        syms, err = se.File.DynamicSymbols()

Copy link

@AkihiroSuda AkihiroSuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

diff --git a/cmd/bpf2go/compile.go b/cmd/bpf2go/compile.go
index a7509e1..0f0b28f 100644
--- a/cmd/bpf2go/compile.go
+++ b/cmd/bpf2go/compile.go
@@ -91,7 +91,7 @@ func compile(args compileArgs) error {
        }
 
        if err := cmd.Start(); err != nil {
-               return fmt.Errorf("can't execute %s: %s", args.cc, err)
+               return fmt.Errorf("can't execute %s: %w", args.cc, err)
        }
 
        if depRd != nil {
@@ -104,7 +104,7 @@ func compile(args compileArgs) error {
        }
 
        if err := cmd.Wait(); err != nil {
-               return fmt.Errorf("%s: %s", args.cc, err)
+               return fmt.Errorf("%s: %w", args.cc, err)
        }
 
        return nil

@cyphar
Copy link
Contributor Author

cyphar commented Jun 8, 2021

Ah, I did a search-and-replace rather than checking for the cases where the format string was more complicated. I'll enable the errorlint linter and will fix all the errors then push again.

@cyphar
Copy link
Contributor Author

cyphar commented Jun 8, 2021

Wait, @AkihiroSuda please stop sending reviews -- I'm fixing them all now. 😅

@AkihiroSuda
Copy link

@cyphar Sure (FYI: my draft is at AkihiroSuda@650e297)

@cyphar cyphar changed the title *: switch to %w formatting for all errors *: switch to Go 1.13-safe error handling Jun 8, 2021
@cyphar
Copy link
Contributor Author

cyphar commented Jun 8, 2021

@AkihiroSuda This now has fixes to pass all errorlint failures, which included porting things to error.As and error.Is where appropriate.

@lmb
Copy link
Collaborator

lmb commented Jun 8, 2021

Thanks for the PR, error wrapping is a bit muddled at the moment. I'm conflicted about this change: we're not going to make all wrapped errors part of our API contract. That is, if your code relies on unwrapping an error that is not explicitly documented by a function we may well break it in the future. Are you OK with that? @ti-mo do you have opinions on error wrapping vs. no error wrapping?

I also dug into the downstream issue you linked and I wonder whether you could use https://pkg.go.dev/github.com/cilium/ebpf/link#AttachCgroup instead? This should handle the cgroupv2 abstraction for you, including the bpf flags to use.

@cyphar
Copy link
Contributor Author

cyphar commented Jun 8, 2021

@lmb

We can't use that API for two reasons:

  • It depends on giving a path to a cgroup, but we have various hardenings such that the dirfd we use is definitely safe to operate on. To be fair, we could use /proc/self/fd/$fd but I'd prefer to not have to go through /proc if there is a way to use the dirfd directly.
  • As far as I can tell, there doesn't appear to be a nice way for us to detect that the error was "BPF_F_REPLACE is unsupported". Yes there is a custom error for that case, but we can't check it programmatically other than getting the string of the error and matching that (which isn't guaranteed to never change).

The first issue isn't that bad, but the second one kinda makes it untenable (I would've used that API if it weren't for the second issue). The only reason why we need to detect unix.* errors from RawAttachProgram is because I copied (and slightly modified) the code for haveProgAttachReplace() so we can use it in runc. If there was a way to be told that haveProgAttachReplace() == false in the error of RawAttachProgram, that would solve our problem and not require any other changes.

That is, if your code relies on unwrapping an error that is not explicitly documented by a function we may well break it in the future.

The only thing we really care about is being able to get the underlying syscall error from RawAttachProgram and RawDetachProgram (and having that be something we can rely on would be nice). I did a tree-wide change because it seemed that some of the internal and external functions were wrapping errors, so the ones that weren't were either an oversight or something like that. I can reduce this PR to just wrap RawAttachProgram and RawDetachProgram if that would cause less issues for you.

@lmb
Copy link
Collaborator

lmb commented Jun 8, 2021

I'd prefer to not have to go through /proc if there is a way to use the dirfd directly.
As far as I can tell, there doesn't appear to be a nice way for us to detect that the error was "BPF_F_REPLACE is unsupported".

Makes sense. What if we added an fd-only API? From your description it seems like something about BPF_F_REPLACE would still trip you up, but I'm missing context. How does your behaviour differ if F_REPLACE is not supported? Is the problem that the library forces BPF_F_REPLACE if F_ALLOW_MULTI is specified?

I can reduce this PR to just wrap RawAttachProgram and RawDetachProgram if that would cause less issues for you.

Yes please, it'll make it easier to review and get you unblocked.

@cyphar
Copy link
Contributor Author

cyphar commented Jun 8, 2021

@lmb

How does your behaviour differ if F_REPLACE is not supported?

We have slightly strange fallback behaviour. If BPF_F_REPLACE is not supported, we attach the program then detach the old one (effectively an inatomic BPF_F_REPLACE). This is okay for us because we're the only people managing the cgroup. I don't expect this fallback behaviour to be in the ebpf library, but being able to detect that "BPF_F_REPLACE isn't supported" was the error would allow us to fallback cleanly.

Yes please, it'll make it easier to review and get you unblocked.

Sure, I'll do that tomorrow. Thanks.

@florianl
Copy link
Contributor

florianl commented Jun 8, 2021

To me this reported issue sounds like a good example for feature probing (similar to the ideas in issue #235 ). Like @lmb I would appreciate it if not all errors would be part of the API.

This allows callers to detect the underlying syscall error, which is
necessary for being able to implement safe fallbacks based on the error
during program loading and unloading. runc in particular needs this to
be able to implement BPF_F_REPLACE fallbacks correctly.

Ref: opencontainers/runc#3008
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
@cyphar cyphar changed the title *: switch to Go 1.13-safe error handling link: wrap errors for Raw{At,De}tachProgram Jun 9, 2021
@cyphar
Copy link
Contributor Author

cyphar commented Jun 9, 2021

@florianl It would be nice if we could just directly use the feature probing that this library already does internally (and if we could do that, this PR wouldn't be necessary), but I expect that design discussions on such a feature will probably take some time -- we need some kind of workaround for this fairly quickly so we can do a runc release.

But personally I think it's fairly reasonable to expect that the errors from a "raw program API" would allow you to get the underlying syscall error. I've updated this PR to only add wrapping to the RawAttachProgram and RawDetachProgram.

@lmb lmb merged commit ed2c0a0 into cilium:master Jun 9, 2021
@ti-mo
Copy link
Collaborator

ti-mo commented Jun 9, 2021

A first proposal for feature probing is being worked on at the moment, but don't expect full coverage for all program and map types right out of the gate, that will take some time. Of course, if there is a strong use case for particular program types like F_REPLACE, those should be worked on first if it means we can positively impact (and gain adoption in) a project like runc.

I'm not sure where to go with this PR, though. On the surface, propagating errors all the way down to the user sounds reasonable. Do those errors then suddenly become part of the package API? Well, yes. We're not exporting sentinels, and we don't know which error(s) the kernel will give us. We could only document this like "returns error x when y" in godoc, but that doesn't make for a particularly strong API, as the kernel tends to change from under our feet.

If we want to make that work long-term, adding tests for specific error returns would be a base requirement. @cyphar this includes your most recent changes (wrapping RawAttachProgram and RawDetachProgram only). If we allow users to become reliant on certain error returns, we need to make sure that keeps working in the future. I get the urgency, but even these one-liners break with the current way of doing things, and we'll have to maintain them as an exception going forward.

Would it be possible to just rely on 'there was an error' for now to kick off your fallback behaviour, or is it really important to know the cause? If you're willing to implement a workaround, you could string-match errors for now until a feature probe API is available. This prevents your particular pain from potentially knock-on effects to other users of the library.

@cyphar cyphar deleted the wrap-errors branch June 9, 2021 07:58
@lmb
Copy link
Collaborator

lmb commented Jun 9, 2021

Good points @ti-mo.

Do those errors then suddenly become part of the package API? Well, yes. We're not exporting sentinels, and we don't know which error(s) the kernel will give us.

Like I said, if it's not documented and tested it's not part of the API from my POV. @cyphar I merged the PR so that you are unblocked, but I'd suggest adding a test and documentation if you don't want this to regress.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants