Skip to content

Don't exit if parsing of /proc/PID/maps fails #401

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Mar 20, 2025

Conversation

rockdaboot
Copy link
Contributor

@rockdaboot rockdaboot commented Mar 14, 2025

Fixes #366

I believe this is a remnant from old days.
The agent should not exit if there is a parsing error. Instead, log an error and continue with the next line of the maps file.

Scope

  • treat parsing errors gracefully
  • be more consistent with other parsing code and do not exit()
  • remove unnecessarily exported functions util.HexToUint64() and util.DecToUint64()
  • the formerly exported functions did not document the exiting behavior

Not in scope

  • dealing with breaking changes of the maps format (assumption is that this won't happen as it will break many different workflows / tools)

@rockdaboot rockdaboot self-assigned this Mar 14, 2025
@rockdaboot rockdaboot requested review from a team as code owners March 14, 2025 16:43
@rockdaboot rockdaboot enabled auto-merge (squash) March 14, 2025 16:52
@christos68k
Copy link
Member

christos68k commented Mar 14, 2025

What is assumed failure case here? The /proc/PID/maps format is specified, so if the agent runs into parsing errors we have to consider the root-cause:

  • Is it temporary or will it repeat?
  • Is it something that we expect to see once in a while or something exceptional?

If it's temporary and it doesn't repeat then continuing on parsing failures makes sense. If it repeats, then we'll have an agent that keeps running into the issue and we should probably treat it as a fatal failure and exit.

We should do some research and find out if /proc/PID/maps is guaranteed to be consistent on every read. If that's the case then parsing failures to me seem like exceptional fatal failures and the fail-safe approach is to exit (and if that's what we decide to do, we need a better approach than Fatalf to avoid taking down the OTel collector).

@rockdaboot
Copy link
Contributor Author

What is assumed failure case here? The /proc/PID/maps format is specified, so if the agent runs into parsing errors we have to consider the root-cause:

* Is it temporary or will it repeat?

* Is it something that we expect to see once in a while or something exceptional?

If it's temporary and it doesn't repeat then continuing on parsing failures makes sense. If it repeats, then we'll have an agent that keeps running into the issue and we should probably treat it as a fatal failure and exit.

We should do some research and find out if /proc/PID/maps is guaranteed to be consistent on every read. If that's the case then parsing failures to me seem like exceptional fatal failures and the fail-safe approach is to exit (and if that's what we decide to do, we need a better approach than Fatalf to avoid taking down the OTel collector).

The PR description has been amended to make the purpose of this PR more clear.

Additionally, I don't think it is worth the time to investigate the time for research whether proc/PID/maps is "guaranteed to be consistent". Because even if you could "prove" that a certain kernel version has mathematically safe code... code is not set in stone and there could be future changes with subtle regressions or bugs.

A different issue - and not in scope for this PR - is Whether it makes sense or not to detect general breaking changes in the maps files and how to deal with it. IMO, it is not worth the time as a different format breaks all consumers of those files and kernel devs try hard to avoid that.
But even if it happens, and the agent isn't able to read any mapping, the agent still reports this via logging. We currently do not have a metric for this error case, if you look at SynchronizeProcess(). We could drop the logs and instead introduce a metric, to avoid spamming of logs.

@fabled
Copy link
Contributor

fabled commented Mar 17, 2025

The agent should not exit if there is a parsing error. Instead, log an error and continue with the next line of the maps file.

I'd just make them debug if any.

But yes, having code removed - especially code that can abort the agent unexpected - is good. Even if technically that code should be unreachable, its better to remove the code.

@christos68k
Copy link
Member

Additionally, I don't think it is worth the time to investigate the time for research whether proc/PID/maps is "guaranteed to be consistent". Because even if you could "prove" that a certain kernel version has mathematically safe code... code is not set in stone and there could be future changes with subtle regressions or bugs.

Understanding how the underlying system works is useful here I think. I spent a few minutes looking into https://elixir.bootlin.com/linux/v4.19/source/fs/proc/task_mmu.c. My current (non exhaustive) understanding is that a read lock is taken which means that the contents of /proc/PID/maps are guaranteed to be consistent within a single read system call. This is also mentioned here. This also means that the agent can see inconsistencies/races across multiple calls.

We currently do not have a metric for this error case, if you look at SynchronizeProcess(). We could drop the logs and instead introduce a metric, to avoid spamming of logs.

A metric SGTM, esp since if we're running inside OTel collector, logging may not be under our full control.

@rockdaboot
Copy link
Contributor Author

@fabled @christos68k Changed to debug logs and added a metric for parsing/format errors.

Co-authored-by: Christos Kalkanis <christos@elastic.co>
@rockdaboot rockdaboot requested a review from christos68k March 19, 2025 08:29
@rockdaboot rockdaboot merged commit 0708a7f into open-telemetry:main Mar 20, 2025
26 checks passed
@rockdaboot rockdaboot deleted the fix-366 branch March 20, 2025 10:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't exit if parsing of /proc/PID/maps fails
4 participants