
[pkg/stanza/fileconsumer] Fix long line parsing #32100

Merged
merged 2 commits on Apr 23, 2024

Conversation

OverOrion
Contributor

@OverOrion OverOrion commented Apr 2, 2024

Description:
Flush could send partial input before EOF was reached; this PR fixes that.

Link to tracking Issue: #31512, #32170

Testing: Added unit test TestFlushPeriodEOF

Documentation: Added a note to force_flush_period option
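For context, `force_flush_period` is configured on the filelog receiver. As a hedged illustration (the path and value here are examples, not recommendations), a collector config might look like:

```yaml
receivers:
  filelog:
    include: [ /var/log/app/*.log ]
    # How long to wait for new data before flushing a partial
    # (unterminated) line downstream. Set to 0 to disable flushing.
    force_flush_period: 500ms
```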

@OverOrion OverOrion requested a review from djaglowski as a code owner April 2, 2024 12:44
@OverOrion OverOrion requested a review from a team April 2, 2024 12:44
@OverOrion OverOrion force-pushed the stanza-fix-scanner branch 2 times, most recently from 9683fd8 to 954249f Compare April 2, 2024 13:11
Member

@djaglowski djaglowski left a comment


Thanks @OverOrion and @ChrsMark for finding this. Unfortunately, I do not believe this implementation works.

The underlying array may point to data that will be overwritten by a subsequent call to Scan.

This certainly seems to be a problem. To articulate the issue a bit more: we do not call Scan again until we emit the token. However, because the token is emitted as a slice that directly references the scanner's buffer, its contents may change later.

Possible solutions then would seem to be:

  1. Copy the token into a new slice to ensure the underlying contents will not change.
  2. Clearly advise emit funcs that they may need to do this, depending on their need for correctness vs performance. (And then handle the copy in our emit funcs)

I think it's fair of us to prioritize correctness but it will be interesting to see the performance impact associated with copying every token. To that point, we will certainly need unit tests for this as well as benchmark results to understand the tradeoff we are introducing.

Longer term, we may want to look at replacing the scanner altogether, as it's not necessarily the most performant solution. It does however provide a high degree of robustness that will be difficult to reproduce.

@OverOrion OverOrion marked this pull request as draft April 4, 2024 07:19
@OverOrion OverOrion force-pushed the stanza-fix-scanner branch 6 times, most recently from 224b323 to 36c8f1d Compare April 5, 2024 14:37
@OverOrion OverOrion requested a review from djaglowski April 5, 2024 14:38
@OverOrion OverOrion marked this pull request as ready for review April 5, 2024 14:39
Member

@djaglowski djaglowski left a comment


Sorry for the delay, finally starting to catch up.

@OverOrion
Contributor Author

Thanks @djaglowski @ChrsMark, will address this today!

Flush should only happen if the scanner reached EOF

Signed-off-by: Szilard Parrag <szilard.parrag@axoflow.com>
@OverOrion OverOrion force-pushed the stanza-fix-scanner branch from 36c8f1d to c8e274b Compare April 22, 2024 06:55
Member

@djaglowski djaglowski left a comment


LGTM, just correcting the description a bit, since we update the flush timeout whenever new data is found, even if we do not emit a token.

Co-authored-by: Daniel Jaglowski <jaglows3@gmail.com>
@djaglowski djaglowski merged commit ba22e43 into open-telemetry:main Apr 23, 2024
156 checks passed
@github-actions github-actions bot added this to the next release milestone Apr 23, 2024
djaglowski added a commit that referenced this pull request Feb 3, 2025
…37596)

Fixes #35042 (and #32100 again)

The issue affected unterminated logs of particular lengths.
Specifically, longer than our internal `scanner.DefaultBufferSize`
(16kB) and shorter than `max_log_size`.

The failure mode was described in #32100 but was apparently only fixed
in some circumstances. I believe this is a more robust fix. I'll
articulate the exact failure mode again here:
1. During a poll cycle, `reader.ReadToEnd` is called. Within this, a
scanner is created which starts with a default buffer size. The buffer
is filled, but no terminator is found. Therefore the scanner resizes the
buffer to accommodate more data, hoping to find a terminator.
Eventually, the buffer is large enough to contain all content until EOF,
but still no terminator was found. At this time, the flush timer has not
expired, so `reader.ReadToEnd` returns without emitting anything.
2. During the _next_ poll cycle, `reader.ReadToEnd` creates a new
scanner, also with default buffer size. The first time it looks for a
terminator, it of course doesn't find one, but at this time the flush
timer has expired. Therefore, instead of resizing the buffer and
continuing to look for a terminator, it just emits what it has.

What should happen instead is the scanner continues to resize the buffer
to find as much of the unterminated token as possible before emitting
it. Therefore, this fix introduces a simple layer into the split func
stack which allows us to reason about unterminated tokens more
carefully. It captures the length of unterminated tokens and ensures
that when we recreate a scanner, we will start with a buffer size that
is appropriate to read the same content as last time, plus one
additional byte. The extra byte allows us to check if new content has
been added, in which case we will resume resizing. If no new content is
found, the flusher will emit the entire unterminated token as one.
chengchuanpeng pushed a commit to chengchuanpeng/opentelemetry-collector-contrib that referenced this pull request Feb 8, 2025