Implement on-write `Chunk` compaction #6858
Conversation
Looks good, though we'll certainly want to revisit the compaction strategy in the future.
```diff
@@ -74,6 +74,9 @@ impl std::fmt::Display for GarbageCollectionTarget {
     }
 }

+pub type RemovableChunkIdPerTimePerComponentPerTimelinePerEntity =
```
😆 😭
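For readers puzzling over the name: a type like this usually flattens to one map nesting level per "Per", read right to left. A purely illustrative expansion (the real definition is truncated out of the diff above, and the stand-in key types here are not the crate's):

```rust
use std::collections::BTreeMap;

// Illustrative stand-ins; the real store uses dedicated key and ID types.
type EntityPath = String;
type Timeline = String;
type ComponentName = String;
type TimeInt = i64;
type ChunkId = u64;

/// Entity -> timeline -> component -> time -> removable chunk IDs.
type RemovableChunkIdPerTimePerComponentPerTimelinePerEntity = BTreeMap<
    EntityPath,
    BTreeMap<Timeline, BTreeMap<ComponentName, BTreeMap<TimeInt, Vec<ChunkId>>>>,
>;
```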
```rust
        .any(|e| e.kind != ChunkStoreDiffKind::Addition);
    assert!(!any_event_other_than_addition);
}
```

```rust
/// Finds the most appropriate candidate for compaction.
```
This is interesting, though the heuristic nature makes it hard to reason through how it works in practice. Given that chunks will almost always arrive in order on `log_time` / `log_seq`, it seems like there will always be a bias toward two votes for the trivial arrival-order compaction. My suspicion is that we would be better off biasing toward compacting along the user-defined timeline when there is one, rather than arrival order, since the user-defined timeline is likely to be the natural view and the most likely timeline for range queries.

An additional observation is that chunk overlap seems like it should be one of the main drivers of compaction. Any time two chunks overlap on one or more timelines, it causes performance issues for us. If we have an opportunity to prevent an overlap from being created through compaction, that seems like it will always be a net win.
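For concreteness, here is a minimal sketch of the overlap check being argued for, using simplified stand-in types rather than the crate's actual `Chunk` API:

```rust
use std::collections::BTreeMap;

/// Stand-in for an inclusive per-timeline time range.
#[derive(Clone, Copy)]
struct TimeRange {
    min: i64,
    max: i64,
}

impl TimeRange {
    /// Two inclusive ranges overlap iff neither ends before the other starts.
    fn intersects(&self, other: &TimeRange) -> bool {
        self.min <= other.max && other.min <= self.max
    }
}

/// Stand-in view of a chunk: its time range on each timeline it touches.
struct ChunkTimeRanges {
    per_timeline: BTreeMap<String, TimeRange>,
}

/// True if the two chunks overlap on one or more timelines -- the case the
/// comment above argues compaction should actively prevent.
fn chunks_overlap(a: &ChunkTimeRanges, b: &ChunkTimeRanges) -> bool {
    a.per_timeline.iter().any(|(timeline, range_a)| {
        b.per_timeline
            .get(timeline)
            .map_or(false, |range_b| range_a.intersects(range_b))
    })
}
```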
Yeah there's definitely an infinite stream of possible improvements in this space.
I'm hoping this simple vote system can help us quickly experiment with more complex biases as we go ("you overlap? +5 to you!", "you're a user-defined timeline? +4 to you!").
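A sketch of what such a weighted vote could look like; the names and weights below are illustrative riffs on this comment, not the PR's actual heuristic:

```rust
/// Illustrative scoring for a compaction candidate.
struct CandidateScore {
    /// Merging this candidate would remove (or prevent) an overlap.
    removes_overlap: bool,
    /// The candidate compacts along a user-defined timeline.
    on_user_timeline: bool,
    /// Votes from the simple per-timeline heuristic.
    base_votes: u32,
}

impl CandidateScore {
    fn total(&self) -> u32 {
        self.base_votes
            + if self.removes_overlap { 5 } else { 0 } // "you overlap? +5 to you!"
            + if self.on_user_timeline { 4 } else { 0 } // "user-defined timeline? +4 to you!"
    }
}

/// Elect the candidate with the highest total score, if any.
fn elect(candidates: &[CandidateScore]) -> Option<usize> {
    candidates
        .iter()
        .enumerate()
        .max_by_key(|(_, candidate)| candidate.total())
        .map(|(index, _)| index)
}
```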
Force-pushed from `2342e2b` to `1b8697d`.
Title.

```
$ rerun compact --help
Compacts the contents of an .rrd or .rbl file and writes the result to a new file.

Use the usual environment variables to control the compaction thresholds:
`RERUN_CHUNK_MAX_ROWS`, `RERUN_CHUNK_MAX_ROWS_IF_UNSORTED`, `RERUN_CHUNK_MAX_BYTES`.

Example: `RERUN_CHUNK_MAX_ROWS=4096 RERUN_CHUNK_MAX_BYTES=1048576 rerun compact -i input.rrd -o output.rrd`

Usage: rerun compact --input <src.rrd> --output <dst.rrd>

Options:
  -i, --input <src.rrd>
  -o, --output <dst.rrd>
  -h, --help              Print help (see a summary with '-h')
```

```
$ rerun compact -i plot_stress_5x10_50k_2khz.rrd -o /tmp/out.rrd
[2024-07-11T10:55:09Z INFO rerun::run] compaction started src="plot_stress_5x10_50k_2khz.rrd" src_size_bytes=261 MiB dst="/tmp/out.rrd" max_num_rows=1 024 max_num_bytes=8.0 MiB
[2024-07-11T10:55:16Z INFO rerun::run] compaction finished src="plot_stress_5x10_50k_2khz.rrd" src_size_bytes=261 MiB dst="/tmp/out.rrd" dst_size_bytes=94.3 MiB time=7.376564451s compaction_ratio="63.895%"
```

- DNM: Requires #6858
`Chunk`s will now be compacted as they are written to the store, provided an appropriate candidate can be found.

When a `Chunk` gets written to the store, it will be merged with one of its direct neighbors, whichever is deemed more appropriate. The algorithm to find and elect compaction candidates is very simple for now, being mostly focused on the happy-path case.

When a merge happens, two events get fired for the write instead of one: an addition for the new compacted chunk, and a deletion for the pre-existing chunk that got merged with the new incoming chunk, in that order.
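A minimal sketch of what a subscriber sees on a compacting write, using simplified stand-ins modeled on the `ChunkStoreDiffKind` from the diff above (the real event type carries more fields):

```rust
/// Simplified stand-ins for the store's event types.
#[derive(Debug, PartialEq)]
enum ChunkStoreDiffKind {
    Addition,
    Deletion,
}

struct ChunkStoreEvent {
    kind: ChunkStoreDiffKind,
    chunk_id: u64, // illustrative; the real store uses a dedicated ID type
}

/// On a compacting write, exactly two events fire, in order: the addition of
/// the new compacted chunk, then the deletion of the chunk it absorbed.
fn on_compacting_write(events: &[ChunkStoreEvent]) {
    if let [addition, deletion] = events {
        assert_eq!(addition.kind, ChunkStoreDiffKind::Addition);
        assert_eq!(deletion.kind, ChunkStoreDiffKind::Deletion);
        // A subscriber maintaining secondary indices would register
        // `addition.chunk_id` before dropping `deletion.chunk_id`.
    }
}
```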
Some numbers:
which is pretty much optimal given our current data model.
Because event subscribers are now by far the main bottleneck on the ingestion path, this PR introduces a toggle to disable subscribers, which is very useful when running in headless mode (e.g. our CLI tools).
This will be used in an upcoming PR.
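The toggle itself isn't visible in this excerpt; a plausible shape, assuming it lives on the store's config struct (the field name is an assumption, not the verified API):

```rust
/// Illustrative config; the field name is an assumption.
struct ChunkStoreConfig {
    /// When false, the store emits no events at all, so subscribers never
    /// run -- useful for headless ingestion paths such as CLI tools.
    enable_changelog: bool,
    // ...compaction thresholds (max rows, max bytes), etc.
}
```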
`Chunk` concatenation primitives #6857

Checklist
- `main` build: rerun.io/viewer
- `nightly` build: rerun.io/viewer
- `CHANGELOG.md` and the migration guide

To run all checks from `main`, comment on the PR with `@rerun-bot full-check`.