
Implement on-write Chunk compaction #6858

Merged 5 commits into main on Jul 12, 2024
Conversation

@teh-cmc teh-cmc commented Jul 11, 2024

Chunks will now be compacted as they are written to the store, provided an appropriate candidate can be found.

When a Chunk gets written to the store, it will be merged with one of its direct neighbors, whichever is deemed more appropriate.
The algorithm to find and elect compaction candidates is very simple for now, being mostly focused on the happy path.

When a merge happens, two events get fired for the write instead of one: one addition for the new compacted chunk, and one deletion for the pre-existing chunk that got merged with the new incoming chunk, in that order.
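
In rough terms, the write path behaves like the following minimal sketch; every name in it (`ChunkStore`, `insert_chunk`, `find_compaction_candidate`, `ChunkStoreEvent`) is illustrative only, not the store's actual API:

```rust
use std::collections::HashMap;

struct Chunk {
    id: u64,
    rows: Vec<u64>, // stand-in for the chunk's actual columnar contents
}

enum ChunkStoreEvent {
    Addition(u64), // id of the newly written (possibly compacted) chunk
    Deletion(u64), // id of the pre-existing chunk that got merged away
}

#[derive(Default)]
struct ChunkStore {
    chunks: HashMap<u64, Chunk>,
    next_id: u64,
}

impl ChunkStore {
    /// Writes `chunk` to the store, merging it with a direct neighbor if an
    /// appropriate compaction candidate can be found.
    fn insert_chunk(&mut self, chunk: Chunk) -> Vec<ChunkStoreEvent> {
        if let Some(neighbor_id) = self.find_compaction_candidate(&chunk) {
            // Merge the pre-existing neighbor with the incoming chunk.
            let neighbor = self.chunks.remove(&neighbor_id).unwrap();
            let mut rows = neighbor.rows;
            rows.extend(chunk.rows);
            let compacted = Chunk { id: self.next_id, rows };
            self.next_id += 1;
            let compacted_id = compacted.id;
            self.chunks.insert(compacted_id, compacted);
            // Two events for a single write, in this order: the addition of
            // the new compacted chunk, then the deletion of the chunk that
            // got merged into it.
            vec![
                ChunkStoreEvent::Addition(compacted_id),
                ChunkStoreEvent::Deletion(neighbor_id),
            ]
        } else {
            // No suitable candidate: a plain write, a single addition event.
            let id = chunk.id;
            self.chunks.insert(id, chunk);
            vec![ChunkStoreEvent::Addition(id)]
        }
    }

    /// Elects the most appropriate direct neighbor to merge with, if any.
    fn find_compaction_candidate(&self, _incoming: &Chunk) -> Option<u64> {
        None // candidate election elided; see the review discussion below
    }
}
```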

Some numbers:

```
273.74 MB plot_stress_5x10_50k_2khz.rrd
105.77 MB plot_stress_5x10_50k_2khz_compacted_256_rows_inf_bytes.rrd
100.94 MB plot_stress_5x10_50k_2khz_compacted_512_rows_inf_bytes.rrd
 98.73 MB plot_stress_5x10_50k_2khz_compacted_1024_rows_inf_bytes.rrd
 97.85 MB plot_stress_5x10_50k_2khz_compacted_2048_rows_inf_bytes.rrd
 97.21 MB plot_stress_5x10_50k_2khz_compacted_4096_rows_inf_bytes.rrd
```

which is pretty much optimal given our current data model.

Because event subscribers are now by far the main bottleneck on the ingestion path, this PR introduces a toggle to disable subscribers, which is very useful when running in headless mode (e.g. our CLI tools).
This will be used in an upcoming PR.
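
A process-wide atomic flag checked on the event-dispatch path is one straightforward way to implement such a toggle; the sketch below is an assumption about the shape of the mechanism, not the PR's actual API (it reuses the illustrative `ChunkStoreEvent` type from the sketch above):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Process-wide switch; when disabled, store events are never forwarded to
/// subscribers at all.
static SUBSCRIBERS_ENABLED: AtomicBool = AtomicBool::new(true);

pub fn set_subscribers_enabled(enabled: bool) {
    SUBSCRIBERS_ENABLED.store(enabled, Ordering::Relaxed);
}

fn notify_subscribers(events: &[ChunkStoreEvent]) {
    if !SUBSCRIBERS_ENABLED.load(Ordering::Relaxed) {
        return; // headless mode (e.g. CLI tools): skip the bottleneck entirely
    }
    // ...dispatch `events` to every registered subscriber...
    let _ = events;
}
```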

Checklist

  • I have read and agree to the Contributor Guide and the Code of Conduct
  • I've included a screenshot or gif (if applicable)
  • I have tested the web demo (if applicable):
  • The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG
  • If applicable, add a new check to the release checklist!
  • I have noted any breaking changes to the log API in CHANGELOG.md and the migration guide

To run all checks from main, comment on the PR with @rerun-bot full-check.

@teh-cmc teh-cmc added the ⛃ re_datastore (affects the datastore itself), 🚀 performance (Optimization, memory use, etc), do-not-merge (Do not merge this PR), and include in changelog labels Jul 11, 2024
@jleibs jleibs self-requested a review July 11, 2024 11:15
@jleibs jleibs left a comment

Looks good, though we'll certainly want to revisit the compaction strategy in the future.

```
@@ -74,6 +74,9 @@ impl std::fmt::Display for GarbageCollectionTarget {
    }
}

pub type RemovableChunkIdPerTimePerComponentPerTimelinePerEntity =
```

@jleibs commented:

😆 😭

```
.any(|e| e.kind != ChunkStoreDiffKind::Addition);
assert!(!any_event_other_than_addition);
}

/// Finds the most appropriate candidate for compaction.
```

@jleibs commented:

This is interesting, though the heuristic nature makes it hard to reason through how it works in practice.

Given that chunks will almost always arrive in order on log_time / log_seq, this seems like it will always create a bias toward 2 votes for the trivial arrival-order compaction. My suspicion is that we would be better off biasing toward compaction along the user-defined timeline when there is one, rather than the arrival order, since that is likely to be the natural view and the most likely timeline for range queries.

An additional observation is that chunk-overlap seems like it should be one of the main drivers of compaction. Any time 2 chunks overlap on one or more timelines, it will cause performance issues for us.

If we have an opportunity to prevent an overlap from being created through compaction, that seems like it will always be a net win.

@teh-cmc (author) replied:

Yeah there's definitely an infinite stream of possible improvements in this space.

I'm hoping this simple vote system can help us quickly experiment with more complex biases as we go ("you overlap? +5 to you!", "you're a user-defined timeline? +4 to you!").
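
For concreteness, a toy version of such a vote system could look like the sketch below; the field names and weights are invented to illustrate how biases might be layered on, and are not the PR's actual heuristic:

```rust
/// Per-candidate features feeding the hypothetical vote system below.
struct CandidateStats {
    is_arrival_order_neighbor: bool, // directly adjacent in arrival order
    overlaps_incoming: bool,         // shares a time range with the new chunk
    adjacent_on_user_timeline: bool, // adjacent on a user-defined timeline
}

/// Elects the candidate with the most votes, if any.
fn elect_candidate(candidates: &[CandidateStats]) -> Option<&CandidateStats> {
    candidates.iter().max_by_key(|c| {
        let mut votes = 0i32;
        if c.is_arrival_order_neighbor {
            votes += 2; // the trivial arrival-order bias discussed above
        }
        if c.overlaps_incoming {
            votes += 5; // "you overlap? +5 to you!"
        }
        if c.adjacent_on_user_timeline {
            votes += 4; // "you're a user-defined timeline? +4 to you!"
        }
        votes
    })
}
```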

Base automatically changed from cmc/chunk_merge_primitives to main July 12, 2024 07:33
@teh-cmc teh-cmc force-pushed the cmc/store_chunks_compaction branch from 2342e2b to 1b8697d on July 12, 2024 07:34
@teh-cmc teh-cmc removed the do-not-merge (Do not merge this PR) label Jul 12, 2024
@teh-cmc teh-cmc merged commit e1ffd65 into main Jul 12, 2024
30 of 31 checks passed
@teh-cmc teh-cmc deleted the cmc/store_chunks_compaction branch July 12, 2024 07:40
teh-cmc added a commit that referenced this pull request Jul 12, 2024
Title.

```
$ rerun compact --help

Compacts the contents of an .rrd or .rbl file and writes the result to a new file.

Use the usual environment variables to control the compaction thresholds: `RERUN_CHUNK_MAX_ROWS`, `RERUN_CHUNK_MAX_ROWS_IF_UNSORTED`, `RERUN_CHUNK_MAX_BYTES`.

Example: `RERUN_CHUNK_MAX_ROWS=4096 RERUN_CHUNK_MAX_BYTES=1048576 rerun compact -i input.rrd -o output.rrd`

Usage: rerun compact --input <src.rrd> --output <dst.rrd>

Options:
  -i, --input <src.rrd>


  -o, --output <dst.rrd>


  -h, --help
          Print help (see a summary with '-h')
```

```
$ rerun compact -i plot_stress_5x10_50k_2khz.rrd -o /tmp/out.rrd
[2024-07-11T10:55:09Z INFO  rerun::run] compaction started src="plot_stress_5x10_50k_2khz.rrd" src_size_bytes=261 MiB dst="/tmp/out.rrd" max_num_rows=1 024 max_num_bytes=8.0 MiB
[2024-07-11T10:55:16Z INFO  rerun::run] compaction finished src="plot_stress_5x10_50k_2khz.rrd" src_size_bytes=261 MiB dst="/tmp/out.rrd" dst_size_bytes=94.3 MiB time=7.376564451s compaction_ratio="63.895%"
```

- DNM: Requires #6858
Successfully merging this pull request may close these issues.

Implement in-memory on-write Chunk compaction