
Implement on-write Chunk compaction #6858

Merged 5 commits into main on Jul 12, 2024
Conversation

@teh-cmc teh-cmc commented Jul 11, 2024

Chunks will now be compacted as they are written to the store, provided an appropriate candidate can be found.

When a Chunk gets written to the store, it will be merged with one of its direct neighbors, whichever is deemed more appropriate.
The algorithm to find and elect compaction candidates is very simple for now, being mostly focused on the happy path.

When a merge happens, two events get fired for the write instead of one: one addition for the new compacted chunk, and one deletion for the pre-existing chunk that got merged with the new incoming chunk, in that order.
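
In rough terms, the write path behaves like the following minimal sketch; every name in it (`ChunkStore`, `insert_chunk`, `find_compaction_candidate`, `ChunkStoreEvent`) is illustrative only, not the store's actual API:

```rust
use std::collections::HashMap;

struct Chunk {
    id: u64,
    rows: Vec<u64>, // stand-in for the chunk's actual columnar contents
}

enum ChunkStoreEvent {
    Addition(u64), // id of the newly written (possibly compacted) chunk
    Deletion(u64), // id of the pre-existing chunk that got merged away
}

#[derive(Default)]
struct ChunkStore {
    chunks: HashMap<u64, Chunk>,
    next_id: u64,
}

impl ChunkStore {
    /// Writes `chunk` to the store, merging it with a direct neighbor if an
    /// appropriate compaction candidate can be found.
    fn insert_chunk(&mut self, chunk: Chunk) -> Vec<ChunkStoreEvent> {
        if let Some(neighbor_id) = self.find_compaction_candidate(&chunk) {
            // Merge the pre-existing neighbor with the incoming chunk.
            let neighbor = self.chunks.remove(&neighbor_id).unwrap();
            let mut rows = neighbor.rows;
            rows.extend(chunk.rows);
            let compacted = Chunk { id: self.next_id, rows };
            self.next_id += 1;
            let compacted_id = compacted.id;
            self.chunks.insert(compacted_id, compacted);
            // Two events for a single write, in this order: the addition of
            // the new compacted chunk, then the deletion of the chunk that
            // got merged into it.
            vec![
                ChunkStoreEvent::Addition(compacted_id),
                ChunkStoreEvent::Deletion(neighbor_id),
            ]
        } else {
            // No suitable candidate: a plain write, a single addition event.
            let id = chunk.id;
            self.chunks.insert(id, chunk);
            vec![ChunkStoreEvent::Addition(id)]
        }
    }

    /// Elects the most appropriate direct neighbor to merge with, if any.
    fn find_compaction_candidate(&self, _incoming: &Chunk) -> Option<u64> {
        None // candidate election elided; see the review discussion below
    }
}
```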

Some numbers:

```
273.74 MB plot_stress_5x10_50k_2khz.rrd
105.77 MB plot_stress_5x10_50k_2khz_compacted_256_rows_inf_bytes.rrd
100.94 MB plot_stress_5x10_50k_2khz_compacted_512_rows_inf_bytes.rrd
 98.73 MB plot_stress_5x10_50k_2khz_compacted_1024_rows_inf_bytes.rrd
 97.85 MB plot_stress_5x10_50k_2khz_compacted_2048_rows_inf_bytes.rrd
 97.21 MB plot_stress_5x10_50k_2khz_compacted_4096_rows_inf_bytes.rrd
```

which is pretty much optimal given our current data model.

Because event subscribers are now by far the main bottleneck on the ingestion path, this PR introduces a toggle to disable subscribers, which is very useful when running in headless mode (e.g. our CLI tools).
This will be used in an upcoming PR.
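
A process-wide atomic flag checked on the event-dispatch path is one straightforward way to implement such a toggle; the sketch below is an assumption about the shape of the mechanism, not the PR's actual API (it reuses the illustrative `ChunkStoreEvent` type from the sketch above):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Process-wide switch; when disabled, store events are never forwarded to
/// subscribers at all.
static SUBSCRIBERS_ENABLED: AtomicBool = AtomicBool::new(true);

pub fn set_subscribers_enabled(enabled: bool) {
    SUBSCRIBERS_ENABLED.store(enabled, Ordering::Relaxed);
}

fn notify_subscribers(events: &[ChunkStoreEvent]) {
    if !SUBSCRIBERS_ENABLED.load(Ordering::Relaxed) {
        return; // headless mode (e.g. CLI tools): skip the bottleneck entirely
    }
    // ...dispatch `events` to every registered subscriber...
    let _ = events;
}
```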

Checklist

  • I have read and agree to the Contributor Guide and the Code of Conduct
  • I've included a screenshot or gif (if applicable)
  • I have tested the web demo (if applicable):
  • The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG
  • If applicable, add a new check to the release checklist!
  • I have noted any breaking changes to the log API in CHANGELOG.md and the migration guide

To run all checks from main, comment on the PR with @rerun-bot full-check.

@teh-cmc teh-cmc added the ⛃ re_datastore (affects the datastore itself), 🚀 performance (Optimization, memory use, etc), do-not-merge (Do not merge this PR), and include in changelog labels Jul 11, 2024
@jleibs jleibs self-requested a review July 11, 2024 11:15
@jleibs jleibs left a comment

Looks good, though we'll certainly want to revisit the compaction strategy in the future.

```
@@ -74,6 +74,9 @@ impl std::fmt::Display for GarbageCollectionTarget {
    }
}

pub type RemovableChunkIdPerTimePerComponentPerTimelinePerEntity =
```

@jleibs commented:

😆 😭

```
.any(|e| e.kind != ChunkStoreDiffKind::Addition);
assert!(!any_event_other_than_addition);
}

/// Finds the most appropriate candidate for compaction.
```

@jleibs commented:

This is interesting, though the heuristic nature makes it hard to reason through how it works in practice.

Given that chunks will almost always arrive in order on log_time / log_seq, this seems like it will always create a bias toward 2 votes for the trivial arrival-order compaction. My suspicion is that we would be better off biasing toward compaction along the user-defined timeline when there is one, rather than the arrival order, since that is likely to be the natural view and the most likely timeline for range queries.

An additional observation is that chunk-overlap seems like it should be one of the main drivers of compaction. Any time 2 chunks overlap on one or more timelines, it will cause performance issues for us.

If we have an opportunity to prevent an overlap from being created through compaction, that seems like it will always be a net win.

@teh-cmc (author) replied:

Yeah there's definitely an infinite stream of possible improvements in this space.

I'm hoping this simple vote system can help us quickly experiment with more complex biases as we go ("you overlap? +5 to you!", "you're a user-defined timeline? +4 to you!").
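
For concreteness, a toy version of such a vote system could look like the sketch below; the field names and weights are invented to illustrate how biases might be layered on, and are not the PR's actual heuristic:

```rust
/// Per-candidate features feeding the hypothetical vote system below.
struct CandidateStats {
    is_arrival_order_neighbor: bool, // directly adjacent in arrival order
    overlaps_incoming: bool,         // shares a time range with the new chunk
    adjacent_on_user_timeline: bool, // adjacent on a user-defined timeline
}

/// Elects the candidate with the most votes, if any.
fn elect_candidate(candidates: &[CandidateStats]) -> Option<&CandidateStats> {
    candidates.iter().max_by_key(|c| {
        let mut votes = 0i32;
        if c.is_arrival_order_neighbor {
            votes += 2; // the trivial arrival-order bias discussed above
        }
        if c.overlaps_incoming {
            votes += 5; // "you overlap? +5 to you!"
        }
        if c.adjacent_on_user_timeline {
            votes += 4; // "you're a user-defined timeline? +4 to you!"
        }
        votes
    })
}
```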

Base automatically changed from cmc/chunk_merge_primitives to main July 12, 2024 07:33
@teh-cmc teh-cmc force-pushed the cmc/store_chunks_compaction branch from 2342e2b to 1b8697d on July 12, 2024 07:34
@teh-cmc teh-cmc removed the do-not-merge (Do not merge this PR) label Jul 12, 2024
@teh-cmc teh-cmc merged commit e1ffd65 into main Jul 12, 2024
30 of 31 checks passed
@teh-cmc teh-cmc deleted the cmc/store_chunks_compaction branch July 12, 2024 07:40
teh-cmc added a commit that referenced this pull request Jul 12, 2024
Title.

```
$ rerun compact --help

Compacts the contents of an .rrd or .rbl file and writes the result to a new file.

Use the usual environment variables to control the compaction thresholds: `RERUN_CHUNK_MAX_ROWS`, `RERUN_CHUNK_MAX_ROWS_IF_UNSORTED`, `RERUN_CHUNK_MAX_BYTES`.

Example: `RERUN_CHUNK_MAX_ROWS=4096 RERUN_CHUNK_MAX_BYTES=1048576 rerun compact -i input.rrd -o output.rrd`

Usage: rerun compact --input <src.rrd> --output <dst.rrd>

Options:
  -i, --input <src.rrd>


  -o, --output <dst.rrd>


  -h, --help
          Print help (see a summary with '-h')
```

```
$ rerun compact -i plot_stress_5x10_50k_2khz.rrd -o /tmp/out.rrd
[2024-07-11T10:55:09Z INFO  rerun::run] compaction started src="plot_stress_5x10_50k_2khz.rrd" src_size_bytes=261 MiB dst="/tmp/out.rrd" max_num_rows=1 024 max_num_bytes=8.0 MiB
[2024-07-11T10:55:16Z INFO  rerun::run] compaction finished src="plot_stress_5x10_50k_2khz.rrd" src_size_bytes=261 MiB dst="/tmp/out.rrd" dst_size_bytes=94.3 MiB time=7.376564451s compaction_ratio="63.895%"
```

- DNM: Requires #6858
Successfully merging this pull request may close these issues.

Implement in-memory on-write Chunk compaction