Commit 3e50dd2

Design doc wip
1 parent 9a7a199 commit 3e50dd2

4 files changed

+189
-0
lines changed

.gitignore

+1
@@ -33,3 +33,4 @@ doc/*
33 33
/bazel-*
34 34

35 35
/.vscode/
36+
.DS_store

docs/internals/COMPACTION.md

+188
@@ -0,0 +1,188 @@
1+
# Ra log compaction
2+
3+
This is a living document capturing current work on log compaction.
4+
5+
## Overview
6+
7+
8+
Compaction in Ra is intrinsically linked to the snapshotting
9+
feature. Standard Raft snapshotting removes all entries in the Ra log
10+
that precede the snapshot index, where the snapshot is a full representation of
11+
the state machine state.
12+
13+
The high-level idea of compacting in Ra is that instead of deleting all
14+
segment data that precedes the snapshot index, the snapshot data can emit a list
15+
of live raft indexes which will be kept, either in their original segments
16+
or written to new compacted segments. The data for these indexes can then
17+
be omitted from the snapshot to reduce its size and write amplification.
18+
19+
20+
### Log sections
21+
22+
Two named sections of the log then emerge.
23+
24+
#### Normal log section
25+
26+
The normal log section is the contiguous log that follows the last snapshot.
27+
28+
#### Compacting log section
29+
30+
The compacting log section consists of all live raft indexes that are lower
31+
than or equal to the last snapshot taken.
32+
33+
![compaction](compaction1.jpg)
34+
35+
36+
## Compaction phases
37+
38+
### Phase 1
39+
40+
Delete whole segments. This is the easiest and most efficient form of "compaction"
41+
and will run immediately after each snapshot is taken.
42+
43+
The run will start with the oldest segment and move towards the newest segment
44+
in the compacting log section. Every segment that has no entries in the live
45+
indexes list returned by the snapshot state will be deleted. Standard Raft
46+
log truncation is achieved by returning an empty list of live indexes.
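
A minimal sketch of the phase 1 run described above, written in Python for illustration only (Ra itself is Erlang). The `Segment` type and the `phase1_deletable` function are hypothetical names, and representing a segment by its first and last Raft index is an assumption. The newest segment is never considered, in line with the invariant discussed later in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    first: int  # first Raft index stored in the segment
    last: int   # last Raft index stored in the segment


def phase1_deletable(segments, snapshot_index, live_indexes):
    """Return the segments that can be deleted whole, oldest first."""
    live = set(live_indexes)
    ordered = sorted(segments, key=lambda s: s.first)
    deletable = []
    # Skip the newest segment: it may still be appended to.
    for seg in ordered[:-1]:
        if seg.last > snapshot_index:
            break  # reached the normal log section
        if not any(seg.first <= i <= seg.last for i in live):
            deletable.append(seg)
    return deletable
```

Passing an empty list of live indexes makes every segment in the compacting section deletable, which is the standard Raft truncation case.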
47+
48+
### Compacted segments: naming (phase 3 compaction)
49+
50+
Segment files in a Ra log have numeric names incremented as they are written.
51+
This is essential as the order is required to ensure log integrity.
52+
53+
Desired Properties of phase 3 compaction:
54+
55+
* Retain immutability: entries will never be deleted from a segment. Instead they
56+
will be written to a new segment.
57+
* Lexicographic sorting of file names needs to be consistent with the order of writes
58+
* Compaction walks from the oldest segment to the newest
59+
* Easy to recover after unclean shutdown
60+
61+
Segments will be compacted when 2 or more adjacent segments fit into a single
62+
segment.
63+
64+
The new segment will have the naming format `OLD-NEW.segment`
65+
66+
This means that a single segment can only be compacted once, e.g.
67+
`001.segment -> 001-001.segment` as after this there is no new name available
68+
and it has to wait until it can be compacted with the adjacent segment. Single
69+
segment compaction could be optional and only triggered when a substantial share,
70+
say 75% or more, of entries / data can be deleted.
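
The grouping and naming rules above could be sketched as follows, in Python for illustration only. All function names are hypothetical. Measuring "fits into a single segment" as a count of live entries against a `max_entries` limit is an assumption (it could equally be bytes), and so is the three-digit zero padding, which is taken from the examples in this document.

```python
def parse_range(name):
    """'003.segment' -> (3, 3); '002-004.segment' -> (2, 4)."""
    stem = name.rsplit(".", 1)[0]
    low, _, high = stem.partition("-")
    return int(low), int(high or low)


def compacted_name(sources, ext="segment", width=3):
    """Build the OLD-NEW name covering all source segments."""
    low = min(parse_range(s)[0] for s in sources)
    high = max(parse_range(s)[1] for s in sources)
    return f"{low:0{width}d}-{high:0{width}d}.{ext}"


def plan_groups(live_counts, max_entries):
    """Greedily group adjacent segments, walking oldest to newest.

    live_counts is a list of (segment name, live entry count) pairs,
    ordered oldest first. Only groups of 2 or more are returned.
    """
    groups, current, total = [], [], 0
    for name, count in live_counts:
        if current and total + count > max_entries:
            if len(current) >= 2:
                groups.append(current)
            current, total = [], 0
        current.append(name)
        total += count
    if len(current) >= 2:
        groups.append(current)
    return groups
```

Single segment compaction is not planned by `plan_groups`; it would be a separate decision based on the share of dead entries, using `compacted_name` with one source.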
71+
72+
This naming format means it is easy to identify dead segments after an unclean
73+
exit.
74+
75+
During compaction a different extension will be used: `002-004.compacting` and
76+
after an unclean shutdown any such files will be removed. Once synced it will be
77+
renamed to `.segment` and some time later the source files will be deleted (once
78+
the Ra server has updated its list of segments).
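
A possible recovery pass after an unclean shutdown, sketched in Python for illustration only. The rule used to identify dead segments (a segment is dead when another compacted segment's `OLD-NEW` range covers its range) is one reading of the naming scheme above and not a settled design. `recover` and `parse_range` are hypothetical names.

```python
def parse_range(name):
    """'003.segment' -> (3, 3); '002-004.segment' -> (2, 4)."""
    stem = name.rsplit(".", 1)[0]
    low, _, high = stem.partition("-")
    return int(low), int(high or low)


def recover(file_names):
    """Classify files as (partial, dead, live) after an unclean shutdown."""
    # Unfinished compaction output is always discarded.
    partial = sorted(n for n in file_names if n.endswith(".compacting"))
    segments = [n for n in file_names if n.endswith(".segment")]
    compacted = [n for n in segments if "-" in n]
    dead = []
    for name in segments:
        low, high = parse_range(name)
        for other in compacted:
            o_low, o_high = parse_range(other)
            # Assumption: a covering compacted segment supersedes this one.
            if other != name and o_low <= low and high <= o_high:
                dead.append(name)
                break
    live = sorted(n for n in segments if n not in dead)
    return partial, sorted(dead), live
```

Files in `partial` and `dead` can be removed; `live` is the list of segments the log is rebuilt from.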
79+
80+
#### When does phase 3 compaction run?
81+
82+
Options:
83+
84+
* On a timer
85+
* After phase 1 if needed based on a ratio of live to dead indexes in the compacting section
86+
* After phase 1 if needed based on disk use / ratio of live data to dead.
87+
88+
![segments](compaction2.jpg)
89+
90+
### Ra Server log worker responsibilities
91+
92+
* Write checkpoints and snapshots
93+
* Perform compaction runs
94+
* Report segments to be deleted back to the ra server (NB: the worker does
95+
not perform the segment deletion itself, it needs to report changes back to the
96+
ra server first). The ra server log worker maintains its own list of segments
97+
to avoid double processing.
98+
99+
100+
```mermaid
101+
sequenceDiagram
102+
participant segment-writer
103+
participant ra-server
104+
participant ra-server-log
105+
106+
segment-writer--)ra-server: new segments
107+
ra-server-)+ra-server-log: new segments
108+
ra-server-log->>ra-server-log: phase 1 compaction
109+
ra-server-log-)-ra-server: segment changes (new, to be deleted)
110+
ra-server-)+ra-server-log: new snapshot
111+
ra-server-log->>ra-server-log: write snapshot
112+
ra-server-log->>ra-server-log: phase 1 compaction
113+
ra-server-log-)-ra-server: snapshot written, segment changes
114+
```
115+
116+
#### Impact on segment writer process
117+
118+
The segment writer process as well as the WAL relies heavily on the
119+
`ra_log_snapshot_state` table to avoid writing data that is no longer
120+
needed. This table contains the latest snapshot index for every
121+
ra server in the system. In order to do the same for a compacting state machine
122+
the segment writer would need access to the list of live indexes when flushing
123+
the WAL to segments.
124+
125+
Options:
126+
127+
* It could read the latest snapshot to find out the live indexes
128+
* Live indexes could be stored in the `ra_log_snapshot_state` table for easy access.
129+
130+
Snapshots can be taken ahead of the segment part of the log, meaning that the
131+
segment writer and log worker may modify the log at different times. To allow
132+
this there needs to be an invariant that the log worker never marks the last
133+
segment for deletion as it may have been appended to after or concurrently
134+
to when the log worker evaluated its state.
135+
136+
The segment writer can query the `ra_log_snapshot_state` table to see if the server
137+
is using compaction (i.e. has preceding live entries) and if so read the
138+
live indexes from the snapshot directory (however it is stored). Then it
139+
can proceed writing any live indexes in the compacting section as well as
140+
contiguous entries in the normal log section.
141+
142+
143+
Segment range: (1, 1000)
144+
145+
Memtable range: (1001, 2000)
146+
147+
Snapshot: 1500, live indexes [1501, 1999],
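
The selection the segment writer would make when flushing a memtable range can be sketched as below, in Python for illustration only. `indexes_to_flush` is a hypothetical name, and the live index values used in the usage note are made up rather than taken from the example above.

```python
def indexes_to_flush(memtable_range, snapshot_index, live_indexes):
    """Pick the memtable indexes that should be written to segments."""
    first, last = memtable_range
    # Compacting section: only live indexes at or below the snapshot index.
    upper = min(last, snapshot_index)
    live = sorted(i for i in set(live_indexes) if first <= i <= upper)
    # Normal section: every contiguous entry after the snapshot index.
    normal = list(range(max(first, snapshot_index + 1), last + 1))
    return live + normal
```

For a memtable range of (1001, 2000), a snapshot at 1500 and hypothetical live indexes 1200 and 1400, this returns 1200, 1400 and then every index from 1501 to 2000.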
148+
149+
150+
Alt: if the log worker / Ra server is alive the segment writer could call into
151+
the log worker and ask it to do the log flush and thus make easy use of the
152+
live indexes list. If the Ra server is not running but is still registered
153+
the segment writer will flush all entries (if compacting), even those preceding
154+
the last snapshot index. This option minimises changes to the segment writer but could
155+
delay flush _if_ log worker is busy (doing compaction perhaps) when
156+
the flush request comes in.
157+
158+
159+
160+
### Snapshot replication
161+
162+
With the snapshot now defined as the snapshot state + live preceding raft indexes
163+
the default snapshot replication approach will need to change.
164+
165+
The snapshot sender (Ra log worker??) needs to negotiate with the follower to
166+
discover which preceding raft indexes the follower does not yet have. Then it would
167+
go on and replicate these before or after (??) sending the snapshot itself.
168+
169+
T: probably before as if a new snapshot has been taken locally we'd most likely
170+
skip some raft index replication on the second attempt.
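
A rough sketch of the negotiation outcome, in Python for illustration only, assuming the "entries before snapshot" ordering suggested above. How the follower reports the indexes it already holds is not decided here, so `follower_has` is a hypothetical input and `replication_plan` a hypothetical name.

```python
def replication_plan(snapshot_index, leader_live, follower_has):
    """Return the ordered steps needed to bring a follower up to date."""
    # Only live indexes the follower does not yet have are replicated.
    missing = sorted(set(leader_live) - set(follower_has))
    steps = [("entry", i) for i in missing]
    # Assumption: the snapshot itself is sent last.
    steps.append(("snapshot", snapshot_index))
    return steps
```

If the attempt is restarted against a newer snapshot, any entries already transferred show up in `follower_has` and are skipped on the second run.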
171+
172+
#### How to store live indexes with snapshot
173+
174+
* New section in snapshot file format.
175+
* Separate file (that can be rebuilt if needed from the snapshot).
176+
177+
178+
### WAL impact
179+
180+
The WAL will use the `ra_log_snapshot_state` to avoid writing entries that are
181+
lower than a server's last snapshot index. This is an important optimisation
182+
that can help the WAL catch up in cases where it is working through a long mailbox
183+
backlog.
184+
185+
`ra_log_snapshot_state` is going to have to be extended to not just store
186+
the snapshot index but also the machine's smallest live raft index (at time of
187+
snapshot) such that the WAL can use that to reduce write workload instead of
188+
the snapshot index.
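
The extended check could look roughly like this, in Python for illustration only. `wal_should_write` is a hypothetical name, and falling back to the snapshot index when a machine reports no live indexes is an assumption. The comparison follows the "lower than" wording above, so an entry exactly at the threshold is still written.

```python
def wal_should_write(index, snapshot_index, smallest_live_index=None):
    """Decide whether the WAL needs to write the entry at `index`."""
    if smallest_live_index is None:
        # No live indexes reported: behave as today.
        threshold = snapshot_index
    else:
        threshold = min(snapshot_index, smallest_live_index)
    # Entries lower than the threshold are no longer needed.
    return index >= threshold
```

This is a coarse filter: dead indexes between the smallest live index and the snapshot index are still written and are left for compaction to remove.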

docs/internals/compaction1.jpg

35.8 KB

docs/internals/compaction2.jpg

52 KB
