Commit ac4317d

Bump extsort (#31)
1 parent c5014c3 commit ac4317d

13 files changed: +677 -212 lines changed

CHANGELOG.md

+19-1
Original file line numberDiff line numberDiff line change
@@ -1,16 +1,34 @@
11
# Changelog
2+
23
All notable changes to this project will be documented in this file.
34

45
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
56
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
67

8+
## [0.7.0] - 2024-03-24
9+
10+
- Potentially breaking: It is now possible to pass the key and value size at
11+
index creation and opening via type parameters. Both type parameters still use
12+
u16 as default value if not specified, which makes it backward compatible.
13+
14+
## [0.6.1] - 2024-02-25
15+
16+
- Fix: delete tmp directory when external sorting is used.
17+
18+
## [0.6.0] - 2024-02-19
19+
20+
- Potentially breaking: an empty index is now supported instead of causing a failure.
21+
722
## [0.5.0] - 2022-08-02
23+
824
- Breaking: renamed `Encodable` to `Serialize`
925
- Serde serialization wrapper
1026

1127
## [0.4.0] - 2020-12-23
28+
1229
### Changed
30+
1331
- Breaking: cleaner `Encodable` trait ([PR #6](https://github.com/appaquet/extindex-rs/pull/6/files#diff-3dcefa956e75e2171b83e5134b542405a2adb7909a16dc03fad7fd92e8e2d945L11))
1432
- Moved to `memmap2` as `memmap` isn't supported anymore.
1533
- Moved to `tempfile` as `tempdir` isn't supported anymore.
16-
- Upgrade to `extsort` 0.4
34+
- Upgrade to `extsort` 0.4
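
The 0.7.0 entry says the key and value size types default to `u16` when not specified. The following is a minimal sketch of how defaulted type parameters can keep existing call sites compiling. `SketchBuilder`, `SizeField`, `KS` and `VS` are hypothetical names chosen for illustration and are not the crate's actual API.

```rust
use std::marker::PhantomData;

/// Hypothetical trait for an integer type used to store an encoded length.
trait SizeField {
    /// Largest length (in bytes) that this size type can represent.
    const MAX_LEN: u64;
}

impl SizeField for u16 {
    const MAX_LEN: u64 = u16::MAX as u64;
}

impl SizeField for u32 {
    const MAX_LEN: u64 = u32::MAX as u64;
}

/// Hypothetical stand-in for a builder (assumption: NOT the real
/// `extindex::Builder`). `KS` and `VS` are the key and value size types,
/// and both default to `u16`.
struct SketchBuilder<K, V, KS = u16, VS = u16> {
    _types: PhantomData<(K, V, KS, VS)>,
}

impl<K, V, KS: SizeField, VS: SizeField> SketchBuilder<K, V, KS, VS> {
    /// Longest key that fits in the chosen key size type.
    fn max_key_len() -> u64 {
        KS::MAX_LEN
    }

    /// Longest value that fits in the chosen value size type.
    fn max_value_len() -> u64 {
        VS::MAX_LEN
    }
}

/// Code written before the size parameters existed names only `K` and `V`;
/// the defaults fill in `u16` for both sizes.
type LegacyBuilder = SketchBuilder<String, String>;

/// New code can opt in to a wider value size.
type WideBuilder = SketchBuilder<String, Vec<u8>, u16, u32>;
```

In this sketch, `LegacyBuilder::max_value_len()` is bounded by `u16::MAX`, while `WideBuilder::max_value_len()` is bounded by `u32::MAX`.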

Cargo.toml

+11-5
@@ -7,7 +7,7 @@ license = "Apache-2.0"
77
name = "extindex"
88
readme = "README.md"
99
repository = "https://github.com/appaquet/extindex-rs"
10-
version = "0.6.0"
10+
version = "0.7.0"
1111

1212
[lib]
1313
bench = false
@@ -17,19 +17,25 @@ default = ["serde"]
1717
serde = ["dep:serde", "bincode"]
1818

1919
[dependencies]
20-
bincode = {version = "2.0.0-rc.3", features = ["serde"], optional = true}
20+
bincode = { version = "2.0.0-rc.3", features = ["serde"], optional = true }
2121
byteorder = "1.5"
22-
extsort = "0.4"
22+
extsort = "0.5"
2323
log = "0.4"
2424
memmap2 = "0.9"
25-
serde = {version = "1.0", optional = true}
25+
serde = { version = "1.0", optional = true }
2626
smallvec = "1.13.1"
2727

2828
[dev-dependencies]
2929
criterion = "0.5"
30-
skeptic = "0.13"
3130
tempfile = "3.10"
3231

32+
[build-dependencies]
33+
skeptic = "0.13"
34+
35+
[[bench]]
36+
harness = false
37+
name = "builder"
38+
3339
[[bench]]
3440
harness = false
3541
name = "reader"

README.md

+17-15
@@ -1,21 +1,23 @@
1-
extindex
1+
# extindex
2+
23
[![crates.io](https://img.shields.io/crates/v/extindex.svg)](https://crates.io/crates/extindex)
3-
=========
44

5-
Immutable persisted index (on disk) that can be built in one pass using a sorted iterator, or can
6-
use [extsort](https://crates.io/crates/extsort) to externally sort the iterator first, and
7-
then build the index from it.
5+
Immutable persisted index (on disk) that can be built in one pass using a sorted
6+
iterator, or can use [extsort](https://crates.io/crates/extsort) to externally
7+
sort the iterator first, and then build the index from it.
88

9-
The index allows random lookups and sorted scans. An indexed entry consists of a key and a value.
10-
The key needs to implement `Eq` and `Ord`, and both the key and values need to implement a
11-
`Serializable` trait for serialization to and from disk.
9+
The index allows random lookups and sorted scans. An indexed entry consists of a
10+
key and a value. The key needs to implement `Eq` and `Ord`, and both the key
11+
and values need to implement a `Serializable` trait for serialization to and
12+
from disk. It is possible to rely on the [`serde`](https://crates.io/crates/serde)
13+
library to implement this trait for most types.
1214

13-
The index is built using a skip list like data structure, but in which lookups are starting from
14-
the end of the index instead of from the beginning. This allow building the index in a single
15-
pass on a sorted iterator, since starting from the beginning would require knowing
16-
checkpoints/nodes ahead in the file.
15+
The index is built using a skip list-like data structure, but lookups start from
16+
the end of the index instead of the beginning. This allows building the index in
17+
a single pass on a sorted iterator, as starting from the beginning would require
18+
knowing checkpoints/nodes ahead in the file.
1719

18-
# Example <!-- keep in sync with serde_struct.rs -->
20+
## Example
1921

2022
```rust
2123
extern crate extindex;
@@ -48,6 +50,6 @@ fn main() {
4850
}
4951
```
5052

51-
# TODO
53+
## Roadmap
5254

53-
- [ ] Possibility to use Bloom filter to prevent hitting the disk when index doesn't have a key
55+
- Possibility to use a Bloom filter to avoid disk access when the index does not contain a key.
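
The README explains that lookups start from the end of the index so that the index can be built in a single pass. The sketch below illustrates that idea under simplifying assumptions: a `Vec` stands in for the file, there is a single level of checkpoints rather than a multi-level skip list, and all names (`Record`, `SketchIndex`, `build_index`, `scan_block`) are hypothetical. It is not the crate's on-disk format.

```rust
/// One record of the simulated index file (a `Vec` stands in for the file).
#[derive(Debug)]
enum Record {
    /// Marks the start of a block. `prev` is the position of the previous
    /// checkpoint, which has already been written when this one is emitted.
    Checkpoint { prev: Option<usize> },
    Entry { key: u32, value: String },
}

#[derive(Debug)]
struct SketchIndex {
    records: Vec<Record>,
    /// Position of the last checkpoint; a real file could store it in a footer.
    last_checkpoint: Option<usize>,
}

/// Builds the index in one pass over entries sorted by key. Each checkpoint
/// only points backwards, so no future position is ever needed.
fn build_index<I>(sorted: I, interval: usize) -> SketchIndex
where
    I: IntoIterator<Item = (u32, String)>,
{
    assert!(interval > 0, "interval must be positive");
    let mut records = Vec::new();
    let mut last_checkpoint = None;
    for (n, (key, value)) in sorted.into_iter().enumerate() {
        if n % interval == 0 {
            records.push(Record::Checkpoint { prev: last_checkpoint });
            last_checkpoint = Some(records.len() - 1);
        }
        records.push(Record::Entry { key, value });
    }
    SketchIndex { records, last_checkpoint }
}

impl SketchIndex {
    /// Walks checkpoints from the end towards the beginning until it finds
    /// a block whose first key is <= `target`, then scans that block.
    fn get(&self, target: u32) -> Option<&str> {
        let mut cursor = self.last_checkpoint;
        while let Some(pos) = cursor {
            let prev = match &self.records[pos] {
                Record::Checkpoint { prev } => *prev,
                Record::Entry { .. } => return None,
            };
            let block = &self.records[pos + 1..];
            match block.first() {
                Some(Record::Entry { key, .. }) if *key <= target => {
                    return scan_block(block, target);
                }
                _ => cursor = prev,
            }
        }
        None
    }
}

/// Scans forward through one block, stopping at the next checkpoint or at
/// the first key greater than `target`.
fn scan_block(block: &[Record], target: u32) -> Option<&str> {
    for record in block {
        match record {
            Record::Entry { key, value } if *key == target => return Some(value.as_str()),
            Record::Entry { key, .. } if *key > target => return None,
            Record::Entry { .. } => {}
            Record::Checkpoint { .. } => return None,
        }
    }
    None
}
```

For example, `build_index((0..10u32).map(|i| (i * 2, format!("v{}", i))), 3)` writes a checkpoint before every third entry, and `get(8)` walks back two checkpoints before scanning the block that starts at key 6. A forward-linked design would instead need each checkpoint to know where the next one will land, which is not known until later entries are written.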

benches/builder.rs

+128
@@ -0,0 +1,128 @@
1+
// Copyright 2018 Andre-Philippe Paquet
2+
//
3+
// Licensed under the Apache License, Version 2.0 (the "License");
4+
// you may not use this file except in compliance with the License.
5+
// You may obtain a copy of the License at
6+
//
7+
// http://www.apache.org/licenses/LICENSE-2.0
8+
//
9+
// Unless required by applicable law or agreed to in writing, software
10+
// distributed under the License is distributed on an "AS IS" BASIS,
11+
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
// See the License for the specific language governing permissions and
13+
// limitations under the License.
14+
15+
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
16+
17+
use std::{
18+
io::{Read, Write},
19+
time::Duration,
20+
};
21+
22+
use extindex::{Builder, Entry, Serializable};
23+
24+
fn bench_builder(c: &mut Criterion) {
25+
let mut group = c.benchmark_group("builder");
26+
group.sample_size(10);
27+
group.measurement_time(Duration::from_secs(9));
28+
group.sampling_mode(criterion::SamplingMode::Flat);
29+
group.warm_up_time(Duration::from_millis(100));
30+
31+
let sizes = [10_000, 100_000, 1_000_000];
32+
for size in sizes {
33+
group.bench_with_input(BenchmarkId::new("known size", size), &size, |b, size| {
34+
b.iter(|| {
35+
let index_file = tempfile::NamedTempFile::new().unwrap();
36+
let index_file = index_file.path();
37+
38+
let builder = Builder::new(index_file);
39+
builder.build(create_known_size_entries(*size)).unwrap();
40+
});
41+
});
42+
43+
group.bench_with_input(BenchmarkId::new("unknown size", size), &size, |b, size| {
44+
b.iter(|| {
45+
let index_file = tempfile::NamedTempFile::new().unwrap();
46+
let index_file = index_file.path();
47+
48+
let builder = Builder::new(index_file);
49+
builder.build(create_unknown_size_entries(*size)).unwrap();
50+
});
51+
});
52+
}
53+
}
54+
55+
fn create_known_size_entries(
56+
nb_entries: usize,
57+
) -> impl Iterator<Item = Entry<SizedString, SizedString>> {
58+
(0..nb_entries).map(|idx| {
59+
Entry::new(
60+
SizedString(format!("key:{}", idx)),
61+
SizedString(format!("val:{}", idx)),
62+
)
63+
})
64+
}
65+
66+
fn create_unknown_size_entries(
67+
nb_entries: usize,
68+
) -> impl Iterator<Item = Entry<UnsizedString, UnsizedString>> {
69+
(0..nb_entries).map(|idx| {
70+
Entry::new(
71+
UnsizedString(format!("key:{}", idx)),
72+
UnsizedString(format!("val:{}", idx)),
73+
)
74+
})
75+
}
76+
77+
#[derive(Ord, PartialOrd, Eq, PartialEq, Debug)]
78+
struct SizedString(String);
79+
80+
impl Serializable for SizedString {
81+
fn size(&self) -> Option<usize> {
82+
Some(self.0.as_bytes().len())
83+
}
84+
85+
fn serialize<W: Write>(&self, write: &mut W) -> Result<(), std::io::Error> {
86+
write.write_all(self.0.as_bytes()).map(|_| ())
87+
}
88+
89+
fn deserialize<R: Read>(data: &mut R, size: usize) -> Result<SizedString, std::io::Error> {
90+
let mut bytes = vec![0u8; size];
91+
data.read_exact(&mut bytes)?;
92+
Ok(SizedString(String::from_utf8_lossy(&bytes).to_string()))
93+
}
94+
}
95+
96+
impl std::fmt::Display for SizedString {
97+
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
98+
write!(f, "{}", self.0)
99+
}
100+
}
101+
102+
#[derive(Ord, PartialOrd, Eq, PartialEq, Debug)]
103+
pub struct UnsizedString(pub String);
104+
105+
impl Serializable for UnsizedString {
106+
fn size(&self) -> Option<usize> {
107+
None
108+
}
109+
110+
fn serialize<W: Write>(&self, write: &mut W) -> Result<(), std::io::Error> {
111+
write.write_all(self.0.as_bytes()).map(|_| ())
112+
}
113+
114+
fn deserialize<R: Read>(data: &mut R, size: usize) -> Result<UnsizedString, std::io::Error> {
115+
let mut bytes = vec![0u8; size];
116+
data.read_exact(&mut bytes)?;
117+
Ok(UnsizedString(String::from_utf8_lossy(&bytes).to_string()))
118+
}
119+
}
120+
121+
impl std::fmt::Display for UnsizedString {
122+
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
123+
write!(f, "{}", self.0)
124+
}
125+
}
126+
127+
criterion_group!(benches, bench_builder,);
128+
criterion_main!(benches);

benches/reader.rs

+3-75
@@ -21,39 +21,8 @@ use std::{
2121

2222
use extindex::{Builder, Entry, Reader, Serializable};
2323

24-
fn bench_index_builder(c: &mut Criterion) {
25-
let mut group = c.benchmark_group("Builder");
26-
group.sample_size(10);
27-
group.measurement_time(Duration::from_secs(9));
28-
group.sampling_mode(criterion::SamplingMode::Flat);
29-
group.warm_up_time(Duration::from_millis(100));
30-
31-
let sizes = [10_000, 100_000, 1_000_000];
32-
for size in sizes {
33-
group.bench_with_input(BenchmarkId::new("known size", size), &size, |b, size| {
34-
b.iter(|| {
35-
let index_file = tempfile::NamedTempFile::new().unwrap();
36-
let index_file = index_file.path();
37-
38-
let builder = Builder::new(index_file);
39-
builder.build(create_known_size_entries(*size)).unwrap();
40-
});
41-
});
42-
43-
group.bench_with_input(BenchmarkId::new("unknown size", size), &size, |b, size| {
44-
b.iter(|| {
45-
let index_file = tempfile::NamedTempFile::new().unwrap();
46-
let index_file = index_file.path();
47-
48-
let builder = Builder::new(index_file);
49-
builder.build(create_unknown_size_entries(*size)).unwrap();
50-
});
51-
});
52-
}
53-
}
54-
5524
fn bench_random_access(c: &mut Criterion) {
56-
let mut group = c.benchmark_group("RandomAccess1million");
25+
let mut group = c.benchmark_group("random_access");
5726
group.sample_size(10);
5827
group.measurement_time(Duration::from_secs(7));
5928
group.sampling_mode(criterion::SamplingMode::Flat);
@@ -86,7 +55,7 @@ fn bench_random_access(c: &mut Criterion) {
8655
}
8756

8857
fn bench_iter(c: &mut Criterion) {
89-
let mut group = c.benchmark_group("Iter1million");
58+
let mut group = c.benchmark_group("iter_1million");
9059
group.sample_size(10);
9160
group.measurement_time(Duration::from_secs(7));
9261
group.sampling_mode(criterion::SamplingMode::Flat);
@@ -139,17 +108,6 @@ fn bench_iter(c: &mut Criterion) {
139108
});
140109
}
141110

142-
fn create_known_size_entries(
143-
nb_entries: usize,
144-
) -> impl Iterator<Item = Entry<SizedString, SizedString>> {
145-
(0..nb_entries).map(|idx| {
146-
Entry::new(
147-
SizedString(format!("key:{}", idx)),
148-
SizedString(format!("val:{}", idx)),
149-
)
150-
})
151-
}
152-
153111
fn create_unknown_size_entries(
154112
nb_entries: usize,
155113
) -> impl Iterator<Item = Entry<UnsizedString, UnsizedString>> {
@@ -161,31 +119,6 @@ fn create_unknown_size_entries(
161119
})
162120
}
163121

164-
#[derive(Ord, PartialOrd, Eq, PartialEq, Debug)]
165-
struct SizedString(String);
166-
167-
impl Serializable for SizedString {
168-
fn size(&self) -> Option<usize> {
169-
Some(self.0.as_bytes().len())
170-
}
171-
172-
fn serialize<W: Write>(&self, write: &mut W) -> Result<(), std::io::Error> {
173-
write.write_all(self.0.as_bytes()).map(|_| ())
174-
}
175-
176-
fn deserialize<R: Read>(data: &mut R, size: usize) -> Result<SizedString, std::io::Error> {
177-
let mut bytes = vec![0u8; size];
178-
data.read_exact(&mut bytes)?;
179-
Ok(SizedString(String::from_utf8_lossy(&bytes).to_string()))
180-
}
181-
}
182-
183-
impl std::fmt::Display for SizedString {
184-
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
185-
write!(f, "{}", self.0)
186-
}
187-
}
188-
189122
#[derive(Ord, PartialOrd, Eq, PartialEq, Debug)]
190123
pub struct UnsizedString(pub String);
191124

@@ -211,10 +144,5 @@ impl std::fmt::Display for UnsizedString {
211144
}
212145
}
213146

214-
criterion_group!(
215-
benches,
216-
bench_index_builder,
217-
bench_random_access,
218-
bench_iter
219-
);
147+
criterion_group!(benches, bench_random_access, bench_iter);
220148
criterion_main!(benches);

build.rs

+20
@@ -0,0 +1,20 @@
1+
// Copyright 2018 Andre-Philippe Paquet
2+
//
3+
// Licensed under the Apache License, Version 2.0 (the "License");
4+
// you may not use this file except in compliance with the License.
5+
// You may obtain a copy of the License at
6+
//
7+
// http://www.apache.org/licenses/LICENSE-2.0
8+
//
9+
// Unless required by applicable law or agreed to in writing, software
10+
// distributed under the License is distributed on an "AS IS" BASIS,
11+
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
// See the License for the specific language governing permissions and
13+
// limitations under the License.
14+
15+
extern crate skeptic;
16+
17+
fn main() {
18+
// generates doc tests for `README.md`.
19+
skeptic::generate_doc_tests(&["README.md"]);
20+
}
