There are two broad categories of ANN index:
Graph-based indexes tend to be simpler to implement and faster, but more importantly they can be constructed and updated incrementally. This makes them a much better fit for a general-purpose index than partitioning approaches that only work on static datasets that are completely specified up front. That is why all the major commercial vector indexes use graph approaches.
JVector is a graph index that merges the DiskANN and HNSW family trees.
JVector borrows the hierarchical structure from HNSW, and uses Vamana (the algorithm behind DiskANN) within each layer.
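The layer-descent idea can be sketched in a few lines. The following is an illustrative, self-contained toy (class, method, and data names are all invented, not JVector's API): greedy search walks each upper layer to a local minimum, then drops down a layer and repeats until it reaches the bottom.

```java
import java.util.*;

// Toy HNSW-style layer descent: greedily move to the closest neighbor at
// each layer, then drop down. Illustrative only; not JVector's API.
public class LayerDescent {
    // layers.get(L) maps node -> neighbors at layer L; layer 0 is the bottom.
    static int greedySearch(List<Map<Integer, int[]>> layers, double[][] vectors,
                            double[] query, int entryPoint) {
        int current = entryPoint;
        for (int level = layers.size() - 1; level >= 0; level--) {
            boolean improved = true;
            while (improved) {
                improved = false;
                for (int nbr : layers.get(level).getOrDefault(current, new int[0])) {
                    if (dist(vectors[nbr], query) < dist(vectors[current], query)) {
                        current = nbr;
                        improved = true;
                    }
                }
            }
        }
        return current;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    static int demo() {
        double[][] vectors = { {0, 0}, {1, 0}, {2, 0}, {3, 0} };
        // Layer 1 links 0 <-> 2 (a long-range shortcut); layer 0 is the chain 0-1-2-3.
        Map<Integer, int[]> layer1 = Map.of(0, new int[]{2}, 2, new int[]{0});
        Map<Integer, int[]> layer0 = Map.of(0, new int[]{1}, 1, new int[]{0, 2},
                                            2, new int[]{1, 3}, 3, new int[]{2});
        return greedySearch(List.of(layer0, layer1), vectors, new double[]{2.9, 0}, 0);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // → 3, the node nearest to (2.9, 0)
    }
}
```

The upper-layer shortcut lets the search skip directly from node 0 to node 2 before fine-grained navigation at the bottom layer.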
## JVector Architecture
JVector is a graph-based index that builds on the HNSW and DiskANN designs with composable extensions.
JVector implements a multi-layer graph with nonblocking concurrency control, allowing construction to scale linearly with the number of cores:
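One common way to get nonblocking concurrency for graph construction is copy-on-write neighbor arrays swapped in with compare-and-set; the sketch below is a hypothetical stand-in for that idea, not JVector's actual implementation:

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicReference;
import java.util.stream.IntStream;

// Sketch of nonblocking adjacency updates: each node's neighbor list is an
// immutable array swapped in via compareAndSet, so concurrent inserters
// retry instead of blocking. Illustrative only; not JVector's implementation.
public class NonblockingAdjacency {
    final AtomicReference<int[]>[] neighbors;

    @SuppressWarnings("unchecked")
    NonblockingAdjacency(int nodeCount) {
        neighbors = new AtomicReference[nodeCount];
        for (int i = 0; i < nodeCount; i++) neighbors[i] = new AtomicReference<>(new int[0]);
    }

    // Lock-free: retry the copy-on-write update until our CAS wins.
    void addEdge(int from, int to) {
        while (true) {
            int[] old = neighbors[from].get();
            int[] updated = Arrays.copyOf(old, old.length + 1);
            updated[old.length] = to;
            if (neighbors[from].compareAndSet(old, updated)) return;
        }
    }

    static int demo() {
        NonblockingAdjacency g = new NonblockingAdjacency(1);
        // Many threads add edges to node 0 concurrently; no update is lost.
        IntStream.range(0, 1000).parallel().forEach(i -> g.addEdge(0, i));
        return g.neighbors[0].get().length;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // → 1000
    }
}
```

Because losers of a CAS race simply retry against the new array, throughput scales with cores instead of serializing on a lock.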
The upper layers of the hierarchy are represented by an in-memory adjacency list per node. This allows for quick navigation without any disk I/O.
The bottom layer of the graph is represented by an on-disk adjacency list per node. JVector uses additional data stored inline to support two-pass searches, with the first pass powered by lossily compressed representations of the vectors kept in memory, and the second by a more accurate representation read from disk. The first pass can be performed with:
* Product quantization (PQ), optionally with [anisotropic weighting](https://arxiv.org/abs/1908.10396)
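As a toy illustration of the two-pass idea (not JVector's PQ code; a simple scalar quantizer stands in for PQ codebooks, and all names are invented): pass 1 ranks every candidate by a cheap lossy score held in memory, and pass 2 reranks only the top few with exact distances, standing in for the on-disk reads.

```java
import java.util.*;
import java.util.stream.IntStream;

// Two-pass search sketch: lossy scoring first, exact reranking second.
public class TwoPassSearch {
    static double exactDist(double[] a, double[] q) {
        double s = 0;
        for (int i = 0; i < q.length; i++) s += (a[i] - q[i]) * (a[i] - q[i]);
        return s;
    }

    // Lossy score: each coordinate rounded to one decimal, standing in for
    // the compact compressed representation kept in memory.
    static double approxDist(double[] a, double[] q) {
        double s = 0;
        for (int i = 0; i < q.length; i++) {
            double ai = Math.round(a[i] * 10) / 10.0; // quantized coordinate
            s += (ai - q[i]) * (ai - q[i]);
        }
        return s;
    }

    static int search(double[][] vectors, double[] query, int rerankK) {
        // Pass 1: order all candidates by the cheap lossy score.
        Integer[] ids = IntStream.range(0, vectors.length).boxed().toArray(Integer[]::new);
        Arrays.sort(ids, Comparator.comparingDouble(i -> approxDist(vectors[i], query)));
        // Pass 2: rerank only the top rerankK candidates exactly.
        int best = ids[0];
        for (int k = 0; k < Math.min(rerankK, ids.length); k++)
            if (exactDist(vectors[ids[k]], query) < exactDist(vectors[best], query))
                best = ids[k];
        return best;
    }

    public static void main(String[] args) {
        double[][] vectors = { {0.0, 0.0}, {0.52, 0.0}, {0.49, 0.0} };
        // Vectors 1 and 2 tie under the lossy score; exact reranking breaks the tie.
        System.out.println(search(vectors, new double[]{0.5, 0.0}, 2)); // → 2
    }
}
```

The example shows why the second pass matters: the quantizer cannot distinguish 0.49 from 0.52, but the exact rerank can.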
```java
                1.2f, // allow degree overflow during construction by this factor
                1.2f, // relax neighbor diversity requirement by this factor (alpha)
                true)) // use a hierarchical index
{
    // build the index (in memory)
    OnHeapGraphIndex index = builder.build(ravv);
```
Commentary:
* For the overflow Builder parameter, the sweet spot is about 1.2 for in-memory construction and 1.5 for on-disk. (The more overflow is allowed, the fewer recomputations of best edges are required, but the more neighbors will be consulted in every search.)
* The alpha parameter controls the tradeoff between edge distance and diversity; usually 1.2 is sufficient for high-dimensional vectors; 2.0 is recommended for 2D or 3D datasets. See [the DiskANN paper](https://suhasjs.github.io/files/diskann_neurips19.pdf) for more details.
* The Bits parameter to GraphSearcher is intended for controlling your resultset based on external predicates and won’t be used in this tutorial.
* Setting the addHierarchy parameter to true builds a multi-layer index. This approach has proven more robust in highly challenging scenarios.
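The alpha-controlled diversity rule described above can be sketched as follows. This is an illustrative toy of DiskANN-style pruning (names invented, not JVector's code): candidates are scanned in order of distance from the node p, and a candidate c is dropped when some already-kept neighbor n satisfies alpha * dist(n, c) <= dist(p, c), i.e. n already "covers" c's direction. A larger alpha prunes less, keeping a denser graph.

```java
import java.util.*;

// Sketch of alpha-based diversity pruning. Illustrative only.
public class RobustPrune {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static List<Integer> prune(double[][] v, int p, List<Integer> candidates,
                               double alpha, int maxDegree) {
        // Consider candidates closest-first.
        candidates.sort(Comparator.comparingDouble(c -> dist(v[p], v[c])));
        List<Integer> kept = new ArrayList<>();
        for (int c : candidates) {
            boolean covered = false;
            for (int n : kept)
                if (alpha * dist(v[n], v[c]) <= dist(v[p], v[c])) { covered = true; break; }
            if (!covered) kept.add(c);
            if (kept.size() == maxDegree) break;
        }
        return kept;
    }

    public static void main(String[] args) {
        // Nodes 1 and 2 lie in the same direction from node 0; node 3 does not.
        double[][] v = { {0, 0}, {1, 0}, {2, 0}, {0, 2} };
        // Node 2 is pruned (covered by node 1); node 3 survives as a diverse edge.
        System.out.println(prune(v, 0, new ArrayList<>(List.of(1, 2, 3)), 1.2, 4));
    }
}
```

With 2D data like this, a higher alpha (the 2.0 recommended above) would retain node 2 as well, trading diversity for shorter edges.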
#### Step 2: more control over GraphSearcher
* Embedding models produce output from a consistent distribution of vectors. This means that you can save and re-use ProductQuantization codebooks, even for a different set of vectors, as long as you had a sufficiently large training set to build it the first time around. ProductQuantization.MAX_PQ_TRAINING_SET_SIZE (128,000 vectors) has proven to be sufficiently large.
* JDK ThreadLocal objects cannot be referenced except from the thread that created them. This is a difficult design into which to fit caching of Closeable objects like GraphSearcher. JVector provides the ExplicitThreadLocal class to solve this.
* Fused ADC is only compatible with Product Quantization, not Binary Quantization. This is no great loss since [very few models generate embeddings that are best suited for BQ](https://thenewstack.io/why-vector-size-matters/). That said, BQ continues to be supported with non-Fused indexes.
* JVector heavily utilizes the Panama Vector API (SIMD) for ANN indexing and search. We have seen cases where memory bandwidth is saturated during indexing and product quantization, which can cause the process to slow down. To avoid this, the batch methods for index and PQ builds use a [PhysicalCoreExecutor](https://javadoc.io/doc/io.github.jbellis/jvector/latest/io/github/jbellis/jvector/util/PhysicalCoreExecutor.html) to limit parallelism to the physical core count. The default value is 1/2 the processor count seen by Java. This may not be correct in all setups (e.g. no hyperthreading or hybrid architectures), so if you wish to override the default, use the `-Djvector.physical_core_count` property, or pass in your own ForkJoinPool instance.
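The ExplicitThreadLocal point above can be illustrated with a simplified stand-in (this is not JVector's actual class; the name PerThreadPool and its shape are invented): per-thread values live in a map the owner can iterate, so Closeable resources cached per thread can be released from any thread.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch of the ExplicitThreadLocal idea: thread-keyed values that the
// owner can enumerate and close. Simplified stand-in, not JVector's class.
public class PerThreadPool<T> implements AutoCloseable {
    private final ConcurrentHashMap<Long, T> values = new ConcurrentHashMap<>();
    private final Supplier<T> initial;

    public PerThreadPool(Supplier<T> initial) { this.initial = initial; }

    // Each thread lazily gets (and then reuses) its own instance.
    public T get() {
        return values.computeIfAbsent(Thread.currentThread().getId(), id -> initial.get());
    }

    @Override
    public void close() {
        // Unlike JDK ThreadLocal, every thread's value is reachable here,
        // so Closeable resources can be released from any thread.
        for (T v : values.values())
            if (v instanceof AutoCloseable c) {
                try { c.close(); } catch (Exception ignored) { }
            }
        values.clear();
    }

    static String demo() {
        try (PerThreadPool<StringBuilder> pool = new PerThreadPool<>(StringBuilder::new)) {
            pool.get().append("reused per thread");
            return pool.get().toString(); // same instance on the same thread
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // → reused per thread
    }
}
```

A real cached searcher would follow the same pattern, with the pooled value being a GraphSearcher rather than a StringBuilder.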