This repository contains the code for the paper Accelerating Graph Indexing for ANNS on Modern CPUs.
This code builds upon hnswlib and Eigen.
File format: The data used in this project is formatted in .fvecs
and .ivecs
. Each file contains multiple lines of vectors; each line begins with an integer float
in .fvecs
files and integer
in .ivecs
files. For more information, visit this site.
Below are the datasets mentioned in the paper:
Datasets | Data Volume | Size (GB) | Dim. | Query Volume |
---|---|---|---|---|
SIFT (for test) (link) | 1,000,000 | 0.157 | 128 | 10,000 |
ARGILLA (link) | 21,071,228 | 81 | 1,024 | 100,000 |
ANTON (link) | 19,399,177 | 75 | 1,024 | 100,000 |
LAION (link) | 100,000,000 | 293 | 768 | 100,000 |
IMAGENET (link) | 13,158,856 | 38 | 768 | 100,000 |
COHERE (link) | 10,124,929 | 30 | 768 | 100,000 |
DATACOMP (link) | 12,800,000 | 37 | 768 | 100,000 |
BIGCODE (link) | 10,404,628 | 30 | 768 | 100,000 |
SSNPP (link) | 1,000,000,000 | 960 | 256 | 100,000 |
The following is an example of the file organization for the SIFT
dataset:
HNSW_Flash
├─data
│ ├─sift
│ │ ├─sift_base.fvecs
│ │ ├─sift_query.fvecs
│ │ └─sift_groundtruth.ivecs
│ ├─...
Below are the strategies implemented in this repository. The strategy names are used to select different approaches when running the code.
Strategy | Description |
---|---|
flash | A high-performance implementation of HNSW, optimized for fast search in large-scale datasets. Utilizes techniques specific to Flash for efficient distance calculations. |
hnsw | A standard implementation of Hierarchical Navigable Small World (HNSW) graph, based on the version in nmslib/hnswlib. This is the core algorithm for approximate nearest neighbor search. |
nsg | An implementation of Navigating Spreading-out Graph (NSG). |
nsg-flash | Base on NSG algorithm, utilizes the same techniques as Flash for efficient distance calculations. |
pca-sdc | Extends HNSW by applying Principal Component Analysis (PCA) for dimensionality reduction. Uses SDC (Simultaneous Distance Calculation) for efficient query processing, improving both speed and accuracy. |
pq-adc | Combines HNSW with Product Quantization (PQ). Uses ADC (Asymmetric Distance Calculation) to perform queries efficiently by comparing encoded vectors with raw data. Suitable for scenarios where memory usage is a concern. |
pq-sdc | Similar to pq-adc , but uses SDC for query processing, balancing speed and accuracy. Ideal for large datasets where encoding and fast computation are critical. |
sq-adc | Integrates HNSW with Scalar Quantization (SQ). Uses ADC for distance calculations, enabling fast search with compressed data. This method is useful for low-memory environments. |
sq-sdc | A variation of sq-adc , utilizing SDC for querying. Offers a trade-off between accuracy and efficiency, particularly useful for datasets with high dimensionality. |
taumg | An implementation of τ-Monotonic Graph (τ-MG). |
taumg-flash | Base on τ-MG algorithm, utilizes the same techniques as Flash for efficient distance calculations. |
- cmake >= 3.15
- g++ >= 11.4 with C++17 support
- CPU with SSE/AVX/AVX2/AVX512 support
The algorithms automatically check if the index has been generated and saved in the file system. if not, they will generate and save the index before running the query; otherwise, they will load the existing index file and run the query.
The SIFT
dataset has been pre-configured to run with all strategies:
make build
make run flash sift
For other datasets, you can run the code using the following format:
cd ./bin && ./main [dataset_name] [strategy_name]
For more details on running the code, please refer to ./main.cc
and ./Makefile
.
All parameters for the strategies can be modified in ./include/core.h
. Please recompile the program after making any changes.