Skip to content

Source code for the paper: "Accelerating Graph Indexing for ANNS on Modern CPUs"

Notifications You must be signed in to change notification settings

ZJU-DAILY/HNSW-Flash

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HNSW-Flash

This repository contains the code for the paper Accelerating Graph Indexing for ANNS on Modern CPUs.

This code builds upon hnswlib and Eigen.

Prerequisite

Datasets

File format: The data used in this project is formatted in .fvecs and .ivecs. Each file contains multiple lines of vectors; each line begins with an integer $d$, representing the dimension of the vector, followed by $d$ numbers as vector data. The number are of type float in .fvecs files and integer in .ivecs files. For more information, visit this site.

Below are the datasets mentioned in the paper:

Datasets Data Volume Size (GB) Dim. Query Volume
SIFT (for test) (link) 1,000,000 0.157 128 10,000
ARGILLA (link) 21,071,228 81 1,024 100,000
ANTON (link) 19,399,177 75 1,024 100,000
LAION (link) 100,000,000 293 768 100,000
IMAGENET (link) 13,158,856 38 768 100,000
COHERE (link) 10,124,929 30 768 100,000
DATACOMP (link) 12,800,000 37 768 100,000
BIGCODE (link) 10,404,628 30 768 100,000
SSNPP (link) 1,000,000,000 960 256 100,000

The following is an example of the file organization for the SIFT dataset:

HNSW_Flash
├─data
│  ├─sift
│  │  ├─sift_base.fvecs
│  │  ├─sift_query.fvecs
│  │  └─sift_groundtruth.ivecs
│  ├─...

Strategies

Below are the strategies implemented in this repository. The strategy names are used to select different approaches when running the code.

Strategy Description
flash A high-performance implementation of HNSW, optimized for fast search in large-scale datasets. Utilizes techniques specific to Flash for efficient distance calculations.
hnsw A standard implementation of Hierarchical Navigable Small World (HNSW) graph, based on the version in nmslib/hnswlib. This is the core algorithm for approximate nearest neighbor search.
nsg An implementation of Navigating Spreading-out Graph (NSG).
nsg-flash Base on NSG algorithm, utilizes the same techniques as Flash for efficient distance calculations.
pca-sdc Extends HNSW by applying Principal Component Analysis (PCA) for dimensionality reduction. Uses SDC (Simultaneous Distance Calculation) for efficient query processing, improving both speed and accuracy.
pq-adc Combines HNSW with Product Quantization (PQ). Uses ADC (Asymmetric Distance Calculation) to perform queries efficiently by comparing encoded vectors with raw data. Suitable for scenarios where memory usage is a concern.
pq-sdc Similar to pq-adc, but uses SDC for query processing, balancing speed and accuracy. Ideal for large datasets where encoding and fast computation are critical.
sq-adc Integrates HNSW with Scalar Quantization (SQ). Uses ADC for distance calculations, enabling fast search with compressed data. This method is useful for low-memory environments.
sq-sdc A variation of sq-adc, utilizing SDC for querying. Offers a trade-off between accuracy and efficiency, particularly useful for datasets with high dimensionality.
taumg An implementation of τ-Monotonic Graph (τ-MG).
taumg-flash Base on τ-MG algorithm, utilizes the same techniques as Flash for efficient distance calculations.

Environment

  • cmake >= 3.15
  • g++ >= 11.4 with C++17 support
  • CPU with SSE/AVX/AVX2/AVX512 support

Reproduction

The algorithms automatically check if the index has been generated and saved in the file system. if not, they will generate and save the index before running the query; otherwise, they will load the existing index file and run the query.

The SIFT dataset has been pre-configured to run with all strategies:

make build
make run flash sift

For other datasets, you can run the code using the following format:

cd ./bin && ./main [dataset_name] [strategy_name]

For more details on running the code, please refer to ./main.cc and ./Makefile.

All parameters for the strategies can be modified in ./include/core.h. Please recompile the program after making any changes.

About

Source code for the paper: "Accelerating Graph Indexing for ANNS on Modern CPUs"

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages