
Commit 9a8b34e

mlomeli1 authored and facebook-github-bot committed
Offline IVF powered by faiss big batch search (facebookresearch#3175)
Summary: This PR introduces the offline IVF (OIVF) framework, which contains tooling to run search using IVFPQ indexes (plus OPQ pretransforms) for large batches of queries using [big_batch_search](https://github.com/mlomeli1/faiss/blob/main/contrib/big_batch_search.py) and GPU faiss. See the [README](https://github.com/mlomeli1/faiss/blob/oivf/demos/offline_ivf/README.md) for details about using this framework.

This PR includes the following unit tests, which can be run with the unittest library as so:

```
~/faiss/demos/offline_ivf$ python3 -m unittest tests/test_iterate_input.py -k test_iterate_back
```

In test_offline_ivf:

```
test_consistency_check
test_train_index
test_index_shard_equal_file_sizes
test_index_shard_unequal_file_sizes
test_search
test_evaluate_without_margin
test_evaluate_without_margin_OPQ
test_evaluate_with_margin
test_split_batch_size_bigger_than_file_sizes
test_split_batch_size_smaller_than_file_sizes
test_split_files_with_corrupted_input_file
```

In test_iterate_input:

```
test_iterate_input_file_larger_than_batch
test_get_vs_iterate
test_iterate_back
```

Pull Request resolved: facebookresearch#3175

Reviewed By: algoriddle

Differential Revision: D52218447

Pulled By: mlomeli1

fbshipit-source-id: 78b12457c79b02eb2c9ae993560f2e295798e7e5
1 parent be12427 commit 9a8b34e

34 files changed: +2647 −0 lines

demos/offline_ivf/README.md

+52
# Offline IVF

This folder contains the code for the offline IVF algorithm powered by faiss big batch search.

Create a conda env:

`conda create --name oivf python=3.10`

`conda activate oivf`

`conda install -c pytorch/label/nightly -c nvidia faiss-gpu=1.7.4`

`conda install tqdm`

`conda install pyyaml`

`conda install -c conda-forge submitit`

## Run book

1. Optionally shard your dataset (see create_sharded_dataset.py) and create the corresponding yaml file `config_ssnpp.yaml`. You can use `generate_config.py` by specifying the root directory of your dataset and the files with the data shards:

`python generate_config.py`

2. Run the train index command:

`python run.py --command train_index --config config_ssnpp.yaml --xb ssnpp_1B`

3. Run the index-shard command so that it produces the sharded indexes required for the search step:

`python run.py --command index_shard --config config_ssnpp.yaml --xb ssnpp_1B`

4. Send jobs to the cluster to run search:

`python run.py --command search --config config_ssnpp.yaml --xb ssnpp_1B --cluster_run --partition <PARTITION-NAME>`

Remarks about the `search` command: it is assumed that the database vectors are also the query vectors when performing the search step.

a. If the query vectors are different from the database vectors, they should be passed via the `--xq` argument.

b. A new dataset needs to be prepared (step 1) before passing it to the query vectors argument `--xq`:

`python run.py --command search --config config_ssnpp.yaml --xb ssnpp_1B --xq <QUERIES_DATASET_NAME>`

5. We can always run the consistency-check for sanity checks:

`python run.py --command consistency_check --config config_ssnpp.yaml --xb ssnpp_1B`
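The search step streams the queries through the index in chunks rather than all at once, which is the core idea behind big batch search. The sketch below illustrates that batching pattern only; a numpy brute-force L2 scan stands in for the IVFPQ index, and the function name `search_batched` is illustrative, not part of this repo or of faiss.

```python
import numpy as np

def search_batched(xb, xq, k, batch_size):
    """Search xq against xb chunk by chunk, concatenating results.

    A brute-force squared-L2 scan stands in for a real faiss index here.
    """
    all_D, all_I = [], []
    for i0 in range(0, len(xq), batch_size):
        batch = xq[i0:i0 + batch_size]
        # (nq_chunk, nb) distance matrix for this chunk only
        dist = ((batch[:, None, :] - xb[None, :, :]) ** 2).sum(-1)
        I = np.argsort(dist, axis=1)[:, :k]          # k nearest ids per query
        D = np.take_along_axis(dist, I, axis=1)      # their distances
        all_D.append(D)
        all_I.append(I)
    return np.vstack(all_D), np.vstack(all_I)

rng = np.random.default_rng(0)
xb = rng.standard_normal((1000, 16)).astype("float32")
# As in the README's default setup, the database vectors are also the queries.
D, I = search_batched(xb, xb, k=5, batch_size=128)
```

Because queries equal the database here, each vector's nearest neighbor is itself at distance 0, which makes a convenient sanity check on the batching logic.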

demos/offline_ivf/__init__.py

Whitespace-only changes.

demos/offline_ivf/config_ssnpp.yaml

+109
```yaml
d: 256
output: /checkpoint/marialomeli/offline_faiss/ssnpp
index:
  prod:
  - 'IVF8192,PQ128'
  non-prod:
  - 'IVF16384,PQ128'
  - 'IVF32768,PQ128'
nprobe:
  prod:
  - 512
  non-prod:
  - 256
  - 128
  - 1024
  - 2048
  - 4096
  - 8192
k: 50
index_shard_size: 50000000
query_batch_size: 50000000
evaluation_sample: 10000
training_sample: 1572864
datasets:
  ssnpp_1B:
    root: /checkpoint/marialomeli/ssnpp_data
    size: 1000000000
    files:
    - dtype: uint8
      format: npy
      name: ssnpp_0000000000.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000001.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000002.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000003.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000004.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000005.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000006.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000007.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000008.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000009.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000010.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000011.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000012.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000013.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000014.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000015.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000016.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000017.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000018.npy
      size: 50000000
    - dtype: uint8
      format: npy
      name: ssnpp_0000000019.npy
      size: 50000000
```
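A config of this shape carries an internal invariant: the shard file sizes should add up to the declared dataset size. The sketch below checks that with plain dicts standing in for the parsed yaml; the `check_dataset` helper and the specific rule it enforces are illustrative assumptions, not the framework's actual validation code.

```python
# Plain-dict stand-in for the parsed config_ssnpp.yaml above.
config = {
    "index_shard_size": 50_000_000,
    "query_batch_size": 50_000_000,
    "datasets": {
        "ssnpp_1B": {
            "size": 1_000_000_000,
            "files": [
                {"name": f"ssnpp_{i:010}.npy", "size": 50_000_000,
                 "dtype": "uint8", "format": "npy"}
                for i in range(20)
            ],
        }
    },
}

def check_dataset(cfg, name):
    """File sizes must sum to the declared dataset size."""
    ds = cfg["datasets"][name]
    total = sum(f["size"] for f in ds["files"])
    if total != ds["size"]:
        raise ValueError(f"{name}: files sum to {total}, expected {ds['size']}")
    return total

total = check_dataset(config, "ssnpp_1B")
```

In the real setup the dicts would come from `yaml.safe_load` on the config file, and a failed check would surface a truncated or missing shard before any expensive indexing work starts.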
demos/offline_ivf/create_sharded_dataset.py

+63

```python
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import numpy as np
import argparse
import os


def xbin_mmap(fname, dtype, maxn=-1):
    """
    Code from
    https://github.com/harsha-simhadri/big-ann-benchmarks/blob/main/benchmark/dataset_io.py#L94
    mmap the competition file format for a given type of items
    """
    n, d = map(int, np.fromfile(fname, dtype="uint32", count=2))
    assert os.stat(fname).st_size == 8 + n * d * np.dtype(dtype).itemsize
    if maxn > 0:
        n = min(n, maxn)
    return np.memmap(fname, dtype=dtype, mode="r", offset=8, shape=(n, d))


def main(args: argparse.Namespace):
    ssnpp_data = xbin_mmap(fname=args.filepath, dtype="uint8")
    num_batches = ssnpp_data.shape[0] // args.data_batch
    assert (
        ssnpp_data.shape[0] % args.data_batch == 0
    ), "num of embeddings per file should divide total num of embeddings"
    for i in range(num_batches):
        xb_batch = ssnpp_data[
            i * args.data_batch : (i + 1) * args.data_batch, :
        ]
        filename = args.output_dir + f"/ssnpp_{i:010}.npy"
        np.save(filename, xb_batch)
        print(f"File {filename} is saved!")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--data_batch",
        dest="data_batch",
        type=int,
        default=50000000,
        help="Number of embeddings per file, should be a divisor of 1B",
    )
    parser.add_argument(
        "--filepath",
        dest="filepath",
        type=str,
        default="/datasets01/big-ann-challenge-data/FB_ssnpp/FB_ssnpp_database.u8bin",
        help="path of 1B ssnpp database vectors' original file",
    )
    parser.add_argument(
        "--output_dir",
        dest="output_dir",
        type=str,
        default="/checkpoint/marialomeli/ssnpp_data",
        help="path to put sharded files",
    )

    args = parser.parse_args()
    main(args)
```
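The sharding loop above can be exercised end to end on a tiny array. This sketch (all sizes are made up for illustration) saves shards with `np.save` into a temporary directory and reloads the first one to confirm that shape, dtype, and contents survive the round trip:

```python
import tempfile

import numpy as np

# Toy stand-ins: 100 "embeddings" of dim 4, split into shards of 25.
rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=(100, 4), dtype="uint8")
data_batch = 25

with tempfile.TemporaryDirectory() as out_dir:
    num_batches = data.shape[0] // data_batch
    for i in range(num_batches):
        # Same zero-padded naming scheme as the script above.
        np.save(
            f"{out_dir}/ssnpp_{i:010}.npy",
            data[i * data_batch : (i + 1) * data_batch],
        )
    # Reload the first shard and check it matches the source slice.
    shard0 = np.load(f"{out_dir}/ssnpp_{0:010}.npy")
```

Note that `np.save` materializes each slice of the memmap, so the full 1B-vector source file never needs to fit in memory; only one shard is resident at a time.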
