Mine parallel corpora with embeddings
Tokenizes and normalizes all sentences with language-specific scripts, lowercases the text, and applies BPE.
./preprocess/prep.sh -l {language} -f {input_file} | gzip > {output_file}
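The real prep.sh applies language-specific tokenization and BPE on top of the normalization; as a minimal, hypothetical sketch of just the normalization and compressed-output steps (the `normalize` and `preprocess_to_gzip` names are illustrative, not from the repo):

```python
import gzip

def normalize(sentence: str) -> str:
    # Lowercase and collapse runs of whitespace into single spaces.
    # The real prep.sh additionally tokenizes per language and applies BPE.
    return " ".join(sentence.lower().split())

def preprocess_to_gzip(sentences, path):
    # Mirror `prep.sh ... | gzip > output`: write normalized lines gzip-compressed.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for s in sentences:
            f.write(normalize(s) + "\n")

print(normalize("Hello   World !"))  # hello world !
```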
Loads the given t7 files and converts them into a compressed format that can be read as a stream.
cat {list_of_input_files} | python ./read_t7.py
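The exact on-disk format produced by read_t7.py is not specified here; as one plausible sketch of a streamable vector encoding (length-prefixed float32 records — an assumption, not the script's actual format):

```python
import io
import struct

def vectors_to_stream(vectors, out):
    # Write each vector as a little-endian uint32 length followed by
    # that many float32 values, so a consumer can read record by record.
    for vec in vectors:
        out.write(struct.pack("<I", len(vec)))
        out.write(struct.pack(f"<{len(vec)}f", *vec))

def stream_to_vectors(inp):
    # Inverse: read length-prefixed float32 records until EOF.
    vectors = []
    while True:
        header = inp.read(4)
        if not header:
            return vectors
        (n,) = struct.unpack("<I", header)
        vectors.append(list(struct.unpack(f"<{n}f", inp.read(4 * n))))

buf = io.BytesIO()
vectors_to_stream([[1.0, 2.0], [3.0, 4.0, 5.0]], buf)
buf.seek(0)
print(stream_to_vectors(buf))  # [[1.0, 2.0], [3.0, 4.0, 5.0]]
```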
Builds a faiss index from the input stream. -s sets the number of sentences one index can hold; when that size is exceeded, the current index is written to output_folder and a new index is started.
cd index
mkdir -p build && cd build
cmake .. && make -j 5
zcat {embeddings_file}.gz | ./bin/build_index -o {output_folder} -s {index_size}
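The size-capped sharding behaviour selected by -s can be sketched as follows (a plain list stands in for the real faiss index; `build_sharded_indexes` is a hypothetical name for illustration):

```python
def build_sharded_indexes(vector_stream, index_size):
    # Accumulate vectors until the current shard holds index_size of them,
    # then emit it and start a new shard, mirroring build_index's -s option.
    shards, current = [], []
    for vec in vector_stream:
        current.append(vec)
        if len(current) == index_size:
            shards.append(current)
            current = []
    if current:
        shards.append(current)  # flush the final, possibly partial shard
    return shards

shards = build_sharded_indexes(range(7), index_size=3)
print([len(s) for s in shards])  # [3, 3, 1]
```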
Loads an index from index_file and performs a k-nearest-neighbor search, returning the k nearest index vectors for each input vector. -b sets the batch size, i.e. the number of vectors queried at the same time.
cd index
mkdir -p build && cd build
cmake .. && make -j 5
zcat {embeddings_file}.gz | ./bin/query_index -i {index_file} -k {k-best} -b {batch_size} > {output_file}
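The batched k-nearest-neighbor query can be illustrated with a brute-force stand-in for the faiss search (L2 distance; `knn_query` is a hypothetical helper, not the tool's API):

```python
import numpy as np

def knn_query(index_vectors, queries, k, batch_size):
    # For each batch of query vectors, return the indices of the k nearest
    # index vectors by squared L2 distance, as query_index's -k/-b options do.
    index_vectors = np.asarray(index_vectors, dtype=np.float32)
    results = []
    for start in range(0, len(queries), batch_size):
        batch = np.asarray(queries[start:start + batch_size], dtype=np.float32)
        # Pairwise squared L2 distances between the batch and the index.
        dists = ((batch[:, None, :] - index_vectors[None, :, :]) ** 2).sum(-1)
        results.extend(np.argsort(dists, axis=1)[:, :k].tolist())
    return results

index = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
print(knn_query(index, [[0.9, 0.1]], k=2, batch_size=8))  # [[1, 0]]
```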