add RegexPhraseQuery #2516

PSeitz · 2024-10-10T09:10:57Z

RegexPhraseQuery supports phrase queries with regex. It supports regex
and wildcards. E.g. a query with wildcards:
"b* b* wolf" matches "big bad wolf"
Slop is supported as well:
"b* wolf"~2 matches "big bad wolf"

Regex queries may match a lot of terms, and we still need to
keep track of which term hit in order to load its positions.
The phrase query algorithm groups terms by their frequency
in the union, so that groups can be prefiltered early.

This PR comes with some new datastructures:

SimpleUnion - A union docset for a list of docsets. It doesn't do any
caching and is therefore well suited for datasets with lots of skipping.
(phrase search, but intersections in general)

LoadedPostings - Like SegmentPostings, but all docs and positions are loaded in
memory. SegmentPostings uses 1840 bytes per instance with its caches,
which is equivalent to 460 docids.
LoadedPostings is used for terms which have fewer than 100 docs.
LoadedPostings is only used to reduce memory consumption.
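
The "460 docids" figure follows from dividing the quoted instance size by the size of one doc id. A minimal sketch of that arithmetic, assuming doc ids are 32-bit integers (the constant and function names are made up for illustration):

```rust
use std::mem::size_of;

/// Doc ids are assumed to be 32-bit integers in this sketch.
type DocId = u32;

/// Size of one `SegmentPostings` instance, as quoted in the description.
const SEGMENT_POSTINGS_BYTES: usize = 1840;

/// Number of doc ids that fit in the memory taken by one instance.
fn docids_per_segment_postings() -> usize {
    SEGMENT_POSTINGS_BYTES / size_of::<DocId>()
}
```

Positions need memory on top of the doc ids, so this only reproduces the quoted equivalence; it does not derive the 100-doc threshold.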

BitSetPostingUnion - Creates a Posting that uses the bitset for docid
hits and the docsets for positions. The BitSet is the precalculated
union of the docsets.
In the RegexPhraseQuery there is a size limit of 512 docsets per PreAggregatedUnion
before a new one is created.
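
The size limit can be pictured as plain chunking. This is a rough sketch only: `group_docsets` and `MAX_DOCSETS_PER_UNION` are hypothetical names, `T` stands in for whatever postings type is used, and the frequency-based grouping mentioned earlier is left out.

```rust
/// Limit quoted in the description: at most 512 docsets per union.
const MAX_DOCSETS_PER_UNION: usize = 512;

/// Splits a list of docsets into groups of at most `MAX_DOCSETS_PER_UNION`.
/// A new group is started whenever the current one is full.
fn group_docsets<T>(docsets: Vec<T>) -> Vec<Vec<T>> {
    let mut groups = Vec::new();
    let mut current = Vec::new();
    for docset in docsets {
        if current.len() == MAX_DOCSETS_PER_UNION {
            // The current group is full: store it and start a new one.
            groups.push(std::mem::take(&mut current));
        }
        current.push(docset);
    }
    if !current.is_empty() {
        groups.push(current);
    }
    groups
}
```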

Renamed Union to BufferedUnionScorer to better differentiate from other unions.
Added proptests to test different union types.

Comparison With Lucene

https://github.com/quickwit-oss/search-benchmark-game/tree/regex_phrase_query

Below is a benchmark from the search benchmark game on the wikipedia dataset.
Lucene does not seem to support Regex Phrase Queries with Slop, so it's not part of the benchmark.

Interestingly the slower "grad* ma*ent admission test" is due to the regex evaluation on the FST.

src/query/union/mod.rs

fulmicoton · 2024-10-11T02:23:11Z

src/query/union/mod.rs

+            .cloned()
+            .max()
+            .unwrap_or(0);
+        let mut doc_bitset = BitSet::with_max_value(max_doc + 1);


Suggested change

let mut doc_bitset = BitSet::with_max_value(max_doc + 1);

let mut doc_bitset = BitSet::with_max_value(max_doc + 1);

Why max_doc + 1?

BitSet excludes max_doc, hence the + 1. We may want to change that; it could be a source of bugs.
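
To illustrate why an exclusive bound needs the `+ 1`, here is a toy bitset with the same convention. `TinyBitSet` and `union_bitset` are hypothetical and written for this example; they are not tantivy's `BitSet`.

```rust
/// Toy bitset whose `max_value` is an exclusive upper bound.
struct TinyBitSet {
    words: Vec<u64>,
    max_value: u32, // valid values are 0..max_value
}

impl TinyBitSet {
    fn with_max_value(max_value: u32) -> Self {
        let num_words = (max_value as usize + 63) / 64;
        TinyBitSet { words: vec![0; num_words], max_value }
    }

    /// Returns false when `value` is out of range instead of panicking.
    fn insert(&mut self, value: u32) -> bool {
        if value >= self.max_value {
            return false;
        }
        self.words[(value / 64) as usize] |= 1u64 << (value % 64);
        true
    }

    fn contains(&self, value: u32) -> bool {
        value < self.max_value
            && self.words[(value / 64) as usize] & (1u64 << (value % 64)) != 0
    }
}

/// Builds the union bitset for a list of doc ids, sized with `max_doc + 1`
/// so that the largest doc id itself still fits.
fn union_bitset(doc_ids: &[u32]) -> TinyBitSet {
    let max_doc = doc_ids.iter().cloned().max().unwrap_or(0);
    let mut bitset = TinyBitSet::with_max_value(max_doc + 1);
    for &doc in doc_ids {
        bitset.insert(doc);
    }
    bitset
}
```

Sizing the set with `max_doc` alone would silently drop the largest doc id in this toy version, which is the kind of bug the reply is worried about.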

src/query/union/mod.rs

fulmicoton · 2024-10-11T02:34:47Z

src/postings/postings.rs

@@ -12,16 +12,33 @@ use crate::docset::DocSet;
 /// for merging segments or for testing.
 pub trait Postings: DocSet + 'static {
    /// The number of times the term appears in the document.
-    fn term_freq(&self) -> u32;
+    fn term_freq(&mut self) -> u32;


What requires this?

In the BitSetPostingUnion, we need to seek in the docset list, since we arrive at a hit from the bitset, the docsets are then scanned for the positions.

    impl<TDocSet: Postings> Postings for BitSetPostingUnion<TDocSet> {
        fn term_freq(&mut self) -> u32 {
            let curr_doc = self.bitset.doc();
            let mut term_freq = 0;
            for docset in &mut self.docsets {
                if docset.doc() < curr_doc {
                    docset.seek(curr_doc);
                }
                if docset.doc() == curr_doc {
                    term_freq += docset.term_freq();
                }
            }
            term_freq
        }
    }

And you don't want to make that update on .advance() calls, because this computation is somewhat expensive, and you want it to be lazy (e.g. term_query and regexquery), or is there a different reason?

Yes, advancing all the docsets is quite expensive, so it's only done once we have a hit. In advance() we only move the bitset forward; the bitset is an aggregation of the docsets in the union.
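
The lazy behaviour described in this exchange can be reproduced in a self-contained toy. Everything below is an assumption made for illustration: `ToyPostings` and `LazyUnion` are hypothetical types, a sorted `Vec<u32>` stands in for the bitset, and `TERMINATED` is a sentinel picked for this sketch.

```rust
/// Sentinel meaning "no more docs" in this sketch.
const TERMINATED: u32 = u32::MAX;

/// Toy postings cursor over sorted (doc, term_freq) pairs.
struct ToyPostings {
    entries: Vec<(u32, u32)>,
    cursor: usize,
}

impl ToyPostings {
    fn doc(&self) -> u32 {
        self.entries.get(self.cursor).map_or(TERMINATED, |e| e.0)
    }
    /// Moves forward to the first doc >= target.
    fn seek(&mut self, target: u32) {
        while self.doc() < target {
            self.cursor += 1;
        }
    }
    fn term_freq(&self) -> u32 {
        self.entries[self.cursor].1
    }
}

/// Union driven by a precomputed list of doc ids (standing in for the
/// bitset). The underlying postings are only touched on demand.
struct LazyUnion {
    union_docs: Vec<u32>,
    pos: usize,
    docsets: Vec<ToyPostings>,
}

impl LazyUnion {
    fn doc(&self) -> u32 {
        self.union_docs.get(self.pos).copied().unwrap_or(TERMINATED)
    }
    /// Cheap: only the precomputed union moves forward.
    fn advance(&mut self) -> u32 {
        self.pos += 1;
        self.doc()
    }
    /// Expensive: every docset is caught up to the current doc.
    fn term_freq(&mut self) -> u32 {
        let curr_doc = self.doc();
        if curr_doc == TERMINATED {
            return 0;
        }
        let mut term_freq = 0;
        for docset in &mut self.docsets {
            if docset.doc() < curr_doc {
                docset.seek(curr_doc);
            }
            if docset.doc() == curr_doc {
                term_freq += docset.term_freq();
            }
        }
        term_freq
    }
}
```

Calling `advance` several times in a row leaves the cursors of the underlying postings untouched; they only catch up when `term_freq` is requested for a hit.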

src/index/inverted_index_reader.rs

src/postings/loaded_postings.rs

src/query/phrase_query/mod.rs

fulmicoton · 2024-10-11T02:56:51Z

src/query/phrase_query/phrase_weight.rs

@@ -50,27 +50,14 @@ impl PhraseWeight {
            .map(|similarity_weight| similarity_weight.boost_by(boost));
        let fieldnorm_reader = self.fieldnorm_reader(reader)?;
        let mut term_postings_list = Vec::new();
-        if reader.has_deletes() {


I think the point of this was to avoid paying for the position check if we have deletes.

I saw that read_postings and read_postings_no_deletes are the same. The bitset is passed as a parameter, so this seems to be an outdated API. I'll remove it in a follow-up.

    pub fn read_postings(
        &self,
        term: &Term,
        option: IndexRecordOption,
    ) -> io::Result<Option<SegmentPostings>> {
        self.get_term_info(term)?
            .map(move |term_info| self.read_postings_from_terminfo(&term_info, option))
            .transpose()
    }

    pub(crate) fn read_postings_no_deletes(
        &self,
        term: &Term,
        option: IndexRecordOption,
    ) -> io::Result<Option<SegmentPostings>> {
        self.get_term_info(term)?
            .map(|term_info| self.read_postings_from_terminfo(&term_info, option))
            .transpose()
    }

can we open a ticket to clean this up?

src/core/tests.rs

fulmicoton · 2024-10-21T06:29:26Z

src/postings/loaded_postings.rs

+///
+/// It exists mainly to reduce memory usage.
+/// `SegmentPostings` uses 1840 bytes per instance due to its caches.
+/// It you need to keep many terms around with few docs, it's cheaper to load all the


this comment was super helpful by the way

src/postings/loaded_postings.rs

fulmicoton · 2024-10-21T06:36:09Z

src/postings/loaded_postings.rs

+    }
+
+    fn doc(&self) -> DocId {
+        if self.cursor == self.doc_ids.len() {


it makes one want to add an extra SENTINEL value at the end of the doc_ids array, to remove the if statement

I changed it to >=, so it takes over the job of the bounds check below
(btw. some branch predictors have performance problems with very dense if statements, e.g. 3 branches in 16bytes of instructions)
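
A sketch of the `>=` variant mentioned in the reply, on a hypothetical in-memory postings type. `InMemoryPostings` is not the PR's `LoadedPostings`, and the `TERMINATED` sentinel is defined locally for this example.

```rust
/// Sentinel returned once the cursor has run past the last doc id.
const TERMINATED: u32 = i32::MAX as u32;

/// Hypothetical postings list that keeps all doc ids in memory.
struct InMemoryPostings {
    doc_ids: Vec<u32>,
    cursor: usize,
}

impl InMemoryPostings {
    fn doc(&self) -> u32 {
        // `>=` covers every out-of-range cursor, so the indexing below
        // can never fail and the check doubles as the bounds check.
        if self.cursor >= self.doc_ids.len() {
            return TERMINATED;
        }
        self.doc_ids[self.cursor]
    }

    fn advance(&mut self) -> u32 {
        self.cursor += 1;
        self.doc()
    }
}
```

With `==`, a cursor that overshoots the length would fall through to the indexing and rely on its bounds check; with `>=` a single comparison handles both cases.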

src/postings/loaded_postings.rs

src/postings/segment_postings.rs

fulmicoton · 2024-10-21T06:50:00Z

src/query/phrase_query/regex_phrase_weight.rs

+    max_expansions: u32,
+    /// wildcard_mode is true if the query is interpeted as wildcard query instead of regex.
+    /// e.g. wol*
+    wildcard_mode: bool,


nitpick: I think wildcard mode is an anti feature.

true, I'll move the wildcard conversion to a utility function
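
Such a utility could look roughly like the following. This is a hypothetical helper written for illustration, not the function that ended up in the PR: it turns `*` into `.*` and escapes the other common regex metacharacters.

```rust
/// Converts a wildcard pattern such as `wol*` into a regex string.
/// `*` becomes `.*`; other regex metacharacters are escaped.
fn wildcard_to_regex(pattern: &str) -> String {
    let mut regex = String::with_capacity(pattern.len() + 4);
    for c in pattern.chars() {
        match c {
            '*' => regex.push_str(".*"),
            '.' | '+' | '?' | '(' | ')' | '[' | ']' | '{' | '}' | '|' | '^' | '$' | '\\' => {
                regex.push('\\');
                regex.push(c);
            }
            _ => regex.push(c),
        }
    }
    regex
}
```

Keeping the conversion outside the query means the query itself only ever deals with regexes, which is the point of dropping the wildcard mode flag.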

src/query/automaton_weight.rs

src/query/term_query/term_scorer.rs

fulmicoton · 2024-10-21T06:59:53Z

src/query/union/bitset_union.rs

+/// terms, but need to keep the docsets for the postings.
+pub struct BitSetPostingUnion<TDocSet> {
+    /// The docsets are required to load positions
+    docsets: Vec<RefCell<TDocSet>>,


Why not

Suggested change

-    docsets: Vec<RefCell<TDocSet>>,
+    docsets: RefCell<Vec<TDocSet>>,

?

ah that makes much more sense. The performance difference should be gone with this.
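
The difference between the two layouts can be seen in a small stand-alone comparison. `Counter`, `PerItem` and `Outer` are hypothetical types for this sketch; the real code holds docsets rather than counters.

```rust
use std::cell::RefCell;

struct Counter {
    value: u32,
}

/// One `RefCell` per element: a borrow flag and a runtime check per item.
struct PerItem {
    items: Vec<RefCell<Counter>>,
}

/// One `RefCell` around the whole vector: a single flag and a single check.
struct Outer {
    items: RefCell<Vec<Counter>>,
}

impl PerItem {
    fn bump_and_sum(&self) -> u32 {
        let mut total = 0;
        for item in &self.items {
            let mut item = item.borrow_mut(); // checked on every iteration
            item.value += 1;
            total += item.value;
        }
        total
    }
}

impl Outer {
    fn bump_and_sum(&self) -> u32 {
        let mut items = self.items.borrow_mut(); // checked once
        let mut total = 0;
        for item in items.iter_mut() {
            item.value += 1;
            total += item.value;
        }
        total
    }
}
```

Both versions allow mutation behind `&self`, but the outer `RefCell` keeps the elements contiguous without a borrow flag next to each one and pays for the runtime check only once per call.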

fulmicoton · 2024-10-21T07:03:50Z

src/query/union/simple_union.rs

+/// Unlike `BufferedUnion`, it doesn't do any horizon precomputation.
+///
+/// For that reason SimpleUnion is a good choice for queries that skip a lot.
+pub struct SimpleUnion<TDocSet> {


BTW PISA does that for union and outperforms tantivy and lucene

I did some benchmarks some time ago and the intersections + union are faster, but the pure unions are slower with this variant
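
For readers who want to try the unbuffered approach in isolation, here is a toy version over sorted in-memory lists. `SortedDocs` and `ToySimpleUnion` are hypothetical names and `TERMINATED` is a sentinel chosen for this sketch; only the shape of the `advance` loop follows the diff in this PR.

```rust
/// Sentinel meaning "no more docs" in this sketch.
const TERMINATED: u32 = u32::MAX;

/// Cursor over a sorted list of doc ids.
struct SortedDocs {
    docs: Vec<u32>,
    cursor: usize,
}

impl SortedDocs {
    fn new(docs: Vec<u32>) -> Self {
        SortedDocs { docs, cursor: 0 }
    }
    fn doc(&self) -> u32 {
        self.docs.get(self.cursor).copied().unwrap_or(TERMINATED)
    }
    fn advance(&mut self) -> u32 {
        if self.cursor < self.docs.len() {
            self.cursor += 1;
        }
        self.doc()
    }
}

/// Union without any buffering: each step scans all docsets once.
struct ToySimpleUnion {
    docsets: Vec<SortedDocs>,
    doc: u32,
}

impl ToySimpleUnion {
    fn new(docsets: Vec<SortedDocs>) -> Self {
        let doc = docsets.iter().map(|d| d.doc()).min().unwrap_or(TERMINATED);
        ToySimpleUnion { docsets, doc }
    }
    fn doc(&self) -> u32 {
        self.doc
    }
    fn advance(&mut self) -> u32 {
        let mut next_doc = TERMINATED;
        for docset in &mut self.docsets {
            // Move past the current doc, then track the smallest candidate.
            if docset.doc() <= self.doc {
                docset.advance();
            }
            next_doc = next_doc.min(docset.doc());
        }
        self.doc = next_doc;
        self.doc
    }
}
```

Each step costs one pass over all docsets, but nothing is precomputed, so skipped regions cost nothing beyond that pass.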

src/query/union/simple_union.rs

fulmicoton · 2024-10-21T07:12:15Z

src/query/union/simple_union.rs

+        for docset in &mut self.docsets {
+            if docset.doc() <= self.doc {
+                docset.advance();
+            }
+            next_doc = next_doc.min(docset.doc());
+        }
+        self.doc = next_doc;


If you have a lot of docsets, I think a BinaryHeap might be faster.
Well... you can forget it I think.

I did some tests initially with BinaryHeap, but it was quite slow when there were a lot of docsets. It might have been related to the size of SegmentPostings which I wasn't aware of yet.

I think there is a case for BinaryHeap here, but we would need a custom one to fully utilize its structure, e.g. some early exit when iterating the leaf nodes
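
As a point of comparison, a union built on the standard library's `BinaryHeap` could be sketched as below. This is the plain std heap, not the custom heap with early exit suggested above, and `heap_union` is a hypothetical function over in-memory lists.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merges sorted doc id lists into one sorted, deduplicated list.
/// The heap holds (doc, list index, offset) and pops the smallest doc first.
fn heap_union(lists: &[Vec<u32>]) -> Vec<u32> {
    let mut heap = BinaryHeap::new();
    for (idx, list) in lists.iter().enumerate() {
        if let Some(&doc) = list.first() {
            heap.push(Reverse((doc, idx, 0usize)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((doc, idx, offset))) = heap.pop() {
        // The same doc can come from several lists; emit it only once.
        if out.last() != Some(&doc) {
            out.push(doc);
        }
        if let Some(&next) = lists[idx].get(offset + 1) {
            heap.push(Reverse((next, idx, offset + 1)));
        }
    }
    out
}
```

Every emitted doc costs a pop and a push, each logarithmic in the number of lists, whereas the linear scan touches every docset once per step; which one wins depends on how many docsets there are and how large each entry is.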

* add RegexPhraseQuery
* cleanup
* use Box instead of Vec
* use RefCell instead of term_freq(&mut)
* remove wildcard mode
* move RefCell to outer
* clippy

PSeitz requested a review from fulmicoton October 11, 2024 01:03

PSeitz force-pushed the slop_with_wildcard branch from 028a938 to 3343c08 October 11, 2024 01:28

fulmicoton reviewed Oct 11, 2024

PSeitz added 3 commits October 11, 2024 12:51

cleanup

e5c2d47

use Box instead of Vec

cee6724

use RefCell instead of term_freq(&mut)

93978f7