Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_docids_for_value_range is broken w/certain fast fields that use a GCD inverse #1757

Closed
mmastrac opened this issue Jan 3, 2023 · 3 comments · Fixed by #1760
Closed

get_docids_for_value_range is broken w/certain fast fields that use a GCD inverse #1757

mmastrac opened this issue Jan 3, 2023 · 3 comments · Fixed by #1760
Assignees
Labels

Comments

@mmastrac
Copy link

mmastrac commented Jan 3, 2023

Describe the bug

When inserting fast fields in a certain order, segments will occasionally end up with a column that matches extra documents when calling get_docids_for_value_range using nonexistant values.

The testcase at the bottom of this report illustrates a failing case, though you need to run it repeatedly to cause it to fail.

Expected behaviour: passing fastfield values/ranges to get_docids_for_value_range that cannot possibly exist in a given segment reader should return no matching documents.

Which version of tantivy are you using?

0.19.0

To Reproduce

This code will reproduce the bug, but it won't happen every time. Running it repeatedly will eventually give you a SegmentReader with the min/max values [6341073085727221 3373930417471920086], and that SegmentReader will always match document zero when querying in the invalid value 7999999999999999.


    #[test]
    fn test_gcd_bug() {
        let dir = RamDirectory::create();
        let mut schema_builder = Schema::builder();
        let url_norm_hash_field = schema_builder.add_i64_field("url_norm_hash", FAST | INDEXED);  
        let schema = schema_builder.build();
        let index = Index::create(dir, schema, IndexSettings::default()).unwrap();
        let mut writer = index.writer(50_000_000).unwrap();
        for i in [6341073085727221_i64, 6341073085727221_i64, 6341073085727221_i64, 6341073085727221_i64, 6341073085727221_i64, 6341073085727221_i64, 3373930417471920086_i64, 3373930417471920086_i64, 3373930417471920086_i64, 3373930417471920086_i64] {
            writer.add_document(doc! {
                url_norm_hash_field => i,
            }).unwrap();
        }
        writer.commit();
        let reader = index.reader().unwrap();
        let searcher = reader.searcher();
        for segment in searcher.segment_readers() {
            let field = segment.fast_fields().i64(url_norm_hash_field).unwrap();
            let (min, max) = (field.min_value(), field.max_value());
            println!("{} {}", min, max);
            if 7999999999999999_i64 >= min && 7999999999999999_i64 <= max {
                let mut vec = vec![];
                field.get_docids_for_value_range(7999999999999999_i64..=7999999999999999_i64, 0..u32::MAX, &mut vec);
                println!("{:?}", vec);
            }
        }
    }
@fulmicoton
Copy link
Collaborator

@fulmicoton fulmicoton added the bug label Jan 4, 2023
@fulmicoton
Copy link
Collaborator

confirmed as a bug

@PSeitz
Copy link
Contributor

PSeitz commented Jan 4, 2023

Thanks for the report!
For better performance, we try to convert the users value to the value space of the fast field when running get_docids_for_value_range . With GCD, the value space gets smaller. e.g. [0, 100, 200] is mapped to [0,1,2]. We cannot map a incoming user value of 50 to that user space exactly.

Currently inverse mapping is executed self.monotonic_mapping.inverse(50), which returns 0. Instead we need to detect that we are outside the value space and map to the closest next value. +1 for the start of the range, and -1 for the end of the range

Minimal Example

#[test]
fn test_gcd_bug_regression_1757() {
    let mut schema_builder = Schema::builder();
    let num_field = schema_builder.add_u64_field("url_norm_hash", FAST | INDEXED);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema);
    {
        let mut writer = index.writer_for_tests().unwrap();
        writer
            .add_document(doc! {
                num_field => 100u64,
            })
            .unwrap();
        writer
            .add_document(doc! {
                num_field => 200u64,
            })
            .unwrap();
        writer.commit().unwrap();
    }

    let reader = index.reader().unwrap();
    let searcher = reader.searcher();
    let segment = &searcher.segment_readers()[0];
    let field = segment.fast_fields().u64(num_field).unwrap();
    let mut vec = vec![];
    field.get_docids_for_value_range(150..=150, 0..u32::MAX, &mut vec);
    assert_eq!(vec.len(), 0);
}

Buggy Code

This part of the code needs to be able to handle user data

fn get_docids_for_value_range(
    &self,
    range: RangeInclusive<Output>,
    doc_id_range: Range<u32>,
    positions: &mut Vec<u32>,
) {
    self.from_column.get_docids_for_value_range(
        self.monotonic_mapping.inverse(range.start().clone())
            ..=self.monotonic_mapping.inverse(range.end().clone()),
        doc_id_range,
        positions,
    )
}

PSeitz added a commit that referenced this issue Jan 5, 2023
PSeitz added a commit that referenced this issue Jan 5, 2023
PSeitz added a commit that referenced this issue Jan 10, 2023
PSeitz added a commit that referenced this issue Jan 12, 2023
* handle user input on get_docid_for_value_range

fixes #1757

* pass range as parameter
fulmicoton pushed a commit that referenced this issue Jan 16, 2023
* handle user input on get_docid_for_value_range

fixes #1757

* pass range as parameter
Hodkinson pushed a commit to Hodkinson/tantivy that referenced this issue Jan 30, 2023
* handle user input on get_docid_for_value_range

fixes quickwit-oss#1757

* pass range as parameter
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants