Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing search results with in memory index #1824

Open
EntilZha opened this issue Jan 23, 2023 · 7 comments
Open

Missing search results with in memory index #1824

EntilZha opened this issue Jan 23, 2023 · 7 comments

Comments

@EntilZha
Copy link

I've been working on a small command line app to merge bibtex files and have been using tantivy to index paper titles. I noticed some strange behavior where if I (from python extension):

  1. Create an in memory index
  2. Call writer.add_document in a for loop
  3. Call writer.commit() outside the for loop
    The search results I'd expect to be exact matches (duplicate detection) are not showing up. I reran the same code several times and every so often I do get the correct results, which suggested to me that on commit, some index entries were getting lost. If I have writer.commit inside the for loop, the issue goes away. Likewise, if I use a file-based index, the issue also goes away.

Reading the docs, it looks like even with an in memory index, calling writer.commit() after the for loop and not in it should work, is there something I'm missing or might this be a bug?

If it's not a known issue already and you cant reproduce it, once I finish with my cli I can share the relevant code.

@PSeitz
Copy link
Contributor

PSeitz commented Jan 24, 2023

Thanks for the report. Can you provide something to reproduce that behavior in rust? I'm not very familiar with python

Do you reload your IndexReader after committing? It may have a view on the old segments.

@K0UR05H
Copy link

K0UR05H commented Jul 1, 2023

Hello @PSeitz
I've run into a similar problem. here is a code snippet to reproduce the behavior:

use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::{doc, Index, ReloadPolicy};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let schema = schema_builder.build();

    let index = Index::create_in_ram(schema);

    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::OnCommit)
        .try_into()?;

    let mut index_writer = index.writer(10_000_000)?;
    index_writer.add_document(doc!(
        title => "The Old Man and the Sea"
    ))?;
    index_writer.commit()?;

    let query_parser = QueryParser::for_index(&index, vec![title]);
    let query = query_parser.parse_query("sea")?;

    let searcher = reader.searcher();
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;

    assert!(!top_docs.is_empty());

    Ok(())
}

I guess the problem is ReloadPolicy::OnCommit taking a few milliseconds to reload. I solved this problem by changing the reload policy to manual and calling reader.reload() after index_writer.commit().

@fulmicoton
Copy link
Collaborator

Yes. You can reload the index manually though.

@PSeitz You added the good first issue label? Do you think we should do something here?

@PSeitz
Copy link
Contributor

PSeitz commented Jul 3, 2023

Shouldn't the reload be triggered sync?

@GentBinaku
Copy link

Hi, I would like to take this as a good first issue and work on it. Thanks in advance

@PSeitz
Copy link
Contributor

PSeitz commented Jul 12, 2023

@GentBinaku Thanks, as a starting point the callback in the RamDirectory is https://github.com/quickwit-oss/tantivy/blob/main/src/directory/ram_directory.rs#L234, which updates the IndexReader. Ideally the update would be done before returning from commit(). If that's not possible we should document it.

@giovannicuccu
Copy link
Contributor

Hi, I dug into the issue, and here is what I found:

  • the behavior is present even if you use mmap_directory, it's hidden by the test config (this line const POLLING_INTERVAL: Duration = Duration::from_millis(if cfg!(test) { 1 } else { 500 }); in file_watcher.rs) that executes the reload thread every 1 ms
  • The fix on ram directory could be easy, in the implementation of atomic_write you have to add a wait() for watch_router.broadcast(), doing so you align it with the one implemented in file_watchers (let _ = callbacks.broadcast().wait();)

As of today, the standard behavior is to reload the index asynchronously on commit, with a different delay based on the dir implementation. Is it worth making the reload synchronous?

Doing so would require some changes (such as changing or dropping FileWatcher) and I cannot evaluate the consequences.

While analyzing the code I noticed that the watching logic for the directory is used only by the reloadPolicy feature. If there is no plan to add more events it could be simplified leading to cleaner code. I'd like to give it a try if you decide it's worth pursuing.

P.S. I've started digging in the code base for a few hours so I may have missed something.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants