-
-
Notifications
You must be signed in to change notification settings - Fork 721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Missing search results with in memory index #1824
Comments
Thanks for the report. Can you provide something to reproduce that behavior in rust? I'm not very familiar with python Do you reload your |
Hello @PSeitz use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::{doc, Index, ReloadPolicy};
fn main() -> tantivy::Result<()> {
let mut schema_builder = Schema::builder();
let title = schema_builder.add_text_field("title", TEXT | STORED);
let schema = schema_builder.build();
let index = Index::create_in_ram(schema);
let reader = index
.reader_builder()
.reload_policy(ReloadPolicy::OnCommit)
.try_into()?;
let mut index_writer = index.writer(10_000_000)?;
index_writer.add_document(doc!(
title => "The Old Man and the Sea"
))?;
index_writer.commit()?;
let query_parser = QueryParser::for_index(&index, vec![title]);
let query = query_parser.parse_query("sea")?;
let searcher = reader.searcher();
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
assert!(!top_docs.is_empty());
Ok(())
} I guess the problem is |
Yes. You can reload the index manually though. @PSeitz You added the good first issue label? Do you think we should do something here? |
Shouldn't the reload be triggered sync? |
Hi, I would like to take this as a good first issue and work on it. Thanks in advance |
@GentBinaku Thanks, as a starting point the callback in the RamDirectory is https://github.com/quickwit-oss/tantivy/blob/main/src/directory/ram_directory.rs#L234, which updates the |
Hi, I dug into the issue, and here is what I found:
As of today, the standard behavior is to reload the index asynchronously on commit, with a different delay based on the dir implementation. Is it worth making the reload synchronous? Doing so would require some changes (such as changing or dropping FileWatcher) and I cannot evaluate the consequences. While analyzing the code I noticed that the watching logic for the directory is used only by the reloadPolicy feature. If there is no plan to add more events it could be simplified leading to cleaner code. I'd like to give it a try if you decide it's worth pursuing. P.S. I've started digging in the code base for a few hours so I may have missed something. |
I've been working on a small command line app to merge bibtex files and have been using tantivy to index paper titles. I noticed some strange behavior where if I (from python extension):
writer.add_document
in a for loopwriter.commit()
outside the for loopThe search results I'd expect to be exact matches (duplicate detection) are not showing up. I reran the same code several times and every so often I do get the correct results, which suggested to me that on commit, some index entries were getting lost. If I have
writer.commit
inside the for loop, the issue goes away. Likewise, if I use a file-based index, the issue also goes away.Reading the docs, it looks like even with an in memory index, calling
writer.commit()
after the for loop and not in it should work, is there something I'm missing or might this be a bug?If it's not a known issue already and you cant reproduce it, once I finish with my cli I can share the relevant code.
The text was updated successfully, but these errors were encountered: