Change TokenFilter trait to simplify (a bit) the boxing of filters #2101
Conversation
The branch was updated: a80bc6b → 5508fa4 → 2cab111.
Codecov Report

@@ Coverage Diff @@
##             main    #2101      +/-   ##
==========================================
+ Coverage   94.39%   94.42%   +0.03%
==========================================
  Files         321      321
  Lines       60612    60612
==========================================
+ Hits        57213    57235      +22
+ Misses      3399     3377      -22
HDFS benchmark configurations:
- tantivy 0.19
- tantivy main branch, with the default tokenizer built with boxed token filters (mostly equivalent to what was in tantivy 0.19)
- tantivy main branch, with the default tokenizer
If I understand the PR correctly, this attempts to get the best of both worlds. You mentioned you are seeing a regression; how do we explain it?
  TextAnalyzer {
-     tokenizer: self.tokenizer.box_clone(),
+     tokenizer: Box::new(tokenizer),
I would have preferred not to have this public constructor, and instead have an extra `TextAnalyzerBuilder::build_with_dynamic_token_filters(token_filters: Vec<BoxTokenFilter>)`, with some documentation explaining the purpose of the two ways to add filters.
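To make the suggestion concrete, here is a minimal sketch of what such a builder entry point could look like. All names here (`TokenFilterDyn`, `build_with_dynamic_token_filters`, the single-string `analyze` helper) are illustrative assumptions, not tantivy's actual API; the point is only that dynamically boxed filters go through one explicit builder method instead of a public constructor.

```rust
// Hypothetical object-safe filter trait standing in for tantivy's
// boxed token filters; real filters operate on token streams.
trait TokenFilterDyn {
    fn apply(&self, token: &mut String);
}

struct LowerCase;
impl TokenFilterDyn for LowerCase {
    fn apply(&self, token: &mut String) {
        *token = token.to_lowercase();
    }
}

struct TextAnalyzerBuilder {
    filters: Vec<Box<dyn TokenFilterDyn>>,
}

impl TextAnalyzerBuilder {
    fn new() -> Self {
        TextAnalyzerBuilder { filters: Vec::new() }
    }

    // The entry point suggested in the review: accept an explicit list of
    // boxed filters, rather than exposing a public TextAnalyzer constructor.
    fn build_with_dynamic_token_filters(
        mut self,
        token_filters: Vec<Box<dyn TokenFilterDyn>>,
    ) -> TextAnalyzer {
        self.filters.extend(token_filters);
        TextAnalyzer { filters: self.filters }
    }
}

struct TextAnalyzer {
    filters: Vec<Box<dyn TokenFilterDyn>>,
}

impl TextAnalyzer {
    // Toy single-token pipeline: run every boxed filter in order.
    fn analyze(&self, token: &str) -> String {
        let mut t = token.to_string();
        for f in &self.filters {
            f.apply(&mut t);
        }
        t
    }
}

fn main() {
    let analyzer = TextAnalyzerBuilder::new()
        .build_with_dynamic_token_filters(vec![Box::new(LowerCase)]);
    assert_eq!(analyzer.analyze("HeLLo"), "hello");
}
```

This keeps the statically typed builder path and the dynamic path clearly separated, which is what the documentation comment above asks for.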
      tail: T,
  }

- impl<'a, T: TokenStream> TokenStream for AsciiFoldingFilterTokenStream<'a, T> {
+ impl<'a, T: TokenStream> TokenStream for AsciiFoldingFilterTokenStream<T> {
clippy...
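For context, the lint in question concerns an unused lifetime parameter: once the stream type only owns its wrapped `tail`, the `'a` on the type itself is dead weight and clippy flags it. A toy reconstruction, using `Iterator` in place of tantivy's `TokenStream` trait (so all signatures here are illustrative, not the real ones):

```rust
// After the change, the wrapper owns `tail: T` directly, so no lifetime
// parameter is needed on the type; `struct Foo<'a, T>` with no borrowed
// field is exactly what clippy's unused-lifetime lints complain about.
struct AsciiFoldingFilterTokenStream<T> {
    tail: T,
}

impl<T: Iterator<Item = char>> Iterator for AsciiFoldingFilterTokenStream<T> {
    type Item = char;
    fn next(&mut self) -> Option<char> {
        // A real implementation would fold accented characters to ASCII;
        // this toy version simply forwards the wrapped iterator.
        self.tail.next()
    }
}

fn main() {
    let s: String =
        AsciiFoldingFilterTokenStream { tail: "abc".chars() }.collect();
    assert_eq!(s, "abc");
}
```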
src/tokenizer/lower_caser.rs (outdated):
      tokenizer,
      fn filter<T: TokenStream>(&self, token_stream: T) -> Self::OutputTokenStream<T> {
          LowerCaserTokenStream {
              tail: token_stream,
              buffer: String::new(),
you are reverting here the work done by Pascal to reduce the number of allocations we have. Is this necessary?
@PSeitz can you confirm?
Yes indeed :/ On each `tokenizer.token_stream`, it will call `filter` and thus will do a `String::new()`. That was not the case before.
I'd prefer if we can have an instance with a buffer in the pipeline like before
Yes, of course. I should have fixed it by adding the allocation at the `TokenFilter` level.
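The allocation concern can be sketched in miniature. These are not tantivy's real types; the contrast is between allocating a scratch `String` on every `filter` call (the regression being discussed) and reusing a buffer owned by the filter instance (the pre-existing design):

```rust
// Per-call allocation: every new token stream pays for String::new()
// and for regrowing the buffer, even though the previous capacity
// could have been reused.
struct LowerCaserPerCall;

impl LowerCaserPerCall {
    fn filter(&self, token: &str) -> String {
        let mut buffer = String::new(); // fresh allocation on every call
        for c in token.chars() {
            buffer.extend(c.to_lowercase());
        }
        buffer
    }
}

// Buffer held by the instance: allocated once, and clear() keeps the
// existing capacity, so steady-state calls allocate nothing.
struct LowerCaserReusedBuffer {
    buffer: String,
}

impl LowerCaserReusedBuffer {
    fn new() -> Self {
        LowerCaserReusedBuffer { buffer: String::new() }
    }

    fn filter(&mut self, token: &str) -> &str {
        self.buffer.clear(); // retains capacity from previous calls
        for c in token.chars() {
            self.buffer.extend(c.to_lowercase());
        }
        &self.buffer
    }
}

fn main() {
    let per_call = LowerCaserPerCall;
    assert_eq!(per_call.filter("HeLLo"), "hello");

    let mut reused = LowerCaserReusedBuffer::new();
    assert_eq!(reused.filter("WORLD"), "world");
    assert_eq!(reused.filter("Rust"), "rust");
}
```

The second shape is what "an instance with a buffer in the pipeline" refers to: the cost moves from once per `token_stream` call to once per filter construction.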
closing in favor of #2110
I have a preference for this PR instead of this one: #2096

The main difference is that I changed the `TokenFilter` trait to make it independent of the `Tokenizer` trait, as a `TokenFilter` should be defined only by its input token stream and its output token stream. The main benefit compared to #2096 is that it avoids creating a `BoxTokenizer` and implementing the `Tokenizer` trait on it... it does not seem to be a substantial benefit, but it is cleaner. For example, in #2096 I tried to bypass `BoxTokenizer` and implement the `Tokenizer` trait directly on `Box<dyn BoxableTokenizer>`, but I got a stack overflow as I was recursively calling the same method `box_token_stream`, and the compiler did not tell me anything...

Bench results:

I also ran the benchmark on tantivy 0.19; before the introduction of GATs, boxed token streams were used everywhere at that time.

default-tokenize-alice time: [1.0371 ms 1.0376 ms 1.0383 ms]

So we have a performance regression on the boxed token filters, unfortunately. I tried to change the code to have something equivalent to what we had on 0.19, but the bench results did not change.
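One plausible source of such a regression (stated here as an assumption, not a profile of this PR) is the dispatch difference between generic and boxed filter chains: with generics the chain is monomorphized and calls can be inlined, while `Box<dyn ...>` forces a virtual call per token. A minimal illustration with made-up names:

```rust
// Toy filter trait; tantivy's real TokenFilter/TokenStream traits
// are richer, this only isolates static vs. dynamic dispatch.
trait Filter {
    fn apply(&self, c: char) -> char;
}

struct Lower;
impl Filter for Lower {
    fn apply(&self, c: char) -> char {
        c.to_ascii_lowercase()
    }
}

// Static dispatch: the concrete filter type is known at compile time,
// so `apply` is monomorphized and typically inlined per character.
fn run_static<F: Filter>(f: &F, text: &str) -> String {
    text.chars().map(|c| f.apply(c)).collect()
}

// Dynamic dispatch: every `apply` goes through the vtable, which can
// block inlining and cost a few cycles per token.
fn run_dynamic(f: &dyn Filter, text: &str) -> String {
    text.chars().map(|c| f.apply(c)).collect()
}

fn main() {
    // Both produce the same result; only the call mechanism differs.
    assert_eq!(run_static(&Lower, "ABC"), "abc");
    assert_eq!(run_dynamic(&Lower, "ABC"), "abc");
}
```

Whether this accounts for the measured gap would need a profile; it is simply the usual suspect when boxing a hot per-token path.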