-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Is it correct for facets to assume positive aggregation values? #12585
Comments
Yeah, this is a good callout. I ran into this when adding more flexibility to association faceting a while back (making note that supporting, e.g., "min" would require rethinking these assumptions). My opinion is that the current assumption makes sense for the faceting support currently available, but I know there's conversation going on more generally about improving (rethinking?) aggregation capabilities in Lucene. My preference on this would be to, 1) leave this current behavior in place unless there is a use-case that's immediately blocked by it, but 2) include it in some broader rethinking of Lucene aggregation capabilities. As a side-note on that, I wonder if a successful approach to moving forward with some new aggregation thinking would be to not try to modify the faceting module as-is, but rather spin up a new "aggregations" module under "sandbox" to start sketching out ideas. I think it will be difficult to retrofit more flexible aggregation capabilities into the faceting API that exists today, but maybe I'm wrong? OK, I'm off in the weeds now... |
I'm definitely leaning that way too right now. @Shradha26 and I were considering this (#12553). I initially thought faceting can fill most of the gaps we were identifying. What got me to change my mind was seeing how much extra work there is to bring in a new feature when it has to be reimplemented for multiple faceting classes. If we made facets generic, that would already be a significant rework, so maybe we might as well start fresh? There's other things that are difficult to do with facets - arbitrary aggregation groups come to mind, but genericity is the one that convinced me that it's a better strategy to have a dedicated aggregation engine. I'm sure there are counter-arguments as well or maybe I'm overstating the impact of genericity. |
I thought some more about this issue and it really seems like a bug that I can have a non-positive aggregation value, but I can't return it in top children.
|
I think I'd be curious if there are many real-world use-cases out there that have a non-positive association between each document and facet label? I would guess it's pretty uncommon. I actually can't really contrive an example either. @stefanvodita have you encountered real-world use-cases that would benefit from this? Apologies if you describe it somewhere and I'm missing it. |
I have written a unit test to showcase the problem, but I haven't seen it in the wild. Non-positive associations make sense when we're dealing with data that can naturally be non-positive. Let's say we're managing a school's report cards. Each student has a corresponding document. They take multiple-choice tests, which penalize wrong answers to discourage randomly filling in the test. Each test the students take during the year is represented by a facet label and each student has their score associated with the label. Not all scores would be positive and aggregations over those scores may or may not be positive. Maybe that's too much speculation on my part though. |
This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets. 1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in #12553. 2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of #12576. 3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses #11282. 4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes #12585. 5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in #12546. 6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded. 7. Add tests which were missing for sparse/dense values and non-positive aggregations.
This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets. 1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in apache#12553. 2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of apache#12576. 3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses apache#11282. 4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes apache#12585. 5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in apache#12546. 6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded. 7. Add tests which were missing for sparse/dense values and non-positive aggregations.
#12966 (#13358) Reduce duplication in taxonomy facets; always do counts (#12966) This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets. 1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in #12553. 2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of #12576. 3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses #11282. 4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes #12585. 5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in #12546. 6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded. 7. Add tests which were missing for sparse/dense values and non-positive aggregations.
Description
In
IntTaxonomyFacets
andFloatTaxonomyFacets
when wegetTopChildrenForPath
we maintain a priority queue of aggregation values and corresponding ordinals.If an aggregation value is not positive, we discard it (code). If it is not larger than the lowest value in the queue (initially set to 0), we also discard it (code).
It looks like the facets assume aggregation results should be positive, but I think we could have negatives or zeroes. I wrote a unit test to demo this.
The current behavior makes sense for counts. Counts are always non-negative and we probably don't care about counts of 0. But does it make sense for other aggregations?
The text was updated successfully, but these errors were encountered: