[DISCUSS] Identifying Gaps in Lucene’s Faceting #12553
Comments
So many ideas here! It's clear we have some room to grow this API. I wonder if we could organize them into a plan with dependencies and priorities. Also, I'm not sure I understand some of the ideas. E.g., what do you mean by an aggregation group? Is this like counting documents that are either red or blue? It seems like the first few things you called out (dynamic groupings, associating data with ordinals, facet aggregations that depend on each other, including nesting) are "user-facing" features, and the last two are more like implementation or low-level API changes. Do we need to do the low-level changes to support these new user features?
Thanks, Mike!
Yes, exactly.
For some of the ideas in here, yes. For example, the idea of adding support for data associated with groups (ordinals) would require us to add some new behaviours to the taxonomy index, especially around updates to the index. I've also mentioned how Faceting implementations are currently not generic, in that any improvement to IntTaxonomyFacets has to be reimplemented for FloatTaxonomyFacets. As for the new features, it may be tricky to implement them over the existing Faceting implementation because it would require changing some assumptions. For an aggregation, I think we'll want to implement two concepts - an
Yes, I think that's a great idea. I'll try to organise the ideas in a dependency graph.
One dependency I can point out is between the idea of nested aggregations and that of specific aggregation targets. With nested aggregations, we want to target some aggregation groups and exclude others. In the example above, we exclude nationalities from the count aggregation and we exclude authors from the max aggregation.
This is a large change, refactoring most of the taxonomy facets code and changing internal behaviour, without changing the API. There are specific API changes this sets us up to do later, e.g. retrieving counts from aggregation facets.
1. Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. Addresses genericity issue mentioned in #12553.
2. As a consequence, introduce sparse values to FloatTaxonomyFacets, which previously used dense values always. This issue is part of #12576.
3. Compute counts for all taxonomy facets always, which enables us to add an API to retrieve counts for association facets in the future. Addresses #11282.
4. As a consequence of having counts, we can check whether we encountered a label while faceting (count > 0), while previously we relied on the aggregation value to be positive. Closes #12585.
5. Introduce the idea of doing multiple aggregations in one go, with association facets doing the aggregation they were already doing, plus a count. We can extend to an arbitrary number of aggregations, as suggested in #12546.
6. Don't change the API. The only change in behaviour users should notice is the fix for non-positive aggregation values, which were previously discarded.
7. Add tests which were missing for sparse/dense values and non-positive aggregations.
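To make the point about counts versus positive aggregation values concrete, here is a minimal sketch in plain Java. It is a toy model and not Lucene's code: the class `SeenLabelSketch`, its methods, and the array-per-ordinal layout are all hypothetical names chosen for illustration. It sums association values per ordinal, keeps a count alongside, and compares the two ways of deciding whether a label was encountered.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (hypothetical, not Lucene code): per-ordinal sums with per-ordinal counts. */
final class SeenLabelSketch {

    /** Adds each value to the sum for its ordinal and bumps that ordinal's count. */
    static void accumulate(int[] ords, float[] vals, float[] sums, int[] counts) {
        for (int i = 0; i < ords.length; i++) {
            sums[ords[i]] += vals[i];
            counts[ords[i]]++;
        }
    }

    /** Value-based check: an ordinal is reported only if its aggregated value is positive. */
    static List<Integer> seenByValue(float[] sums) {
        List<Integer> seen = new ArrayList<>();
        for (int ord = 0; ord < sums.length; ord++) {
            if (sums[ord] > 0) {
                seen.add(ord);
            }
        }
        return seen;
    }

    /** Count-based check: an ordinal is reported if at least one value was aggregated for it. */
    static List<Integer> seenByCount(int[] counts) {
        List<Integer> seen = new ArrayList<>();
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] > 0) {
                seen.add(ord);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        float[] sums = new float[3];
        int[] counts = new int[3];
        // Ordinal 0 gets one positive value, ordinal 1 gets two negative values,
        // ordinal 2 is never encountered.
        accumulate(new int[] {0, 1, 1}, new float[] {2.5f, -1f, -2f}, sums, counts);
        System.out.println("by value: " + seenByValue(sums));
        System.out.println("by count: " + seenByCount(counts));
    }
}
```

In this toy input, ordinal 1 has a negative sum, so the value-based check drops it even though it was encountered twice, while the count-based check keeps it. That is the kind of non-positive aggregation the description says was previously discarded.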
Thanks @stefanvodita, @Shradha26 for the thorough review of gaps in Lucene's faceting. In summary, I think this leans towards supporting aggregation capabilities in Lucene out of the box: the present Faceting structure is a stepping stone to aggregations, but it requires a lot of overhead from its consumers to actually implement them. Some of the points from my discussion with @msfroh: As you already mentioned, I was looking into OpenSearch's code base for aggregations, and it looks like decoupling aggregations for Lucene might not be very straightforward, but it's worth the effort if the Lucene community thinks that rich aggregations would be a good inclusion in Lucene.
Hi @sandeshkr419! I think it's a good idea to support richer aggregations at a lower level, in Lucene. If the OpenSearch community wants to migrate some of the aggregation functionality to Lucene and make it available for more people, that's great! Even just the cross-pollination of ideas between the projects should be useful.
+1 to cross-fertilize between OpenSearch's strong aggregations and Lucene's mostly-limited-to-counting (?) facets. If we cross-fertilize carefully, Lucene could provide the strong low level base API / building blocks for doing aggregations efficiently, working properly with modern Lucene features like intra-query concurrency (soon to also be decoupled from the index's segment geometry, I hope), first class query timeouts ( |
#12966 (#13358) Reduce duplication in taxonomy facets; always do counts (#12966)
#13335 is an interesting example where Lucene could more efficiently implement faceting for numeric ranges using points, directly, instead of the two-step process facets uses today (collect into a bitset, then count from there based on doc values, I think?).
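For readers unfamiliar with the two-step shape mentioned here, the following is a rough plain-Java sketch of it, under the same "I think?" caveat as the comment. Nothing in it is a Lucene API: `TwoStepRangeCountSketch`, the `long[]` standing in for per-document doc values, and the inclusive `{min, max}` range encoding are all assumptions made for illustration.

```java
import java.util.BitSet;

/** Toy illustration (hypothetical, not Lucene code) of collect-then-count range faceting. */
final class TwoStepRangeCountSketch {

    /** Step 1: record every matching doc ID in a bitset. */
    static BitSet collect(int[] matchingDocs, int maxDoc) {
        BitSet hits = new BitSet(maxDoc);
        for (int doc : matchingDocs) {
            hits.set(doc);
        }
        return hits;
    }

    /**
     * Step 2: visit each collected doc, read its value, and increment every range
     * that contains it. Each entry of ranges is {minInclusive, maxInclusive}.
     */
    static int[] countRanges(BitSet hits, long[] valueByDoc, long[][] ranges) {
        int[] counts = new int[ranges.length];
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            long value = valueByDoc[doc];
            for (int r = 0; r < ranges.length; r++) {
                if (value >= ranges[r][0] && value <= ranges[r][1]) {
                    counts[r]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] valueByDoc = {5, 15, 25, 8, 30};   // stand-in for a numeric doc values field
        BitSet hits = collect(new int[] {0, 1, 3, 4}, valueByDoc.length);
        long[][] ranges = {{0, 10}, {10, 20}, {0, 100}};
        int[] counts = countRanges(hits, valueByDoc, ranges);
        for (int r = 0; r < ranges.length; r++) {
            System.out.println("[" + ranges[r][0] + ", " + ranges[r][1] + "]: " + counts[r]);
        }
    }
}
```

The cost of step 2 grows with the number of hits, since every collected document has its value read individually. That per-document work is what a points-based approach, as suggested in the comment, would aim to avoid.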
I’d like to gather a list of areas where Lucene’s support for aggregations can be improved and discuss if faceting can be augmented to offer that support or if it would need to be separate functionality. Please suggest more ideas or challenge the ones listed!
Description
Information Retrieval platforms built on top of Lucene like Solr, Elasticsearch, and OpenSearch have rich aggregation engines that are different from what Lucene natively supports. Lucene has some unique ideas to make aggregation computation efficient. Some examples are -
Here are some ideas @stefanvodita and I encountered in our work and through exploration of what the other platforms support -
New features