Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a MergePolicy wrapper that preserves search concurrency? #12877

Closed
jpountz opened this issue Dec 5, 2023 · 4 comments · Fixed by #13430
Closed

Add a MergePolicy wrapper that preserves search concurrency? #12877

jpountz opened this issue Dec 5, 2023 · 4 comments · Fixed by #13430

Comments

@jpountz
Copy link
Contributor

jpountz commented Dec 5, 2023

Description

We have an issue about decoupling search concurrency from index geometry (#9721), but this comes with trade-offs as the per-segment bit of search is hard to parallelize. Maybe we should also introduce a merge policy wrapper that tries to preserve a search concurrency of N by preventing the creation of segments of more than maxDoc/N docs?

@mikemccand
Copy link
Member

+1, I like this idea. It might be implemented by having TieredMergePolicy dynamically set the max segment size (in doc count, not just bytes) as a function of total maxDoc in the index?

Perhaps when the index is tiny it doesn't do the maxDoc/N, but only once enough maxDoc have been index, start enforcing that ...

@jpountz
Copy link
Contributor Author

jpountz commented Dec 7, 2023

+1, I like this idea.

I have a vague recollection of you saying you already implemented something like that, am I making this up? (it's quite possible, I struggle to keep lots of stuff in memory!)

It might be implemented by having TieredMergePolicy dynamically set the max segment size

One potential issue that comes to mind with this approach is that TieredMergePolicy gives up on finding balanced merges when they reach the maximum merged segment size and just packs as many segments as possible as long as the sum of their sizes is under the threshold, so this could potentially worsen write amplification?

Perhaps when the index is tiny it doesn't do the maxDoc/N, but only once enough maxDoc have been index, start enforcing that ...

+1

@mikemccand
Copy link
Member

am I making this up?

Ha! No, you are not hallucinating @jpountz! We do have something like this for Amazon product search -- it's crucial for our usage to keep long-pole query latencies low by maximizing concurrency -- but it might just be as stupid as "setMaxMergedSegmentMB" to something "just right" for our usage. I'll poke around inside and see if our impl is not too embarrassing to share ;)

It might be implemented by having TieredMergePolicy dynamically set the max segment size

One potential issue that comes to mind with this approach is that TieredMergePolicy gives up on finding balanced merges when they reach the maximum merged segment size and just packs as many segments as possible as long as the sum of their sizes is under the threshold, so this could potentially worsen write amplification?

Hmm you're right -- TMP would spend more time doing this "unbalanced packing", though, presumably it would sort of run out of options because it gobbles up the smallish segments aggressively, maybe? Hard to visualize...

@mikemccand
Copy link
Member

am I making this up?

Ha! No, you are not hallucinating @jpountz! We do have something like this for Amazon product search -- it's crucial for our usage to keep long-pole query latencies low by maximizing concurrency -- but it might just be as stupid as "setMaxMergedSegmentMB" to something "just right" for our usage. I'll poke around inside and see if our impl is not too embarrassing to share ;)

OK well our (Amazon product search's) implementation is sorta messy: we subclass TMP and override findMerges to dynamically change maxMergedSegmentMB as a function of the total index size at this moment, and then return super.findMerges(). It works but it's not so clean. I would prefer for this issue that we make this a first class feature of TMP? Something like setMinSegmentCount or setBalancedSegmentCount or so?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants