Add a MergePolicy wrapper that preserves search concurrency? #12877

jpountz · 2023-12-05T08:48:58Z

Description

We have an issue about decoupling search concurrency from index geometry (#9721), but this comes with trade-offs as the per-segment bit of search is hard to parallelize. Maybe we should also introduce a merge policy wrapper that tries to preserve a search concurrency of N by preventing the creation of segments of more than maxDoc/N docs?
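To make the proposal concrete, here is a minimal sketch of the cap arithmetic only. `SegmentDocCap` and both method names are hypothetical and not part of Lucene; rounding the division up is also an assumption, since the description only says "maxDoc/N".

```java
/** Hypothetical helper (not a Lucene class): doc-count cap for a target search concurrency. */
final class SegmentDocCap {
  private SegmentDocCap() {}

  /**
   * Largest doc count a merged segment may have so that no single segment
   * holds more than roughly 1/targetConcurrency of the index.
   */
  static long maxDocsPerSegment(long totalMaxDoc, int targetConcurrency) {
    if (totalMaxDoc < 0 || targetConcurrency < 1) {
      throw new IllegalArgumentException("totalMaxDoc must be >= 0 and targetConcurrency >= 1");
    }
    // Ceiling division (an assumption): N segments of this size cover the whole index.
    return (totalMaxDoc + targetConcurrency - 1) / targetConcurrency;
  }

  /** True if merging segments with these doc counts would produce a segment above the cap. */
  static boolean wouldExceedCap(long[] docCounts, long cap) {
    long sum = 0;
    for (long c : docCounts) {
      sum += c;
    }
    return sum > cap;
  }

  public static void main(String[] args) {
    long cap = maxDocsPerSegment(10_000_000L, 8);
    System.out.println("cap = " + cap);
    System.out.println("merge rejected = " + wouldExceedCap(new long[] {900_000, 500_000}, cap));
  }
}
```

A wrapper along these lines would drop any candidate merge for which `wouldExceedCap` returns true and let everything else through.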


mikemccand · 2023-12-07T12:42:34Z

+1, I like this idea. It might be implemented by having TieredMergePolicy dynamically set the max segment size (in doc count, not just bytes) as a function of total maxDoc in the index?

Perhaps when the index is tiny it doesn't do the maxDoc/N, but only once enough docs have been indexed, start enforcing that ...
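A rough sketch of that gating, assuming a hypothetical `minDocsToEnforce` threshold (the thread does not say what "enough" would be, so the value used in any real setup is an open question):

```java
/** Hypothetical helper (not a Lucene class): only cap segment size once the index is big enough. */
final class TinyIndexGate {
  private TinyIndexGate() {}

  /**
   * Returns Long.MAX_VALUE (no cap) while the index has fewer than minDocsToEnforce docs,
   * and maxDoc/N afterwards. minDocsToEnforce is an illustrative parameter.
   */
  static long effectiveCap(long totalMaxDoc, int targetConcurrency, long minDocsToEnforce) {
    if (targetConcurrency < 1) {
      throw new IllegalArgumentException("targetConcurrency must be >= 1");
    }
    if (totalMaxDoc < minDocsToEnforce) {
      return Long.MAX_VALUE; // tiny index: let merges proceed unconstrained
    }
    // Never return a cap below 1 doc, even for degenerate inputs.
    return Math.max(1L, totalMaxDoc / targetConcurrency);
  }

  public static void main(String[] args) {
    System.out.println(effectiveCap(5_000L, 8, 100_000L) == Long.MAX_VALUE);
    System.out.println(effectiveCap(800_000L, 8, 100_000L));
  }
}
```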

jpountz · 2023-12-07T13:15:11Z

> +1, I like this idea.

I have a vague recollection of you saying you already implemented something like that, am I making this up? (it's quite possible, I struggle to keep lots of stuff in memory!)

> It might be implemented by having TieredMergePolicy dynamically set the max segment size

One potential issue that comes to mind with this approach is that TieredMergePolicy gives up on finding balanced merges when they reach the maximum merged segment size and just packs as many segments as possible as long as the sum of their sizes is under the threshold, so this could potentially worsen write amplification?
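To illustrate the concern, here is a deliberately simplified model of "pack as many segments as possible under the threshold". It is not TieredMergePolicy's actual selection code; `PackingSketch` and its methods are made up for this example, and sizes are in arbitrary units.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified illustration only; TieredMergePolicy's real selection logic is more involved. */
final class PackingSketch {
  private PackingSketch() {}

  /** Greedily picks segments, largest first, while the running sum stays within the cap. */
  static List<Long> packUnderCap(long[] sizes, long cap) {
    long[] sorted = sizes.clone();
    Arrays.sort(sorted); // ascending, so iterate from the end for largest-first
    List<Long> picked = new ArrayList<>();
    long sum = 0;
    for (int i = sorted.length - 1; i >= 0; i--) {
      if (sum + sorted[i] <= cap) {
        picked.add(sorted[i]);
        sum += sorted[i];
      }
    }
    return picked;
  }

  /** Share of the merge taken by its largest input; close to 1.0 means very unbalanced. */
  static double largestShare(List<Long> merge) {
    long sum = 0;
    long max = 0;
    for (long s : merge) {
      sum += s;
      max = Math.max(max, s);
    }
    return sum == 0 ? 0.0 : (double) max / sum;
  }

  public static void main(String[] args) {
    List<Long> merge = packUnderCap(new long[] {900, 50, 30, 10, 400}, 1000);
    System.out.println(merge + " largestShare=" + largestShare(merge));
  }
}
```

In this toy run the merge rewrites a 900-unit segment to absorb only 90 units of smaller ones, which is the kind of unbalanced merge that would raise write amplification if a lower cap made it more frequent.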

> Perhaps when the index is tiny it doesn't do the maxDoc/N, but only once enough docs have been indexed, start enforcing that ...

+1

mikemccand · 2023-12-07T14:23:00Z

> am I making this up?

Ha! No, you are not hallucinating @jpountz! We do have something like this for Amazon product search -- it's crucial for our usage to keep long-pole query latencies low by maximizing concurrency -- but it might just be as stupid as "setMaxMergedSegmentMB" to something "just right" for our usage. I'll poke around inside and see if our impl is not too embarrassing to share ;)

> > It might be implemented by having TieredMergePolicy dynamically set the max segment size

> One potential issue that comes to mind with this approach is that TieredMergePolicy gives up on finding balanced merges when they reach the maximum merged segment size and just packs as many segments as possible as long as the sum of their sizes is under the threshold, so this could potentially worsen write amplification?

Hmm you're right -- TMP would spend more time doing this "unbalanced packing", though, presumably it would sort of run out of options because it gobbles up the smallish segments aggressively, maybe? Hard to visualize...

mikemccand · 2023-12-14T15:57:20Z

> > am I making this up?

> Ha! No, you are not hallucinating @jpountz! We do have something like this for Amazon product search -- it's crucial for our usage to keep long-pole query latencies low by maximizing concurrency -- but it might just be as stupid as "setMaxMergedSegmentMB" to something "just right" for our usage. I'll poke around inside and see if our impl is not too embarrassing to share ;)

OK well our (Amazon product search's) implementation is sorta messy: we subclass TMP and override findMerges to dynamically change maxMergedSegmentMB as a function of the total index size at this moment, and then return super.findMerges(). It works but it's not so clean. I would prefer for this issue that we make this a first class feature of TMP? Something like setMinSegmentCount or setBalancedSegmentCount or so?
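The shape of that approach, as a self-contained sketch that does not depend on Lucene: `BasePolicySketch` is a stand-in for TMP, and its `findMerges` takes a plain list of segment sizes and returns the cap in force, whereas Lucene's real `findMerges` takes a merge trigger, the segment infos and a merge context and returns a merge specification. Dividing the total size by a target concurrency is an assumed formula; the actual function used in the implementation described above is not given here.

```java
import java.util.List;

/** Stand-in for a merge policy with a mutable size cap. This is NOT Lucene's TieredMergePolicy. */
class BasePolicySketch {
  private double maxMergedSegmentMB = 1024.0; // arbitrary starting value for the sketch

  void setMaxMergedSegmentMB(double mb) {
    this.maxMergedSegmentMB = mb;
  }

  double getMaxMergedSegmentMB() {
    return maxMergedSegmentMB;
  }

  /** Pretend merge selection: returns the cap that selection would run under. */
  double findMerges(List<Double> segmentSizesMB) {
    return maxMergedSegmentMB;
  }
}

/** Recomputes the cap from the current index size, then delegates to the parent. */
class ConcurrencyPreservingPolicySketch extends BasePolicySketch {
  private final int targetConcurrency; // hypothetical setting, the "N" from the issue

  ConcurrencyPreservingPolicySketch(int targetConcurrency) {
    if (targetConcurrency < 1) {
      throw new IllegalArgumentException("targetConcurrency must be >= 1");
    }
    this.targetConcurrency = targetConcurrency;
  }

  @Override
  double findMerges(List<Double> segmentSizesMB) {
    double totalMB = 0;
    for (double s : segmentSizesMB) {
      totalMB += s;
    }
    // Assumed formula: no merged segment larger than total size / N.
    setMaxMergedSegmentMB(totalMB / targetConcurrency);
    return super.findMerges(segmentSizesMB);
  }

  public static void main(String[] args) {
    BasePolicySketch policy = new ConcurrencyPreservingPolicySketch(4);
    System.out.println(policy.findMerges(List.of(100.0, 200.0, 100.0)));
  }
}
```

Mutating the cap inside `findMerges` is what makes this "not so clean": the setter is a side effect of every merge selection call, which is part of the argument for a first-class option instead.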

jpountz added the type:enhancement label Dec 5, 2023

github-project-automation bot added this to OpenSearch Lucene & Core Performance Tracking Dec 5, 2023

github-project-automation bot moved this to Open in OpenSearch Lucene & Core Performance Tracking Dec 5, 2023

carlosdelest mentioned this issue May 27, 2024

Add target search concurrency to TieredMergePolicy #13430

Merged

jpountz closed this as completed in #13430 Jul 17, 2024

github-project-automation bot moved this from Open to Closed in OpenSearch Lucene & Core Performance Tracking Jul 17, 2024
