Global limit for MSQ controller tasks implemented #16889

Merged

Conversation

nozjkoitop
Contributor

@nozjkoitop nozjkoitop commented Aug 13, 2024

Improvements

Implemented a way to limit the number of query controller tasks running at the same time. The limit specifies what percentage or number of task slots can be allocated to query controllers. If the limit is reached, new controller tasks wait for resources instead of potentially blocking the execution of other tasks (and failing after a timeout).

Rationale

There is currently no mechanism in Druid to prevent the cluster from being overloaded with controller tasks. This can cause a significant slowdown in processing and may lead to temporary deadlock situations.

Introduced new configuration options

  • druid.indexer.queue.controllerTaskSlotRatio - an optional value that defines the proportion of available task slots that can be allocated to MSQ controller tasks. This is a floating-point value between 0 and 1. Defaults to null.
  • druid.indexer.queue.maxControllerTaskSlots - an optional value that specifies the maximum number of task slots that can be allocated to controller tasks. This is an integer value that provides a hard limit on the number of task slots available for MSQ controller tasks. Defaults to null. (See the example below.)
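
For illustration, a sketch of how these proposed properties might look in the Overlord's runtime.properties (the values here are arbitrary examples, not recommendations):

# Allow MSQ controller tasks to use at most 25% of available task slots
druid.indexer.queue.controllerTaskSlotRatio=0.25
# Or cap MSQ controller tasks at an absolute number of task slots
druid.indexer.queue.maxControllerTaskSlots=5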

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Member

@asdf2014 asdf2014 left a comment

Remove useless tail

Co-authored-by: Benedict Jin <asdf2014@apache.org>
@nozjkoitop
Contributor Author

Remove useless tail

Done, thanks

@kfaraz kfaraz requested a review from cryptoe August 16, 2024 02:50
Contributor

@kfaraz kfaraz left a comment

Currently, it could cause a significant slowdown in processing and may lead to temporary deadlock situations.

@nozjkoitop , rather than proceeding with a limit, I think we should try to figure out what is causing the slowdown in processing and/or the deadlock. Can you elaborate on this?

Once we have done an analysis of exactly what goes wrong when we have too many controller tasks and we have decided to impose a limit, it should not be done through the TaskQueue as done in this PR. Instead, it should be similar to the implementation of parallelIndexTaskSlotRatio and the config should most likely live in WorkerTaskRunnerConfig.

/**
 * The number of task slots that a parallel indexing task can take is restricted using this config as a multiplier.
 *
 * A value of 1 means no restriction on the number of slots ParallelIndexSupervisorTasks can occupy (default behaviour).
 * A value of 0 means ParallelIndexSupervisorTasks can occupy no slots.
 * Deadlocks can occur if all the task slots are occupied by ParallelIndexSupervisorTasks,
 * as no subtask would ever get a slot. Set this config to a value < 1 to prevent deadlocks.
 *
 * @return ratio of task slots available to a parallel indexing task at a worker level
 */
public double getParallelIndexTaskSlotRatio()
{
  return parallelIndexTaskSlotRatio;
}
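
For context, a minimal sketch (a hypothetical helper, not code from this PR or from Druid) of how a slot ratio like this might be translated into an absolute per-worker cap:

// Hypothetical helper: turn a slot ratio in [0, 1] into an absolute
// per-worker cap, clamped to the worker's total task capacity.
private static int slotCapForRatio(double slotRatio, int workerCapacity)
{
  int cap = (int) Math.floor(slotRatio * workerCapacity);
  return Math.max(0, Math.min(cap, workerCapacity));
}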

Contributor

@cryptoe cryptoe left a comment

I think this is a nice feature to have. Left some comments.
Thanks @nozjkoitop for taking this up.

@@ -440,7 +442,7 @@ private void manageInternalCritical(
notifyStatus(task, taskStatus, taskStatus.getErrorMsg());
continue;
}
if (taskIsReady) {
if (taskIsReady && !isControllerTaskLimitReached(task.getType(), true)) {
Contributor

Should it be limited to just MSQ controllers, or should other job types be covered as well? Maybe take a JSON as input, where the key is the taskType and the value is the limit/float?

Also, what is the user-facing behavior if a task is pending launch due to the limit?

  • Can we communicate to the user why the task is not launching?
  • If the cluster is totally starved, does the controller eventually time out and get removed from the queue?

@nozjkoitop
Contributor Author

Thanks @kfaraz, @cryptoe for your comments.
The most trivial deadlock scenario occurs when we queue a group of controller tasks but don't have task slots available for the actual workers: if every free slot is taken by controller tasks, none of their worker tasks can launch, so the tasks hang and eventually time out. Thanks for highlighting WorkerTaskRunnerConfig; it seems like a great place for this configuration. I've updated the behavior and merged it with parallelIndexTaskSlotRatio, so it's now more flexible. I've also added logging to inform the user why a task is pending.

Contributor

@kfaraz kfaraz left a comment

We need to ensure that we do not remove any of the existing properties or config fields.


public class WorkerTaskRunnerConfig
{
  @JsonProperty
  private String minWorkerVersion = "0";

  @JsonProperty
  private double parallelIndexTaskSlotRatio = 1;

  @JsonDeserialize(using = CustomJobTypeLimitsDeserializer.class)
  private Map<String, Number> customJobTypeLimits = new HashMap<>();
Contributor

Why do we need a custom deserializer?
Can't we just have a map from String to Double?

Contributor Author

It was added mostly for validation, since the idea is to accept both double and integer values, giving the option to specify not only a ratio but also an absolute limit.
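
For illustration, a rough sketch of the validation idea being described (the class name and exact checks are hypothetical, not necessarily what the PR uses):

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.DeserializationContext;
import com.fasterxml.jackson.databind.JsonDeserializer;
import com.fasterxml.jackson.databind.JsonNode;

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class TaskSlotLimitsDeserializer extends JsonDeserializer<Map<String, Number>>
{
  @Override
  public Map<String, Number> deserialize(JsonParser p, DeserializationContext ctxt) throws IOException
  {
    JsonNode root = p.getCodec().readTree(p);
    Map<String, Number> limits = new HashMap<>();
    Iterator<Map.Entry<String, JsonNode>> fields = root.fields();
    while (fields.hasNext()) {
      Map.Entry<String, JsonNode> entry = fields.next();
      JsonNode value = entry.getValue();
      if (value.isIntegralNumber() && value.asInt() >= 0) {
        // An integer is treated as an absolute slot limit for the task type.
        limits.put(entry.getKey(), value.asInt());
      } else if (value.isFloatingPointNumber() && value.asDouble() >= 0.0 && value.asDouble() <= 1.0) {
        // A double is treated as a ratio of the available task slots.
        limits.put(entry.getKey(), value.asDouble());
      } else {
        throw new IOException(
            "Limit for task type [" + entry.getKey() + "] must be a ratio in [0, 1] or a non-negative integer"
        );
      }
    }
    return limits;
  }
}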

@nozjkoitop nozjkoitop requested a review from kfaraz September 24, 2024 07:23
@@ -57,6 +63,7 @@ public ImmutableWorkerInfo(
@JsonProperty("worker") Worker worker,
@JsonProperty("currCapacityUsed") int currCapacityUsed,
@JsonProperty("currParallelIndexCapacityUsed") int currParallelIndexCapacityUsed,
@JsonProperty("currTypeSpecificCapacityUsed") Map<String, Integer> typeSpecificCapacityMap,
Contributor

This should be nullable, no?

Contributor Author

You're right, thanks

@@ -225,6 +253,89 @@ private int getWorkerParallelIndexCapacity(double parallelIndexTaskSlotRatio)
return workerParallelIndexCapacity;
}

public boolean canRunTask(Task task, Map<String, Number> taskLimits)
Contributor

Can you please add Javadocs for this method?

Contributor Author

Done

@@ -167,6 +167,29 @@ private int getCurrParallelIndexCapacityUsed(Map<String, TaskAnnouncement> tasks
return currParallelIndexCapacityUsed;
}

@JsonProperty("currTypeSpecificCapacityUsed")
public Map<String, Integer> getCurrTypeSpecificCapacityUsed()
Contributor

I thought we had deprecated the ZK-based runner in favour of HTTP.
@kfaraz Does this change still make sense?

Contributor

Yes, the ZK-based task runner is deprecated and we should not support the new feature with ZK.

@@ -1135,6 +1135,7 @@ The following configs only apply if the Overlord is running in remote mode. For
|`druid.indexer.runner.taskAssignmentTimeout`|How long to wait after a task has been assigned to a Middle Manager before throwing an error.|`PT5M`|
|`druid.indexer.runner.minWorkerVersion`|The minimum Middle Manager version to send tasks to. The version number is a string. This affects the expected behavior during certain operations like comparison against `druid.worker.version`. Specifically, the version comparison follows dictionary order. Use ISO8601 date format for the version to accommodate date comparisons. |"0"|
|`druid.indexer.runner.parallelIndexTaskSlotRatio`|The ratio of task slots available for parallel indexing supervisor tasks per worker. The specified value must be in the range `[0, 1]`.|1|
|`druid.indexer.runner.taskSlotLimits`|A map where each key is a task type and the corresponding value is the limit on the number of task slots that tasks of that type can occupy on a worker. The key is a `String` that specifies the task type. The value can be either a `Double` in the range `[0, 1]`, representing the ratio of available task slots that tasks of this type can occupy, or an `Integer` greater than or equal to 0, representing an absolute limit on the number of task slots that tasks of this type can occupy.|Empty map|
Contributor

Could you please provide an example as well?

Contributor

How does this interact with compaction slots?

Contributor Author

@nozjkoitop nozjkoitop Sep 24, 2024

Example added! Good catch on the compaction slots. I'll need to test that, but based on the code, it looks like compaction slot availability will be checked twice (in the duty and during worker selection) if a related entry is included in the taskSlotLimits map. If there's a conflict, some tasks might end up in a pending state for a while.

Contributor Author

Confirmed that the number of submitted tasks is linked to the compaction task slot limits. However, execution might be delayed if the custom limit is smaller than the one set for compaction.
(screenshot omitted)
For example, in the screenshot, taskSlotsMax = 3, but in the Overlord configuration I have druid.indexer.runner.taskSlotLimits={"compact": 2}.
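
For reference, a hedged example of what a fuller taskSlotLimits configuration might look like (the task-type keys below are illustrative):

druid.indexer.runner.taskSlotLimits={"query_controller": 0.25, "index_kafka": 10}

Here tasks of type query_controller would be capped at 25% of each worker's task slots (a Double ratio), while index_kafka tasks would be capped at 10 slots per worker (an Integer limit).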

Contributor

I think the limit on compaction tasks (or kill tasks for that matter) should not be a concern.
This is a runtime property, typically controlled by an admin.
So, if an admin wants to restrict the number of concurrent compaction tasks, it is fair to honor that irrespective of the value of compactionTaskSlotRatio or maxCompactionTaskSlots set in the coordinator dynamic configs.

We just need to call it out clearly in the release notes and the docs of the new property.

@nozjkoitop nozjkoitop requested a review from cryptoe September 24, 2024 14:02
@kfaraz
Contributor

kfaraz commented Oct 2, 2024

I'm afraid I don't have access to this Slack workspace

You can try this link https://druid.apache.org/community/join-slack from the Druid docs.

@nozjkoitop
Contributor Author

Hey @cryptoe, what do you think about this solution? I've used SelectWorkerStrategies to utilize dynamic configuration of global limits, and it seems like a good option to me.

@nozjkoitop nozjkoitop requested a review from cryptoe October 10, 2024 10:38
@cryptoe
Contributor

cryptoe commented Nov 21, 2024

Apologies, I have been meaning to get to this PR. Will try to finish it by EOW.


This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Jan 21, 2025
@nozjkoitop
Contributor Author

Hi @cryptoe, will you have some time to review these changes?

@github-actions github-actions bot removed the stale label Jan 22, 2025
@cryptoe
Contributor

cryptoe commented Mar 4, 2025

@nozjkoitop Could you please rebase this PR? Will take a look again. Overall LGTM.
cc @kfaraz

@nozjkoitop
Contributor Author

@cryptoe Done

@kfaraz
Contributor

kfaraz commented Mar 5, 2025

Thanks for your work on this, @nozjkoitop !
I will try to finish the review of this PR later this week.

@@ -225,6 +244,13 @@ private int getWorkerParallelIndexCapacity(double parallelIndexTaskSlotRatio)
return workerParallelIndexCapacity;
}

public Map<String, Integer> incrementTypeSpecificCapacity(String type, int capacityToAdd)
Contributor

Should there be a corresponding decrement?

@@ -57,6 +61,7 @@ public ImmutableWorkerInfo(
@JsonProperty("worker") Worker worker,
@JsonProperty("currCapacityUsed") int currCapacityUsed,
@JsonProperty("currParallelIndexCapacityUsed") int currParallelIndexCapacityUsed,
@JsonProperty("currCapacityUsedByTaskType") Map<String, Integer> currCapacityUsedByTaskType,
Contributor

Or this should be immutable, no?

Contributor Author

@nozjkoitop nozjkoitop Mar 5, 2025

It is immutable; incrementTypeSpecificCapacity doesn't change the fields. This typeSpecificCapacity is managed exactly like parallelIndexCapacityUsed, which is computed as immutableWorker.getCurrParallelIndexCapacityUsed() + parallelIndexTaskCapacity in the ImmutableWorkerInfo constructor arguments, except that here the incremented value is created in ImmutableWorkerInfo itself. Regarding a decrement, I don't see one for currParallelIndexCapacityUsed or currCapacityUsed either; if I'm not mistaken, that is managed by the provisioning strategy.
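
A minimal sketch of the copy-on-increment behaviour being described (the method name and field follow the PR, but the body here is illustrative, not the actual implementation):

public Map<String, Integer> incrementTypeSpecificCapacity(String type, int capacityToAdd)
{
  // Copy rather than mutate: the returned map is handed to the constructor of the
  // next ImmutableWorkerInfo snapshot, so this instance's state never changes.
  Map<String, Integer> updated = new HashMap<>(currCapacityUsedByTaskType);
  updated.merge(type, capacityToAdd, Integer::sum);
  return updated;
}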

@nozjkoitop nozjkoitop requested a review from cryptoe March 5, 2025 14:55
@cryptoe
Contributor

cryptoe commented Mar 6, 2025

Thanks for the changes.
Let's wait for @kfaraz's review and then we can get this merged.

Contributor

@kfaraz kfaraz left a comment

Left some suggestions.

Contributor

@kfaraz kfaraz left a comment

Thanks for the changes, @nozjkoitop !

@kfaraz kfaraz merged commit 8b56824 into apache:master Mar 6, 2025
75 checks passed