Make TES task scheduling activities more scalable #138

Closed
BMurri opened this issue Mar 19, 2023 · 3 comments
Labels
enhancement New feature or request Scalability Enable users can scale TES workloads TES Priority: P1 Groomed to a Priority 1 issue

Comments

@BMurri
Collaborator

BMurri commented Mar 19, 2023

Problem:
The task update loop (currently the single largest workload in TES) scales exponentially with the number of active tasks, based on measurements.

Solution:
The task update loop should scale linearly. See below for the suggested solution.

Describe alternatives you've considered
See microsoft/CromwellOnAzure#497. This issue will replace that issue.

Additional context
Currently, the Scheduler service passes each active TesTask to BatchScheduler, which makes various API calls to determine the state of that task in Batch, then uses that state to perform operations on behalf of the task and/or alter the task's state. This is done serially, one task at a time.

The proposed solution is instead to perform the following operations in the following order (assignment of responsibilities TBD):

  1. In Autopool mode, query all jobs and filter by TesTaskIds; in auto-scale mode, query all pools and jobs, filtering by known pool ids. In both cases, capture all task and compute node states at that time. Using list queries to retrieve Batch state will drastically reduce the number of Batch API calls when there are large numbers of active TesTasks, without overloading the system when there are very few. Retrieving this state is the part that causes the most load in TES when scaling up workloads.
  2. For each TesTask, take the information retrieved in step 1 and run each state/task pairing through the current state machine, except that any operations that can be combined across tasks will be combined (to reduce Azure API load).
  3. Update all tasks in the DB.

Examples of operations in step 2 that could be combined include creating pools/jobs (in auto-scale mode, since they are sharable) and retrieving image container credentials. The means of combining operations is TBD.
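
The three steps above might be sketched roughly as follows. This is a minimal illustration in Python, not the TES implementation: `FakeBatchClient`, `next_state`, `update_tasks`, and the toy state transitions are all hypothetical names and assumptions made up for this example, and the real means of combining operations is still TBD.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TesTask:
    id: str
    state: str  # e.g. "QUEUED", "INITIALIZING", "RUNNING", "COMPLETE"


class FakeBatchClient:
    """Hypothetical stand-in for the Batch API that counts calls made."""

    def __init__(self, job_states):
        self.job_states = job_states  # job id -> Batch-side state
        self.calls = 0

    def list_jobs(self):
        # One list call returns the state of every job.
        self.calls += 1
        return dict(self.job_states)

    def create_jobs(self, task_ids):
        # One combined call creates jobs for many tasks (assumed to be possible).
        self.calls += 1
        for task_id in task_ids:
            self.job_states[task_id] = "active"


def next_state(task, batch_state):
    """Toy state machine: returns (new_state, combinable_operation_or_None)."""
    if batch_state is None:
        return "INITIALIZING", "create_job"
    if batch_state == "active":
        return "RUNNING", None
    if batch_state == "completed":
        return "COMPLETE", None
    return task.state, None


def update_tasks(tasks, batch, db):
    # Step 1: take one snapshot of Batch state, filtered to known TesTask ids.
    known_ids = {task.id for task in tasks}
    snapshot = {k: v for k, v in batch.list_jobs().items() if k in known_ids}

    # Step 2: run each state/task pairing through the state machine,
    # collecting operations that can be combined across tasks.
    pending = defaultdict(list)
    for task in tasks:
        task.state, operation = next_state(task, snapshot.get(task.id))
        if operation is not None:
            pending[operation].append(task.id)
    if pending["create_job"]:
        batch.create_jobs(pending["create_job"])

    # Step 3: update all tasks in the DB in one bulk write (db is a plain dict here).
    db.update({task.id: task.state for task in tasks})
```

In this sketch the number of Batch calls per loop iteration depends on the number of distinct operation types rather than on the number of active tasks, which is the property the proposal is after.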

@BMurri BMurri added the enhancement New feature or request label Mar 19, 2023
@BMurri BMurri self-assigned this Mar 19, 2023
@MattMcL4475 MattMcL4475 changed the title Make TES task scheduling activities reasonably scalable. Make TES task scheduling activities more scalable Apr 7, 2023
BMurri added a commit that referenced this issue May 18, 2023
@ngambani ngambani added this to the 4.5.0 milestone Jun 16, 2023
@ngambani

ngambani commented Oct 5, 2023

@BMurri is this fixed with the most recent PR #442 ?

@ngambani ngambani added the Scalability Enable users can scale TES workloads label Oct 5, 2023
@BMurri
Collaborator Author

BMurri commented Oct 5, 2023

@ngambani We'll know once I've scale-tested it.

@ngambani ngambani added the tobegroomed Add this label while creating new issues to get issues prioritized on the backlog label Oct 11, 2023
@ngambani ngambani removed the tobegroomed Add this label while creating new issues to get issues prioritized on the backlog label Oct 26, 2023
@MattMcL4475 MattMcL4475 added the TES Priority: P1 Groomed to a Priority 1 issue label Dec 11, 2023
@MattMcL4475 MattMcL4475 removed this from the 4.5.0 milestone Dec 11, 2023
@BMurri
Collaborator Author

BMurri commented Jan 29, 2024

With the node runner, we now have events

@BMurri BMurri closed this as not planned Won't fix, can't repro, duplicate, stale Jan 29, 2024

3 participants