Try scheduling as much as available #4528

Merged
merged 10 commits into from
Dec 20, 2024

Conversation

jonathanmetzman
Collaborator

Instead of trying to schedule a small number of fuzz tasks every 2 minutes and hoping this leads to fuzzing at full capacity, just schedule almost the full amount at once.

Collaborator

@vitorguidi vitorguidi left a comment


lgtm

Previously, it would schedule only about 1,500 tasks unless the regions were totally full. Now we will schedule up to 15K tasks. We will also take into account batch's queueing (there could be other reasons for queueing besides CPU quota, though there shouldn't be) and tasks that were already scheduled but not yet sent to batch or preprocessed, so we don't overload the queue.
# TODO(metzman): This doesn't distinguish between fuzz and non-fuzz
# tasks (nor preemptible and non-preemptible CPUs). Fix this.
waiting_tasks = sum(
    batch.count_queued_or_scheduled_tasks(project, region)
    for region in regions)
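
A minimal sketch of the arithmetic described above, with hypothetical names (CPUS_PER_FUZZ_TASK, MAX_SCHEDULED_TASKS, and num_tasks_to_schedule are assumptions, not ClusterFuzz identifiers): subtract the CPUs that already-queued or already-scheduled work will soon consume before deciding how many new tasks to publish.

```python
# Hypothetical sketch; constants and helper names are assumptions,
# not ClusterFuzz's actual code.
CPUS_PER_FUZZ_TASK = 1
MAX_SCHEDULED_TASKS = 15_000


def num_tasks_to_schedule(available_cpus: int, waiting_tasks: int) -> int:
  """How many new fuzz tasks to publish without overloading the queue."""
  # CPUs that queued-or-scheduled (but not yet running) tasks will consume.
  soon_committed_cpus = waiting_tasks * CPUS_PER_FUZZ_TASK
  usable_cpus = max(available_cpus - soon_committed_cpus, 0)
  return min(usable_cpus // CPUS_PER_FUZZ_TASK, MAX_SCHEDULED_TASKS)
```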
Collaborator

@vitorguidi vitorguidi Dec 19, 2024


One option to simulate these behaviors is https://simpy.readthedocs.io/ (see also https://brooker.co.za/blog/2022/04/11/simulation.html).

It is hard to imagine what these policies imply from the description alone.
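
For instance, a toy SimPy model of the two policies might look like this (capacity, task duration, and intervals are assumed numbers, not ClusterFuzz's real parameters):

```python
# Toy SimPy sketch (assumed numbers, not ClusterFuzz's) comparing
# "schedule a small batch often" vs. "schedule almost everything at once".
import simpy

CAPACITY = 15_000   # total CPUs; assume one CPU per fuzz task
TASK_MINUTES = 60   # assumed fuzz task duration


def scheduler(env, interval, batch_limit, samples):
  in_flight = [0]  # mutable cell shared with the task generator

  def task(env):
    yield env.timeout(TASK_MINUTES)
    in_flight[0] -= 1

  while True:
    # Schedule up to the batch limit, bounded by free capacity.
    to_schedule = min(CAPACITY - in_flight[0], batch_limit)
    for _ in range(to_schedule):
      in_flight[0] += 1
      env.process(task(env))
    samples.append(in_flight[0])
    yield env.timeout(interval)


for name, interval, limit in [('small batches', 2, 1_500),
                              ('full amount', 10, 15_000)]:
  env = simpy.Environment()
  samples = []
  env.process(scheduler(env, interval, limit, samples))
  env.run(until=240)
  print(f'{name}: mean utilization '
        f'{sum(samples) / (len(samples) * CAPACITY):.0%}')
```

With these numbers, the small-batch policy needs roughly 20 minutes (ten rounds of 1,500) to reach full capacity, while the all-at-once policy reaches it on the first round.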

logs.info(f'Soon committed CPUs: {soon_commited_cpus}')
available_cpus = sum(
get_available_cpus_for_region(project, region) for region in regions)
available_cpus = max(available_cpus - soon_commited_cpus, 0)
Collaborator


Now that we take the queue size into account, we can go back to running this very frequently, right?

Collaborator Author


Hmmm... I had the opposite thought: because we can schedule so many more at once, there's no need to run it so often. I think there can be a slight delay between publishing and reaching the queue, so probably something above 5 minutes makes the most sense.

Collaborator

@vitorguidi vitorguidi left a comment


lgtm. Offered a design-time option to anticipate the real-world behavior of these batch scheduling policies.

@jonathanmetzman jonathanmetzman merged commit d222215 into oss-fuzz Dec 20, 2024
3 checks passed
vitorguidi added a commit that referenced this pull request Dec 28, 2024
@vitorguidi vitorguidi mentioned this pull request Dec 28, 2024
vitorguidi added a commit that referenced this pull request Dec 28, 2024
The preprocess count for fuzz tasks went to zero after #4564 got deployed; reverting.

#4528 is also being reverted because it introduced the following error
into the fuzz task scheduler, which caused fuzz tasks to stop being
scheduled:

```
Traceback (most recent call last):
  File "/mnt/scratch0/clusterfuzz/src/python/bot/startup/run_cron.py", line 68, in <module>
    sys.exit(main())
             ^^^^^^
  File "/mnt/scratch0/clusterfuzz/src/python/bot/startup/run_cron.py", line 64, in main
    return 0 if task_module.main() else 1
                ^^^^^^^^^^^^^^^^^^
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/cron/schedule_fuzz.py", line 304, in main
    return schedule_fuzz_tasks()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/cron/schedule_fuzz.py", line 284, in schedule_fuzz_tasks
    available_cpus = get_available_cpus(project, regions)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/cron/schedule_fuzz.py", line 247, in get_available_cpus
    result = pool.starmap_async(  # pylint: disable=no-member
             ^^^^^^^^^^^^^^^^^^
AttributeError: 'ProcessPoolExecutor' object has no attribute 'starmap_async'

```
jonathanmetzman added a commit that referenced this pull request Dec 30, 2024
Only #4565 was broken; #4528 is actually needed to prevent congestion. Fix the issue that the combination of them caused: Python has too many parallelism APIs.
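
The mismatch, in a standalone sketch (get_cpus and its arguments are stand-ins, not the real ClusterFuzz code): multiprocessing.Pool offers starmap_async, while concurrent.futures.ProcessPoolExecutor does not, so the call has to be translated when switching pool types.

```python
# Sketch of the API mismatch behind the traceback above:
# multiprocessing.Pool has starmap_async; ProcessPoolExecutor does not.
import concurrent.futures
import multiprocessing


def get_cpus(project, region):
  """Stand-in for get_available_cpus_for_region."""
  return 0


if __name__ == '__main__':
  args = [('my-project', 'us-central1'), ('my-project', 'us-east1')]

  # multiprocessing.Pool provides starmap_async...
  with multiprocessing.Pool() as pool:
    print(pool.starmap_async(get_cpus, args).get())

  # ...but concurrent.futures.ProcessPoolExecutor does not; the rough
  # equivalent unpacks each argument tuple via submit().
  with concurrent.futures.ProcessPoolExecutor() as executor:
    futures = [executor.submit(get_cpus, *a) for a in args]
    print([f.result() for f in futures])
```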
jonathanmetzman added a commit that referenced this pull request Jan 8, 2025
jonathanmetzman pushed a commit that referenced this pull request Jan 8, 2025
jonathanmetzman added a commit that referenced this pull request Jan 8, 2025