[ci] Explore using AWS machines for macOS CI builds #22700

jwnimmer-tri · 2025-03-04T15:46:56Z

Is your feature request related to a problem? Please describe.

We aim to have CI with high uptime, low rate of infrastructure flakes, and low time investment to maintain the infrastructure. Given a growing number of unsolved recent problems with our current hosting (#22587, #22460, #21971) and the pricing of the brittle solution, it's time to explore alternatives.

Describe the solution you'd like

An early experiment with EC2 macOS images shows promise. A clean build on mac2.metal takes around 55 minutes, and mac2-m2pro.metal takes around 30 minutes. For reference, a clean build on current hosting takes around 85 minutes. Given the pricing, mac2-m2pro.metal seems like the best fit.

We should try spinning up a single test job like aws-mac-arm-sequoia-unprovisioned-clang-bazel-experimental-release on mac2-m2pro.metal as a prototype, to identify what challenges we might face in porting to the new solution. The idea would be to do the setup by hand while keeping a notebook of what we did, so that in the future if we decide to do a full port we have the notebook to help guide the future automation.

I suggest sequoia for the prototype because that's our forward-looking OS version, and is the one we're commonly struggling with on current hosting.

The test job should NOT use the remote cache.

Note that the instances are priced per day so be sure to set an instance cap of <= 1 during prototyping, so that we never launch two by mistake.

Describe alternatives you've considered

We looked at GHA runners as an option. The free instances are too slow to be good for much, and the paid instances are approximately 10x more expensive (hourly) than our current solution. Not viable.

We could try GHA self-hosted runners, but maintaining a cluster of macs + VPN fails the "low time investment" criterion.

If we can't find a good hosting solution, then we could also reduce our macOS support footprint. We could imagine only running packaging/wheel builds, never any bazel (test) build flavors. That would probably fit on free GHA.

Additional context

N/A


tyler-yankee · 2025-03-10T17:18:03Z

After further investigation, I have doubts about the feasibility of this. Happy to discuss more f2f, but documenting my thoughts and sources below.

The main issue we're going to run into is latency, due to the so-called "scrubbing workflow" that a dedicated host enters (source):

In case of an EC2 Mac instance, stopping or terminating a Mac instance initiates a scrubbing workflow of the underlying Dedicated Host, during which it enters the pending state. This scrubbing workflow includes tasks such as erasing the internal SSD, resetting NVRAM, and more, and it can take up to 50 minutes to complete.

Also, spin up time for an instance isn't terrible, but isn't great either:

For an AWS vended AMI with a x86 Mac instance or a Apple silicon Mac instance, the launch time can range from approximately 6 minutes to 20 minutes.

Finally, some notes on billing:

- Billing for the dedicated hosts is per-second for the host, regardless of instances (source).
- You aren't charged for the "scrubbing workflow" time, or any time in general that the host is not available (source).

So, the existing workflow of: spin up instance from image > run job > terminate instance becomes a problem with this wait time on the dedicated host. At least some of the cost is alleviated by the last bullet point there, but the latency becomes the tradeoff.

You can only run one instance per dedicated host at a time (source). We could keep instances running all the time and have Jenkins connect to them, as opposed to spinning them up/down when starting/stopping a job, but the numbers don't add up in that case: we currently have four image configurations (Sequoia/Sonoma x unprovisioned/provisioned), and would only be able to keep two instances up.

Idea: if we know we wouldn't be charged for a full 24 hours x 7 days due to this latency, could we afford to allocate a third (or even fourth) dedicated host?
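To put a rough number on that idea, here is a minimal back-of-the-envelope sketch. It assumes that launch and build time are billed, that scrubbing time is not (per the billing notes above), that every job ends by terminating the instance, and that the host never sits idle between cycles. The function name and the choice of worst-case figures are illustrative only.

```python
def billed_fraction(billed_minutes: float, unbilled_minutes: float) -> float:
    """Fraction of one job cycle during which the host is billed."""
    return billed_minutes / (billed_minutes + unbilled_minutes)


# Figures quoted in this thread, in minutes.
LAUNCH_WORST = 20  # upper end of the quoted launch time range
BUILD_M2PRO = 30   # clean build on mac2-m2pro.metal
SCRUB_WORST = 50   # upper bound on the scrubbing workflow

# Assumption: launch + build are billed, scrubbing is not, and there is
# no idle (billed) time between one cycle and the next.
fraction = billed_fraction(LAUNCH_WORST + BUILD_M2PRO, SCRUB_WORST)

# Number of hosts that would cost the same as 2 hosts billed around the clock.
equivalent_hosts = 2 / fraction

print(f"billed fraction per cycle: {fraction:.2f}")
print(f"hosts for the price of two always-on hosts: {equivalent_hosts:.1f}")
```

This is a best case: any idle time while the host is available would be billed, and the sketch ignores any minimum allocation period that may apply to the host.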

tyler-yankee · 2025-03-10T17:51:05Z

Perhaps I should reframe the issue: if you add that 50 minutes of spin-down time and ~10 minutes of spin-up time to the ~30 minute build on mac2-m2pro.metal, then end-to-end would be roughly 90 minutes. So not the great improvement that we thought it would be, but still comparable to the ~85 minute build on the current solution, and still cheaper if we only run <=2 hosts (especially if we aren't charged for those 50 minutes).
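The estimate above can be reproduced with a short sketch. It assumes the 30 minute mac2-m2pro.metal clean build from the issue description and the worst-case 50 minute scrub, and also shows the spread implied by the quoted 6 to 20 minute launch range. The constant and function names are illustrative.

```python
# All figures are in minutes and come from this thread.
BUILD_M2PRO = 30        # clean build on mac2-m2pro.metal
BUILD_CURRENT = 85      # clean build on current hosting
SCRUB_WORST = 50        # upper bound on the scrubbing workflow
LAUNCH_RANGE = (6, 20)  # quoted launch time range
LAUNCH_ASSUMED = 10     # the ~10 minute spin-up figure used above


def end_to_end(build: int, launch: int, scrub: int) -> int:
    """Wall-clock minutes for launch + build + host scrubbing."""
    return build + launch + scrub


estimate = end_to_end(BUILD_M2PRO, LAUNCH_ASSUMED, SCRUB_WORST)
low = end_to_end(BUILD_M2PRO, LAUNCH_RANGE[0], SCRUB_WORST)
high = end_to_end(BUILD_M2PRO, LAUNCH_RANGE[1], SCRUB_WORST)

print(f"estimate: {estimate} min (range {low}-{high} min)")
print(f"current hosting clean build: {BUILD_CURRENT} min")
```

Note that the 85 minute figure for current hosting covers only the clean build, so the two numbers are not a like-for-like comparison.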

jwnimmer-tri · 2025-03-10T20:32:38Z

The progress is good, but to help make an overall decision we should still take some measurements. It's not clear how often the worst-case delays will happen.

(1) When running multiple unprovisioned jobs back-to-back on the same dedicated host, in practice what kind of latency and throughput can we achieve?

(a) How long is the time from the build becoming the head of the queue to its first Drake setup script inside the image being run?
(b) How much longer until the build result (pass / fail) is made available?
(c) How much longer until the next build starts?

The data for (a+b+c) will give us throughput -- how many builds per day we can run.
The data for (a+b) will give us developer latency -- from the time of a request, how long until the answer.
The data for (a) will give us a sense of warm-up time.

It's possible that (a) through (c) will be different for the first build of the day versus builds run back-to-back repeatedly. In the experiment we should run 4+ builds back-to-back on the same dedicated host to see whether the timings improve upon repetition.

For unprovisioned, we care more about throughput than we do latency.
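As a minimal sketch of how the measurements in (1) could be reduced to the three quantities described above, assuming each build is recorded as three durations in minutes. The `BuildTiming` class and the sample values are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass
class BuildTiming:
    a: float  # head of queue -> first Drake setup script runs
    b: float  # setup script runs -> pass/fail result available
    c: float  # result available -> next build starts

    @property
    def warm_up(self) -> float:
        return self.a

    @property
    def latency(self) -> float:
        # Developer latency: time from request to answer.
        return self.a + self.b

    @property
    def cycle(self) -> float:
        # Full time the host is occupied by one build.
        return self.a + self.b + self.c


def builds_per_day(timings: list[BuildTiming]) -> float:
    """Throughput for one host, based on the mean cycle time."""
    mean_cycle = sum(t.cycle for t in timings) / len(timings)
    return MINUTES_PER_DAY / mean_cycle


# Placeholder values only (NOT measurements): four back-to-back builds,
# with a slower warm-up on the first build of the day.
samples = [
    BuildTiming(a=20, b=30, c=50),
    BuildTiming(a=10, b=30, c=50),
    BuildTiming(a=10, b=30, c=50),
    BuildTiming(a=10, b=30, c=50),
]

for i, t in enumerate(samples, start=1):
    print(f"build {i}: warm-up {t.warm_up}, latency {t.latency}, cycle {t.cycle}")
print(f"throughput: {builds_per_day(samples):.1f} builds/day/host")
```

Keeping per-build records rather than a single average makes it easy to see whether the first build of the day differs from later ones.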

(2) Ditto, but for provisioned jobs.

For provisioned, we care about both latency and throughput.

jwnimmer-tri added the "type: feature request" and "component: continuous integration" (Jenkins, CDash, mirroring of externals, website infrastructure) labels Mar 4, 2025

jwnimmer-tri assigned BetsyMcPhail Mar 4, 2025

jwnimmer-tri added this to build system, continuous integration, distribution Mar 4, 2025

github-project-automation bot moved this to On deck in build system, continuous integration, distribution Mar 4, 2025

jwnimmer-tri moved this from On deck to To do in build system, continuous integration, distribution Mar 4, 2025

jwnimmer-tri added the priority: high label Mar 4, 2025

tyler-yankee self-assigned this Mar 4, 2025

tyler-yankee moved this from To do to In progress in build system, continuous integration, distribution Mar 4, 2025
