[ci] Explore using AWS machines for macOS CI builds #22700
Comments
After further investigation, I have doubts about the feasibility of this. Happy to discuss more f2f, but documenting my thoughts and sources below. The main issue we're going to run into is latency, due to the so-called "scrubbing workflow" that a dedicated host enters (source):
Also, spin up time for an instance isn't terrible, but isn't great either:
Finally, some notes on billing:
So, the existing workflow of "spin up instance from image > run job > terminate instance" becomes a problem with this wait time on the dedicated host. At least some of the cost is alleviated by the last bullet point there, but the latency becomes the tradeoff. You can only run one instance per dedicated host at a time (source).
We could keep instances running all the time and have Jenkins connect to them, as opposed to spinning them up/down when starting/stopping a job, but the numbers don't add up in that case: we currently have four image configurations (Sequoia/Sonoma x unprovisioned/provisioned), and would only be able to keep two instances up.
Idea: if we know we wouldn't be charged for a full 24 hours x 7 days due to this latency, could we afford to allocate a third (or even fourth) dedicated host?
Perhaps I should reframe the issue: if you add the 50 minutes of spin-down time and ~10 minutes of spin-up time to the ~30 minute build, then end-to-end would be roughly 90 minutes. So it's not the great improvement that we thought it would be, but it's still comparable to the current solution and still cheaper if we only run <=2 hosts (especially if we aren't charged for those 50 minutes).
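For reference, the arithmetic behind that estimate can be sketched as below. The durations are the ones quoted in this thread; the function names are made up for illustration, and the per-day figure is only an upper bound under the assumption that every job pays the full spin-up and spin-down cost on a single dedicated host.

```python
# Durations quoted in this thread, in minutes.
SPIN_DOWN_MIN = 50  # dedicated host wait after terminating an instance
SPIN_UP_MIN = 10    # approximate instance spin-up time
BUILD_MIN = 30      # clean build on mac2-m2pro.metal (from the issue description)
CURRENT_HOSTING_BUILD_MIN = 85  # clean build on current hosting

MINUTES_PER_DAY = 24 * 60


def end_to_end_minutes(spin_up, build, spin_down):
    """Total time one job occupies a dedicated host (hypothetical helper)."""
    return spin_up + build + spin_down


def max_cycles_per_day(cycle_minutes):
    """Whole spin-up/build/spin-down cycles that fit in 24 hours on one host."""
    return MINUTES_PER_DAY // cycle_minutes


cycle = end_to_end_minutes(SPIN_UP_MIN, BUILD_MIN, SPIN_DOWN_MIN)
print(f"end-to-end: {cycle} min (current hosting: {CURRENT_HOSTING_BUILD_MIN} min)")
print(f"at most {max_cycles_per_day(cycle)} such cycles per host per day")
```

With these inputs the cycle is 90 minutes, so a single host could complete at most 16 such cycles in a day if jobs ran strictly back-to-back.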
The progress is good, but to help make an overall decision we should still take some measurements. It's not clear how often the worst-case delays will happen.
(1) When running multiple unprovisioned jobs back-to-back on the same dedicated host, in practice what kind of latency and throughput can we achieve?
(a) How long is the time from the build becoming the head of the queue to its first Drake setup script inside the image being run?
The data for (a+b+c) will give us throughput -- how many builds per day we can run. It's possible that a--c will be different for the first build of the day versus being run back-to-back repeatedly. In the experiment we should run 4+ builds back-to-back on the same dedicated host to see if it improves upon repetition. For unprovisioned, we care more about throughput than we do latency.
(2) Ditto, but for provisioned jobs. For provisioned, we care about both latency and throughput.
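Once the back-to-back runs are done, the totals could be summarized with something like the sketch below. It assumes each entry is the measured total (a+b+c) for one build, in minutes; the sample values are placeholders and not real measurements, and the helper names are hypothetical.

```python
from statistics import mean

MINUTES_PER_DAY = 24 * 60


def builds_per_day(cycle_minutes):
    """Throughput implied by the mean measured per-build total on one host."""
    return MINUTES_PER_DAY / mean(cycle_minutes)


def improves_on_repetition(cycle_minutes):
    """True if every later build was at least as fast as the first one."""
    first, rest = cycle_minutes[0], cycle_minutes[1:]
    return all(t <= first for t in rest)


# Placeholder totals for four back-to-back builds (NOT measurements).
sample = [90.0, 80.0, 80.0, 70.0]

print(f"throughput: {builds_per_day(sample):.1f} builds/day")
print(f"faster after the first build: {improves_on_repetition(sample)}")
```

Keeping the first build separate from the rest makes it easy to see whether the first build of the day behaves differently from the repeated ones.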
Is your feature request related to a problem? Please describe.
We aim to have CI with high uptime, low rate of infrastructure flakes, and low time investment to maintain the infrastructure. Given a growing number of unsolved recent problems with our current hosting (#22587, #22460, #21971) and the pricing of the brittle solution, it's time to explore alternatives.
Describe the solution you'd like
An early experiment with EC2 macOS images shows promise. A clean build on mac2.metal takes around 55 minutes, and mac2-m2pro.metal takes around 30 minutes. For reference, a clean build on current hosting takes around 85 minutes. Given the pricing, mac2-m2pro.metal seems like the best fit.
We should try spinning up a single test job like aws-mac-arm-sequoia-unprovisioned-clang-bazel-experimental-release on mac2-m2pro.metal as a prototype, to identify what challenges we might face in porting to the new solution. The idea would be to do the setup by hand while keeping a notebook of what we did, so that in the future if we decide to do a full port we have the notebook to help guide the future automation.
I suggest sequoia for the prototype because that's our forward-looking OS version, and is the one we're commonly struggling with on current hosting.
The test job should NOT use the remote cache.
Note that the instances are priced per day, so be sure to set an instance cap of <= 1 during prototyping so that we never launch two by mistake.
Describe alternatives you've considered
We looked at GHA runners as an option. The free instances are too slow to be good for much, and the paid instances are approximately 10x more expensive (hourly) than our current solution. Not viable.
We could try GHA self-hosted runners, but maintaining a cluster of macs + VPN fails the "low time investment" criterion.
If we can't find a good hosting solution, then we could also reduce our macOS support footprint. We could imagine only running packaging/wheel builds, never any bazel (test) build flavors. That would probably fit on free GHA.
Additional context
N/A