
[GCS] Feature req: Use CustomTime attribute to record when cached entry was last hit (for eviction/cleanup rules) #2318

Open
whisperity opened this issue Jan 17, 2025 · 3 comments · May be fixed by #2337


@whisperity

Hey!

We are using a Google Cloud Storage bucket through sccache to cache across CI compilations of LLVM. Unfortunately, it looks like none of the cloud provider implementations allow for keeping the size of the cache in check; that is only possible for local-disk caching (which we cannot do, as the CI machines are ephemeral VMs).

In order to keep the costs of this shared cache reasonable, we implemented an Age lifecycle rule in GCS: if a cached entry is too old, GCS will deal with deleting it.
According to GCS, an Age rule is matched as:

Age is counted from when an object was uploaded to the current bucket.

This presents a problem with "death waves": every $n$ days after an initial CI run has populated the cache with about $6,000$ translation units, they all get deleted, and the subsequent CI run essentially has to do an almost full rebuild. The only cache hits at $\ge n$ days will be for the more recently modified content, which is often the smaller part of the files in the project.

It would be great if sccache with a cloud bucket could behave in the same LRU way as ccache does locally. Doing a full LRU is problematic without handling size limits, but it seems there is another lifecycle property that can be used, at least in the case of GCS: CustomTime.

LRU-like behaviour could be simulated by setting the Days since custom time lifecycle rule (instead of Age), ensuring that only files that have not had a cache hit for $n$ days are evicted from the bucket.
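For reference, such a rule can be configured on the bucket roughly like this (a sketch; the bucket name and the 90-day value are placeholders, not something sccache itself would manage):

```sh
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": { "type": "Delete" },
      "condition": { "daysSinceCustomTime": 90 }
    }
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://my-sccache-bucket
```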
However, sccache does not populate this field at all:

Details view of a file in an sccache bucket, showing that Custom time is empty.

It seems from the documentation that this is a simple timestamp field that can be populated through the API, so all we need is one simple additional HTTP request that tells Google to update this field to the current time once the caching logic successfully hits a file.
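For illustration, the refresh described above would be a single objects.patch call against the JSON API; a sketch (bucket and object names are placeholders, and note that GCS only allows customTime to be moved forward in time once set):

```sh
# Sketch: set customTime to "now" on one cached object after a cache hit.
BUCKET=my-sccache-bucket
OBJECT=f%2Fa%2F3%2Ffa3...   # URL-encoded object name (placeholder, truncated)
curl -s -X PATCH \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d "{\"customTime\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}" \
  "https://storage.googleapis.com/storage/v1/b/${BUCKET}/o/${OBJECT}"
```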


It seems like similar metadata options are available for other cloud providers, such as S3 or Azure, but I have no idea or experience whether these could be used for controlling the cache's lifecycle as effectively as CustomTime can be for GCS.

@Xuanwo
Collaborator

Xuanwo commented Feb 20, 2025

Hi, @whisperity, thanks for suggesting this. I understand your use cases, but I feel that using CustomTime here is not a good idea and might be more expensive than simply setting a higher age for cache eviction.

Let's take GCS us-east1 as an example, so we have the following prices:

  • Standard storage (per GB per month): $0.020
  • Operation Class A (per 1000 operations): $0.0050
  • Operation Class B (per 1000 operations): $0.0004
  • storage.*.patch (the API we would use to update CustomTime) is Class A, the same as storage.*.insert

Now we have two ways:

  • Way A: set a longer age, e.g. increase it from 30 days to 360 days, so we only trigger the death wave once per year.
  • Way B: update CustomTime on every read request, so we can clean up objects that have not been accessed in time.

Way A adds more storage cost, while Way B adds more Class A requests.

Let's perform the estimation in the simplest way: we add roughly N GB of data to the cache each month, and we make X cache requests per month. After a full year, the cost of our cache storage will stabilize, with a maximum of 12N GB of data retained. The cost increase (removing the common cost) will be:

  • Way A: 12N * 0.020 (we store more data because the age limit is larger)
  • Way B: N * 0.020 + 12X / 1000 * 0.0050 (we only need to store one month of data, but we pay for every PATCH request)

We will have the following ratio: Way A cost > Way B cost => N/X > (12 / 1000 * 0.0050) / (0.020 * 11) => N/X > 0.000273.
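Spelled out with the same formulas and prices:

$$
12N \cdot 0.020 > N \cdot 0.020 + \frac{12X}{1000} \cdot 0.0050
\;\Longrightarrow\;
11N \cdot 0.020 > 0.00006\,X
\;\Longrightarrow\;
\frac{N}{X} > \frac{0.00006}{0.22} \approx 0.000273
$$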

Each month, adding more data becomes cost-efficient as long as the data added (N) stays lower than requests * 0.000273 GB. At the level of 1M requests per month, it's 273GB; at the level of 1G requests per month, it's 273TB. The more requests we have, the more cost we save by just storing the data longer.

Hoping this can contribute to our discussion.

@whisperity
Author

Hey @Xuanwo! Thank you for the quick reply and the detailed analysis! I must admit this is information that I did not originally consider, and, in hindsight, it must have been stupid of me not to think that the cloud provider would nickel & dime us for something as simple as a timestamp… D'oh!

I will come back with some more concrete pricing information tomorrow, because all the actual testing of this feature (albeit with a much smaller project than what we use the CI for!) was done yesterday and it takes GCP about 1.5 days to fully stabilise the billing breakdown from the point of the actual action that resulted in the costs.

There is a bit of a nuance here about suffering a death wave and a cold cache: a cold cache means that the CI machines have to operate for a longer period of time. For the sake of a complete discussion I will extend the calculation here, but I believe this extra cost quickly trivialises.

We have two primary CI loops, one for testing, and one for packaging a snapshot of the product. They run at different optimisation levels (can't directly share the cache between them) and target a different set of platforms (absolutely can't share the cache between them). The current rounded estimate (for the sake of easier calculation as there is some fluctuation not related to the presence of the cache) is

  • 25 instead of 5 minutes for Linux (n2d-highcpu-48, the tier we have measured to give the best performance characteristics) means a cost of $1.497024 \dfrac{\$}{hr} * \dfrac{1}{3} = 0.499 \$ $ to "reheat" the cache.
  • 2 hours instead of 20 minutes for Mac in GitHub Actions CI gives us $0.08 \dfrac{\$}{min} * 100 = 8\$ $ for the same.

So even with a death wave frequency of "every month" it's only $8.5 to reheat the cache, which, fiscally, isn't terribly much if it happens rarely enough.

At the level of 1M requests per month, it's 273GB.

For the actual project, let's say we have, on average, 6000 cacheable objects (TUs) at hand. (It's slightly different between the "distribution" CI and the "testing" CI which run on different schedules and triggers, but 6000 is a good average.) So that means $166\dfrac{2}{3}$ CI executions in a month will exhaust your suggested 1M request threshold. If we assume that 5 patches get created every day and each patch needs precisely 1 round of reviews to be accepted (and, thus, one re-run of the pull_request-related CI pipeline), that's 21 * 5 * 2 = 210 CI runs alone without ever running the "distribution" CI loop.

Right now, with the "death waves" disabled (it only ever happened once and resulted in a sudden spike of CI times, which is what prompted me to open this ticket), we have 904 MiB of data in the cache (according to both the GCP Console's total_bytes_live-object metric in the browser and gsutil du -sh). I had no idea the compression logic was this good, considering the actual size of a Build/ directory is about 5 GiB. (These are Release builds. Debug builds would be about 120 GiB!)

So certainly, as you analysed, running a PATCH CustomTime every single time is not feasible economically.

However…

Now we have two ways:

  • Way A: set a longer age, e.g. increase it from 30 days to 360 days, so we only trigger the death wave once per year.
  • Way B: update CustomTime on every read request, so we can clean up objects that have not been accessed in time.

Could you help me understand the feasibility of the following third option: update the CustomTime regularly, but only with a wide enough interval.

My thought right now is to implement it so that the sccache administrator could provide two values as arguments:

  • an ExpiryThreshold (which likely equals the automatic object-cleanup rule present in the bucket settings); or maybe a customised API request with the right permissions could even query this value automatically from the bucket's configuration during sccache's startup
  • a ThresholdDelta, which specifies the earliest moment before expiry at which a CustomTime update is to be triggered

With these in hand, we can run the PATCH CustomTime request in a way that it only ever patches if the current cache hit is "reasonably close" to said expiry time. What is "reasonable" here is, of course, up for consideration. For example, I would set it such that objects CustomTime-expire after 90 days, and then the "reasonable update delta" is something like 7 or 14 days.

If this were possible, it means that for each cache hit we would still be expending only one cheaper Class B operation (storage.*.get?), as long as Now() < Object.CustomTime + ExpiryThreshold - ThresholdDelta. Same behaviour as if this entire feature never existed and CustomTime was never used. In case Now() ≥ …, i.e., we are reasonably close to having the object expire and suffering a valley in cache readiness, we fire the more expensive PATCH operation.
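To make that concrete, a rough sketch of the per-hit check at the HTTP level (the variable names and the 90/14-day values are purely illustrative, and curl/jq/GNU date stand in for what sccache would do through its storage layer):

```sh
# Lazy-refresh sketch: only PATCH customTime when the cache hit is "reasonably close" to expiry.
BUCKET=my-sccache-bucket
OBJECT=f%2Fa%2F3%2Ffa3...                # URL-encoded object name (placeholder)
EXPIRY_THRESHOLD_DAYS=90                 # should match the bucket's daysSinceCustomTime rule
THRESHOLD_DELTA_DAYS=14                  # how close to expiry we start refreshing

TOKEN=$(gcloud auth print-access-token)
BASE="https://storage.googleapis.com/storage/v1/b/${BUCKET}/o/${OBJECT}"

# One Class B operation: read the metadata (the cache hit itself is the separate ?alt=media GET).
custom_time=$(curl -s -H "Authorization: Bearer ${TOKEN}" "${BASE}" | jq -r '.customTime // .timeCreated')

now=$(date -u +%s)
expires_at=$(( $(date -u -d "${custom_time}" +%s) + EXPIRY_THRESHOLD_DAYS * 86400 ))

if (( now >= expires_at - THRESHOLD_DELTA_DAYS * 86400 )); then
  # Reasonably close to expiry: spend one Class A operation to push customTime forward.
  curl -s -X PATCH \
    -H "Authorization: Bearer ${TOKEN}" \
    -H "Content-Type: application/json" \
    -d "{\"customTime\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}" \
    "${BASE}" > /dev/null
fi
```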

This way, on the individual object's level, the cloud costs would be the same as if we had simply let it expire (based on the Age calculated from the creation date) and a subsequent compilation had then run storage.objects.insert (an expensive Class A operation) with the full data. In fact, the cloud costs would be marginally smaller still (I don't have the exact figures here, as they are very much dependent on object size), because the PATCH operation only results in an infinitesimal amount of network communication, whereas the POST (insert) operation has to ingest the full size of the object. (And for the cache hit itself, the data transmitted by the GET operation is still the same.)
However, this would still allow for the cost and, importantly, time savings on the compute side of things. The real problem is when a death wave happens in the middle of the business day: suddenly something that might be run on demand, because someone needs the result of the compilation or a test, is set back by a wide enough margin (1.5 hours is almost 1/5th of a standard workday) to cause loss of flow and attention and a waste of useful developer work time.

How easy would it be to have OpenDAL return the metadata together with the cache object over its storage interface? It seems that the raw JSON API requires two requests (one with ?alt=media at the end) to grab both the metadata and the data:

╰─ curl -X GET -H "Authorization: Bearer $(gcloud auth print-access-token)" https://storage.googleapis.com/storage/v1/b/***/o/f%2fa%2f3%2ffa318a60af06166776aa4687089fa8cdf4cc22c0b0f818e33a5891963ff25b8e
{
  "kind": "storage#object",
  "id": …,
  "selfLink": …,
  "mediaLink": …,
  "name": "f/a/3/fa318a60af06166776aa4687089fa8cdf4cc22c0b0f818e33a5891963ff25b8e",
  "bucket": "***",
  "generation": …,
  "metageneration": …,
  "storageClass": "STANDARD",
  "size": "334",
  "md5Hash": …,
  "crc32c": …,
  "etag": …,
  "timeCreated": "2025-02-10T13:38:52.925Z",
  "updated": "2025-02-19T17:48:26.451Z",
  "timeStorageClassUpdated": "2025-02-10T13:38:52.925Z",
  "customTime": "2025-02-19T17:48:26.210529Z",
  "timeFinalized": "2025-02-10T13:38:52.925Z"
}

╰─ curl -X GET -H "Authorization: Bearer $(gcloud auth print-access-token)" https://storage.googleapis.com/storage/v1/b/***/o/f%2fa%2f3%2ffa318a60af06166776aa4687089fa8cdf4cc22c0b0f818e33a5891963ff25b8e\?alt\=media -s -o - | file -s -
/dev/stdin: Zip archive data, at least v2.0 to extract, compression method=store

╰─ curl -X GET -H "Authorization: Bearer $(gcloud auth print-access-token)"	https://storage.googleapis.com/storage/v1/b/***/o/f%2fa%2f3%2ffa318a60af06166776aa4687089fa8cdf4cc22c0b0f818e33a5891963ff25b8e\?alt\=media -s -o - | bsdtar -xOf - | file -s -
/dev/stdin: Zstandard compressed data (v0.8+), Dictionary ID: None

╰─ curl -X GET -H "Authorization: Bearer $(gcloud auth print-access-token)" https://storage.googleapis.com/storage/v1/b/***/o/f%2fa%2f3%2ffa318a60af06166776aa4687089fa8cdf4cc22c0b0f818e33a5891963ff25b8e\?alt\=media -s -o - | bsdtar -xOf - | unzstd | file -s -
/dev/stdin: Mach-O 64-bit object arm64

This means that we would essentially double the number of Class B operations, as we cannot avoid doing another request for the metadata, which we must still check at every cache hit. So we will have, at N translation units, 2N requests per build. Per 1000 cache hits, we save $\$0.0050 - \$0.0004$ (no PATCH for each file, but another GET for each file) $= \$0.0046$. With the previous calculation specific to our way of using the LLVM project, that is 6000 files per CI run: $6000 / 1000 * \$0.0046 = \$0.0276$ saved compared to the previous solution, but $6000 / 1000 * \$0.0004 = \$0.0024$ spent in excess on the """HEAD""" (metadata GET) compared to the current behaviour on the master branch. At 1M requests, this is $0.4 of extra cost. But my intuition is that if we can tune the parameters correctly, we can still achieve $8.5 - $0.0276 - (the cost of the machine uptime overhead it takes to actually fire the PATCH requests) = <$8.4724 in cost savings on the cache "reheat".

Presently, what I observe (although formulating this accurately is beyond my mathematical prowess…) is that out of the ~6k TUs, an individual patch touches at most 10 files (except for large outliers that break everything due to generated code). This means that every intermediate change to these files during the patch review is a dead store to the cache, and it is these files whose post-merge base state will also go stale forever. Luckily(?), the final version of patch P is a hot store, because the post-merge contents of the files hit by a subsequent (in time) CI job of another patch Q are the same as what was cached by the last CI build of patch P. (pull_request CI jobs run on a hypothetical merged/applied version of the PR/patch at the time of triggering, including changes to master that were not present in the history of the PR branch.) In addition, although this is a wide estimate, because the work on the project is definitely not uniformly distributed, we will likely have at least 80% (maybe even closer to 90%) of the files staying the same essentially forever, excluding uplifts to newer upstream versions, where cache misses will likely dominate significantly.

Do you think my analysis is in the right direction and this approach with greedier comparison of expiry but lazier updates to the CustomTime would be worth pursuing?

@Xuanwo
Collaborator

Xuanwo commented Feb 21, 2025

Hi, @whisperity, thank you for the analysis. I believe we are on the right track.

How easy would it be to have OpenDAL return the metadata together with the cache object over its storage interface?

It's possible for opendal to return the metadata along with the reader; we are working on this.

Do you think my analysis is in the right direction and this approach with greedier comparison of expiry but lazier updates to the CustomTime would be worth pursuing?

This great discussion inspired me to think of another approach: Instead of using PATCH on CustomTime (which is only supported by gcs and cannot be extended to other storage services), how about simply re-uploading it when needed?

For example, if we download a cache and find that it was last modified a month ago, we re-upload it to refresh it. It works almost the same way as PATCH CustomTime, but it is compatible with all existing storage services and requires very few changes.
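For GCS specifically, that flow would look roughly like this at the HTTP level (a sketch with placeholders; the 30-day cut-off is arbitrary, and sccache/OpenDAL would do this internally rather than via curl):

```sh
# "Re-upload to refresh" sketch: after a cache hit, if the object is old, write the
# just-downloaded bytes back so its creation time (and hence a plain Age rule) resets.
BUCKET=my-sccache-bucket
NAME="f/a/3/fa3..."                      # object name (placeholder)
ENC="f%2Fa%2F3%2Ffa3..."                 # the same name, URL-encoded for the path
TOKEN=$(gcloud auth print-access-token)

created=$(curl -s -H "Authorization: Bearer ${TOKEN}" \
  "https://storage.googleapis.com/storage/v1/b/${BUCKET}/o/${ENC}" | jq -r '.timeCreated')
age_days=$(( ( $(date -u +%s) - $(date -u -d "${created}" +%s) ) / 86400 ))

# The cache hit itself: download the object.
curl -s -H "Authorization: Bearer ${TOKEN}" \
  "https://storage.googleapis.com/storage/v1/b/${BUCKET}/o/${ENC}?alt=media" -o cached_object

if (( age_days >= 30 )); then
  # Older than the cut-off: re-upload the same bytes (one Class A insert, plus re-sending the full payload).
  curl -s -X POST \
    -H "Authorization: Bearer ${TOKEN}" \
    -H "Content-Type: application/octet-stream" \
    --data-binary @cached_object \
    "https://storage.googleapis.com/upload/storage/v1/b/${BUCKET}/o?uploadType=media&name=${NAME}" > /dev/null
fi
```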
