[GCS] Feature req: Use CustomTime attribute to record when cached entry was last hit (for eviction/cleanup rules) #2318
Hi, @whisperity, thanks for suggesting this. I understand your use case, but I feel that using `CustomTime` comes with a cost trade-off that is worth working through first. Let's take GCS `us-east1` as an example; the relevant prices there are storage (per GB-month), Class A operations (writes and metadata updates such as PATCH), and Class B operations (reads).

Now we have two ways: way A keeps cached entries around for longer, which adds more storage cost, while way B refreshes `CustomTime` on every cache hit, which adds more Class A requests. Let's perform the estimation in the simplest way: we currently have N GB of data, and it will continue to grow each month. We will request that cache X times per month. After a full year, the cost of our cache storage will stabilize, with a maximum of 12N GB of data. The cost increase (removing the cost common to both ways) is the extra storage for way A versus the extra Class A requests for way B.

We end up with a ratio: each month, keeping more data (way A) becomes the cost-efficient choice when its extra storage cost is lower than the cost of way B's extra Class A requests. Hoping this can contribute to our discussion.
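To make that break-even point concrete, here is a rough sketch of the comparison under the assumptions above, in my own notation: $p_s$ is the storage price per GB-month and $p_A$ is the price per Class A request (the actual list prices should be taken from the GCS pricing page).

$$
\underbrace{12N \cdot p_s}_{\text{way A: extra storage per month}}
\quad \text{vs.} \quad
\underbrace{X \cdot p_A}_{\text{way B: extra Class A requests per month}}
\qquad \Rightarrow \qquad
\text{way A is cheaper when } \frac{X}{N} > \frac{12\, p_s}{p_A}.
$$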
Hey @Xuanwo! Thank you for the quick reply and the detailed analysis! I must admit this is information that I did not originally consider, and, in hindsight, it must have been stupid of me not to think that the cloud provider would nickel & dime us for something as simple as a timestamp… D'oh!

I will come back with some more concrete pricing information tomorrow, because all the actual testing of this feature (albeit with a much smaller project than what we use the CI for!) was done yesterday, and it takes GCP about 1.5 days to fully stabilise the billing breakdown from the point of the actual action that resulted in the costs.

There is a bit of a nuance in suffering a death wave and a cold cache, because a cold cache means that the CI machine has to operate for a longer period of time. For the sake of a complete discussion I will extend the calculation here, but I believe this extra cost quickly trivialises. We have two primary CI loops, one for testing and one for packaging a snapshot of the product. They run at different optimisation levels (so they can't directly share the cache) and target a different set of platforms (so they absolutely can't share the cache). The current rounded estimate (rounded for the sake of easier calculation, as there is some fluctuation not related to the presence of the cache) is:
So even with a death-wave frequency of "every month", it's only $8.5 to reheat the cache, which, fiscally, isn't terribly much if it happens rarely enough.
For the actual project, let's say we have, on average, 6000 cacheable objects (TUs) at hand. (It's slightly different between the "distribution" CI and the "testing" CI, which run on different schedules and triggers, but 6000 is a good average.)

Right now, with the "death waves" disabled (the wave only ever happened once and resulted in a sudden spike of CI times, which is what prompted me to open this ticket), we have 904 MiB of data (according to the GCP Console in the browser). So certainly, as you analysed, paying for a Class A request on every single cache hit would add a noticeable cost. However…
Could you help me understand the feasibility of the following third option: update the `CustomTime` only when the entry is getting close to expiry, rather than on every single hit? My thoughts right now are to implement it in a way that the sccache administrator could provide two values as arguments: the eviction period configured on the bucket's lifecycle rule, and how close to that deadline an entry may get before its `CustomTime` is refreshed.

With these in hand, we can run the comparison locally on every cache hit. If this was possible, it would mean that for each cache hit we would usually still be expending only the cheaper Class B operations. This way, on the individual object's level, the cloud costs would be the same as if we had simply let it expire (based on the Age calculated from the creation date) and a subsequent compilation had repopulated it. How easy would it be to have OpenDAL return the metadata together with the cache object over its storage interface? It seems that the raw JSON API requires two requests (one with `alt=media` for the object's contents and another one for its metadata).

This means that we would essentially double the number of Class B operations, as we cannot avoid doing another request for the metadata, which we must still check at every cache hit. So, at N translation units, we will have 2N requests per build. Per 1000 cache hits, however, we would still be saving on the far more expensive Class A operations.

Presently, what I observe (although formulating this accurately is beyond my mathematical prowess…) is that out of the ~6k TUs, an individual patch touches at most 10 files (except for large outliers that break everything due to generated code). This means that every intermediate change to these files during the patch review is a dead store to the cache, and it is these files whose post-merge base state will go forever stale as well. Luckily(?), the final version of patch P is a hot store, because the post-merge contents of the files hit by a subsequent (in time) CI job of another patch Q are the same as what was cached by the last CI build of patch P.

Do you think my analysis is in the right direction, and is this approach of a greedier comparison of expiry but lazier updates to the `CustomTime` worth pursuing? To illustrate what I have in mind, see the sketch below.
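A minimal sketch of that decision logic, written against the Python `google-cloud-storage` client purely for illustration (sccache would of course go through OpenDAL and its own HTTP stack); the two thresholds and all names here are placeholders, not anything that exists today:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

from google.cloud import storage  # illustrative client only, not what sccache uses

EVICT_AFTER = timedelta(days=30)   # must match the bucket's daysSinceCustomTime rule
REFRESH_SLACK = timedelta(days=7)  # assumed: only refresh when this close to eviction


def cache_get(bucket: storage.Bucket, key: str) -> Optional[bytes]:
    blob = bucket.get_blob(key)            # metadata request (Class B)
    if blob is None:
        return None                        # miss: the caller compiles and uploads
    data = blob.download_as_bytes()        # content request (Class B)
    last_hit = blob.custom_time or blob.time_created
    if datetime.now(timezone.utc) - last_hit > EVICT_AFTER - REFRESH_SLACK:
        # Only now pay for a Class A metadata update to push eviction out again.
        blob.custom_time = datetime.now(timezone.utc)
        blob.patch()
    return data
```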
Hi, @whisperity, thank you for the analysis. I believe we are on the right track.
It's possible for OpenDAL to return the metadata along with the reader; we are working on this.
This great discussion inspired me to think of another approach: instead of issuing a PATCH on `CustomTime`, we could re-upload the cached object itself when we notice it is getting old. For example, if we download a cache entry and find that it was last modified a month ago, we re-upload it to refresh it. It works almost the same way as refreshing `CustomTime`.
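A minimal sketch of that idea, again with the Python `google-cloud-storage` client just to show the shape of it (the assumption being that rewriting the object resets the creation time the Age rule is matched against; the 30-day threshold and the names are placeholders):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

from google.cloud import storage  # illustrative client only

REFRESH_AFTER = timedelta(days=30)  # assumed; should track the bucket's Age rule


def cache_get(bucket: storage.Bucket, key: str) -> Optional[bytes]:
    blob = bucket.get_blob(key)          # metadata request (Class B)
    if blob is None:
        return None                      # cache miss
    data = blob.download_as_bytes()      # the actual cache hit (Class B)
    if datetime.now(timezone.utc) - blob.updated > REFRESH_AFTER:
        # Rewriting the object in place gives it a fresh creation time,
        # so a plain Age lifecycle rule will spare it for another cycle.
        blob.upload_from_string(data)    # Class A, but only for near-expiry entries
    return data
```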
Hey!
We are using a Google Cloud Storage bucket through `sccache` to cache across CI compilations of LLVM. Unfortunately, it looks like none of the cloud provider implementations allow for keeping the size of the cache in check; that is only possible for local-disk caching (which we can not do, as the CI machines are ephemeral VMs).

In order to keep the costs of this shared cache reasonable, we implemented an Age lifecycle rule in GCS: if a cached entry is too old, GCS will deal with deleting it.
According to GCS, an Age rule is matched against the object's creation time: once an object is older than the configured number of days it becomes eligible for deletion, regardless of when it was last read.
This presents a problem with "death waves": every $n$ days after an initial CI run has populated the cache with about $6,000$ translation units, they all get deleted, and the subsequent CI run essentially has to do an almost full rebuild. The only cache hits at $\ge n$ days will be on the more recently modified content, which is often the smaller part of the files in the project.
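For context, the rule we have in place today looks roughly like the following sketch (using the Python `google-cloud-storage` client only to show the shape of the condition; the bucket name and the 30-day value are placeholders):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("our-sccache-bucket")  # placeholder name

# Delete any object once it is 30 days old, counted from its creation time,
# no matter how recently or how often it was read as a cache hit.
bucket.add_lifecycle_delete_rule(age=30)
bucket.patch()
```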
It would be great if `sccache` with a cloud bucket could behave in the same LRU way as `ccache` does locally. Doing a full LRU is problematic without handling size limits, but it seems there is another lifecycle property that can be used, at least in the case of GCS: CustomTime. LRU-like behaviour could be simulated by setting a "Days since custom time" lifecycle rule (instead of Age), making sure that only files which have not been cache-hit for $n$ days are evicted from the bucket; see the sketch after this paragraph.
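A sketch of the rule this request would enable (same illustrative client and placeholder values as above; note that `days_since_custom_time` only matches objects that actually have a CustomTime set):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("our-sccache-bucket")  # placeholder name

# Evict only entries whose CustomTime (i.e. "last cache hit", if sccache kept
# it up to date) is more than 30 days in the past.
bucket.add_lifecycle_delete_rule(days_since_custom_time=30)
bucket.patch()
```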
However, `sccache` does not populate this field at all. It seems from the documentation that this is a simple timestamp field that could be populated through the API, so all we need is a simple additional HTTP request that tells Google to update this field to the current time once the caching logic successfully hits a file.
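Purely for illustration, a hypothetical sketch of that single extra request against the GCS JSON API, written with Python's `requests` (sccache would issue it through its own HTTP and signing stack; the token, bucket, and object name here are placeholders):

```python
from datetime import datetime, timezone
from urllib.parse import quote

import requests


def touch_custom_time(token: str, bucket: str, object_name: str) -> None:
    """PATCH the object's customTime metadata field to 'now' (RFC 3339, UTC)."""
    url = (
        "https://storage.googleapis.com/storage/v1"
        f"/b/{bucket}/o/{quote(object_name, safe='')}"
    )
    body = {"customTime": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")}
    response = requests.patch(
        url, json=body, headers={"Authorization": f"Bearer {token}"}
    )
    response.raise_for_status()
```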
It seems like similar metadata options are available for other cloud providers, such as S3 or Azure, but I have no idea or experience of whether these could be used for controlling the cache's lifecycle as effectively as CustomTime can for GCS.