Intermittent HTTP cache GET/PUT silent failures. #487
Comments
Thanks for the great write-up. Will investigate.
It seems to be fixed with this: #895
Never mind, it happened to me again multiple times.
I also run into this intermittently, using macOS, Yarn v1, and turborepo 1.2.5-canary.1. I also see it on Jenkins running Linux.
It'd also be good to be able to log the name of the package that has the issue; to be honest, I'd like to know which particular package is the problem, but it's not clear to me from the current output.
As mentioned here in #487, in some situations I'd prefer to know which artifact is so large so I can track it down (most of our artifacts should not be so large). Also, if there's a place up the stack trace where I can insert the package name, that'd be ideal, but that looks like a bigger lift, so I want to make sure we want to do that.
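A minimal Go sketch of the kind of logging being requested here (the helper, the task ID format, and the artifact path are hypothetical illustrations, not turborepo's actual code):

```go
package main

import (
	"fmt"
	"os"
)

// logArtifactSize is a hypothetical helper: taskID would be something like
// "web#build" and path the compressed artifact that is about to be uploaded.
func logArtifactSize(taskID, path string) error {
	info, err := os.Stat(path)
	if err != nil {
		return err
	}
	fmt.Fprintf(os.Stderr, "uploading artifact for %s: %.1f MB\n",
		taskID, float64(info.Size())/(1024*1024))
	return nil
}

func main() {
	// Example usage; both arguments are placeholders.
	if err := logArtifactSize("web#build", "artifact.tar.gz"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```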
Seeing this as well locally, publishing to a custom remote cache.
@sppatel We did some investigation, and the custom implementation we're using with Express has a 5 MB file size limit (and it had nothing to do with our S3 backend). I'm hesitant to bump it because I want to know which package is beyond that 5 MB limit before I do. That's what #1201 was meant to address.
Hmmm. I'm not sure this is a file size limit issue. The latest release of turbo logs the file for which this is happening, and in my case, when it occurs, it's always the turbo log file, which for my operations (e.g. eslint, prettier, etc.) should be essentially empty. The log files which error out are bytes in size, not megabytes.
@sppatel I can confirm I had the same issue, and the file in question (now that it's logging 🎉) was also a very short log file.
@jaredpalmer / @sppatel Interestingly, the documentation for archive/tar (https://pkg.go.dev/archive/tar@go1.18.2#Writer.Write) says that Write returns the error ErrWriteTooLong if more than Header.Size bytes are written after WriteHeader.
I'm not sure I fully understand that error. It seems to suggest that between the time where we declare the header size and the time we actually write to it, the file sizes are larger? Or something? A reasonable workaround for me would be to just not upload logs to the cache, which I'd be fine with, but I'm not sure if there's an option for that.
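For reference, a minimal, runnable Go sketch of the failure mode the linked docs describe: the tar header is written with one size, more bytes than that are written afterwards, and archive/tar returns ErrWriteTooLong.

```go
package main

import (
	"archive/tar"
	"bytes"
	"errors"
	"fmt"
)

func main() {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)

	// The header claims the file is 5 bytes, analogous to stat-ing the file
	// at the moment the header is generated.
	hdr := &tar.Header{Name: "turbo.log", Mode: 0o644, Size: 5}
	if err := tw.WriteHeader(hdr); err != nil {
		panic(err)
	}

	// By the time the contents are copied, the file has "grown" to 10 bytes,
	// analogous to something appending to it in the gap.
	_, err := tw.Write([]byte("0123456789"))
	fmt.Println(errors.Is(err, tar.ErrWriteTooLong)) // prints: true
}
```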
A quick serialization of the things we're talking about: encountering this issue likely means that the cache would have been bad and should not be reused, so failing in this scenario is likely the better of two bad outcomes.
The place where we generate the header (turborepo/cli/internal/cache/cache_http.go, line 109 in 63a07b9) is distant from the place where we write the file (turborepo/cli/internal/cache/cache_http.go, line 134 in 63a07b9), and we do nothing to prevent reading or writing to the outputs in that gap. Addressing this, however, will likely only change the symptom of the bug: hopefully pushing it back into the stack which is causing the problem, but possibly just ending up serializing the […]
We know of at least one source of guaranteed issues with this, where multiple commands write their outputs to the same path; we need strict ordering of the output directories considered in our DAG to address that. It also sounds like there may be an issue with our logging and piping content to a file. We intend to address all of the places in […]
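A sketch of the pattern described above (not the actual cache_http.go code): the header size comes from a stat at one point in time, and the file bytes are copied later, so anything that grows the file in that window makes the archive write fail.

```go
package main

import (
	"archive/tar"
	"io"
	"os"
)

// addFile mirrors the general shape of the problem: stat, write header, then
// copy the file contents some time later.
func addFile(tw *tar.Writer, path string) error {
	info, err := os.Lstat(path) // the size is captured here...
	if err != nil {
		return err
	}
	hdr, err := tar.FileInfoHeader(info, "")
	if err != nil {
		return err
	}
	if err := tw.WriteHeader(hdr); err != nil {
		return err
	}
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// ...but the bytes are read here; anything appended to the file between
	// the Lstat and the Copy pushes the write past hdr.Size, and the tar
	// writer fails with ErrWriteTooLong.
	_, err = io.Copy(tw, f)
	return err
}

func main() {
	tw := tar.NewWriter(io.Discard)
	defer tw.Close()
	// "turbo.log" is just an example path for the sketch.
	if err := addFile(tw, "turbo.log"); err != nil {
		panic(err)
	}
}
```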
@nathanhammond My first instinct has been to blame concurrent dev server compiles for these types of intermittency... until I saw this happening in CI on an isolated Docker container, which makes me think the underlying issue could be in the fact that we're using […]
Just to make sure I understand the timeline: […]
I agree the best course of action from Turbo's perspective is to try to make these rare, but it's impossible to prevent them completely. Certainly, uploading an inconsistent cache is more harmful than simply failing. However, we're trying to use […]
This should be fixed in #1293 and […]
I can confirm that the problem hasn't happened since the update 💯
@ThibautMarechal Same, though it was pretty intermittent in the first place 🎉 Thanks @jaredpalmer / @nathanhammond; you can probably close the issue.
Closing, happy to reopen if necessary. |
I am getting […]. Am I missing something?
@mkotsollaris Can you file a separate issue for us to track? It looks like a separate issue from what is described here (individual http requests failing). |
Thanks #1343 |
What version of Turborepo are you using?
1.0.24
What package manager are you using / does the bug impact?
Yarn v1
What operating system are you using?
Linux
Describe the Bug
I'm seeing intermittent cache misses that I believe to be caused by a failure to PUT/GET a given cache key to/from Vercel. In most cases I'm seeing no log output to indicate that the HTTP request failed—this is why I'm unsure whether the failure occurs on PUT or GET—but, on one occasion I saw some log output in Vercel that may be helpful. See below.
First run, this behavior is expected; this cache entry should not exist yet, so the build is executed:
I then re-run the job and see an unexpected cache miss:
The above output is from GitHub Actions. In one instance (not the above case), I did see the following log output in Vercel that may be a hint:
The cache artifacts for the api:build task in the GitHub Actions case are 2.2 MB (zipped). The artifacts for the web:build task in Vercel (which had the visible write error) are 19.7 MB (zipped).
Expected Behavior
I expect that a given cache key will be successfully PUT to the Vercel remote cache and then retrieved on subsequent runs when a matching hash is calculated.
To Reproduce
As mentioned, the cache artifacts where I've seen failures are 2.2 MB (zipped) and 19.7 MB (zipped). I wouldn't expect that the actual contents of the cache are pertinent, but I can provide them if needed.
In terms of repro steps, I'm running turbo in GitHub Actions to deploy a CDK app to AWS, and in Vercel to deploy a Next.js app. I've seen these intermittent failures on both platforms.