
Intermittent HTTP cache GET/PUT silent failures. #487

Closed
adamdottv opened this issue Jan 3, 2022 · 19 comments
Labels
kind: bug (Something isn't working), needs: investigation (The bug / idea needs more investigation)

Comments

@adamdottv
Contributor

What version of Turborepo are you using?

1.0.24

What package manager are you using / does the bug impact?

Yarn v1

What operating system are you using?

Linux

Describe the Bug

I'm seeing intermittent cache misses that I believe are caused by a failure to PUT/GET a given cache key to/from Vercel. In most cases there is no log output indicating that the HTTP request failed (which is why I'm unsure whether the failure occurs on the PUT or the GET), but on one occasion I saw some log output in Vercel that may be helpful. See below.

On the first run this behavior is expected; the cache entry should not exist yet, so the build is executed:

2022-01-03T11:42:43.854Z [DEBUG] run.api:build: start
2022-01-03T11:42:43.854Z [DEBUG] run.api:build: task output globs: outputs=[".turbo/turbo-build.log", "cdk.out/**"]
2022-01-03T11:42:43.854Z [DEBUG] run.api:build: task hash: value=3dd8da4e3f3dd6b0
2022-01-03T11:42:43.854Z [DEBUG] run.api:build: log file: path=apps/api/.turbo/turbo-build.log
2022-01-03T11:42:43.855Z [DEBUG] run: performing request: method=GET url=https://api.vercel.com/v8/artifacts/3dd8da4e3f3dd6b0?slug=***

...

api:build: cache miss, executing 3dd8da4e3f3dd6b0

...

2022-01-03T11:43:10.928Z [DEBUG] run.api:build: caching output: outputs=[".turbo/turbo-build.log", "cdk.out/**"]
2022-01-03T11:43:10.945Z [DEBUG] run.api:build: done: status=complete duration=27.091337379s
2022-01-03T11:43:11.355Z [DEBUG] run: performing request: method=PUT url=https://api.vercel.com/v8/artifacts/3dd8da4e3f3dd6b0?slug=***

I then re-run the job and see an unexpected cache miss:

2022-01-03T12:06:04.257Z [DEBUG] run.api:build: start
2022-01-03T12:06:04.258Z [DEBUG] run.api:build: task output globs: outputs=[".turbo/turbo-build.log", "cdk.out/**"]
2022-01-03T12:06:04.258Z [DEBUG] run.api:build: task hash: value=3dd8da4e3f3dd6b0
2022-01-03T12:06:04.259Z [DEBUG] run.api:build: log file: path=apps/api/.turbo/turbo-build.log
2022-01-03T12:06:04.259Z [DEBUG] run: performing request: method=GET url=https://api.vercel.com/v8/artifacts/3dd8da4e3f3dd6b0?slug=***

...

api:build: cache miss, executing 3dd8da4e3f3dd6b0

...

2022-01-03T12:06:31.097Z [DEBUG] run.api:build: caching output: outputs=[".turbo/turbo-build.log", "cdk.out/**"]
2022-01-03T12:06:31.118Z [DEBUG] run.api:build: done: status=complete duration=26.860623477s
2022-01-03T12:06:31.644Z [DEBUG] run: performing request: method=PUT url=https://api.vercel.com/v8/artifacts/3dd8da4e3f3dd6b0?slug=***

The above output is from GitHub Actions. In one instance (not the above case), I did see the following log output in Vercel that may be a hint:

2022-01-02T12:17:21.421Z [DEBUG] run.web:build: done: status=complete duration=25.849629175s
--
06:17:21.423 | [ERROR] Error uploading artifacts to HTTP cache: archive/tar: write too long

The cache artifacts for the api:build task in the GitHub Actions case are 2.2 MB (zipped). The artifacts for the web:build task in Vercel (that had the visible write error) are 19.7 MB (zipped).

Expected Behavior

I expect that a given cache key will be successfully PUT to the Vercel remote cache and then retrieved on subsequent runs when a matching hash is calculated.

To Reproduce

As mentioned, the cache artifacts where I've seen failures are 2.2 MB (zipped) and 19.7 MB (zipped). I wouldn't expect the actual contents of the cache to be pertinent, but I can provide them if needed.

In terms of repro steps, I'm running turbo in GitHub Actions to deploy a CDK app to AWS, and in Vercel to deploy a Next.js app. I've seen these intermittent failures on both platforms.
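
(As an aside for anyone debugging this: the GET shown in the debug logs can be replayed by hand to check whether the earlier PUT actually landed. Below is a minimal Go sketch, assuming only the endpoint shape visible in the logs above plus a Vercel API token; the hash, TURBO_TOKEN, and TURBO_TEAM values are placeholders to substitute with your own.)

package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Placeholder hash taken from the debug log above; swap in your own.
	hash := "3dd8da4e3f3dd6b0"
	url := fmt.Sprintf("https://api.vercel.com/v8/artifacts/%s?slug=%s", hash, os.Getenv("TURBO_TEAM"))

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}
	// Assumes a Vercel API token is exported as TURBO_TOKEN.
	req.Header.Set("Authorization", "Bearer "+os.Getenv("TURBO_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// 200 means the artifact from the earlier PUT is present;
	// 404 suggests the PUT never landed or was silently dropped.
	fmt.Println(resp.Status)
}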

@jaredpalmer
Contributor

Thanks for the great write-up. Will investigate.

@ThibautMarechal
Contributor

It seems to be fixed by #895.
I haven't been able to reproduce it since updating to version 1.1.7-canary.5.

@ThibautMarechal
Contributor

Never mind, it has happened to me again multiple times.
But I'm on Windows, using Yarn v1 and Turborepo 1.1.9.

@mcrowder65

I also run into this intermittently, on macOS with Yarn v1 and Turborepo 1.2.5-canary.1.

I also see it on Jenkins running Linux.

@maschwenk
Contributor

It'd also be good to log the name of the package that hits the issue; to be honest, I'd like to know which particular package is the problem, but that isn't clear from the output as it stands.

kodiakhq bot pushed a commit that referenced this issue May 9, 2022
As mentioned here #487, in some situations I'd prefer to know which artifact is so large so I can track it down (most of our artifacts should not be so large).

Also, if there's a place up the stack trace where I can insert the package name, that would be ideal, but that looks like a bigger lift, so I want to make sure we actually want to do that.
@sppatel
Contributor

sppatel commented May 12, 2022

Seeing this as well locally when publishing to a custom remote cache.

@maschwenk
Contributor

maschwenk commented May 13, 2022

@sppatel we did some investigation and the custom implementation we're using with Express has a 5 MB file size limit (and it had nothing to do with our S3 backend). I'm hesitant to bump it because I want to know which package is beyond that 5 MB limit before I do. That's what #1201 was meant to address.

@sppatel
Contributor

sppatel commented May 17, 2022

Hmmm. I'm not sure this is a file size limit issue. The latest release of turbo logs the file for which this is happening, and in my case it's always the turbo log file, which for my operations (e.g. eslint, prettier) should be essentially empty. The log files that error out are a few bytes, not megabytes.

@maschwenk
Contributor

@sppatel I can confirm I had the same issue, and the file in question (now that it's being logged 🎉) was also a very short log file.

@maschwenk
Contributor

maschwenk commented May 24, 2022

@jaredpalmer / @sppatel interestingly, https://pkg.go.dev/archive/tar@go1.18.2#Writer.Write says the "write too long" error is returned when:

Write writes to the current file in the tar archive. Write returns the error ErrWriteTooLong if more than Header.Size bytes are written after WriteHeader.

I'm not sure I fully understand that error. It seems to suggest that between the time we declare the header size and the time we actually write the file, the file has become larger? Or something?


A reasonable workaround for me would be to just not upload logs to the cache, which I'd be fine with, but I'm not sure whether there's an option for that.
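
(For illustration, a minimal, self-contained Go sketch of how archive/tar produces that error when the declared Header.Size is smaller than the bytes actually written. This is not turbo's code, just the failure mode in isolation.)

package main

import (
	"archive/tar"
	"bytes"
	"fmt"
)

func main() {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)

	// The header records the file size as it was when the header was built.
	hdr := &tar.Header{
		Name: "turbo-build.log",
		Mode: 0o644,
		Size: 4, // the file was 4 bytes at "stat" time
	}
	if err := tw.WriteHeader(hdr); err != nil {
		panic(err)
	}

	// By the time the contents are copied, the file has grown; writing more
	// than Header.Size bytes fails.
	_, err := tw.Write([]byte("now much longer than four bytes"))
	fmt.Println(err) // archive/tar: write too long
}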

@nathanhammond
Contributor

nathanhammond commented May 25, 2022

A quick serialization of things we're talking about:


Encountering this issue likely means that the cache would have been bad and should not be reused. That it fails in this scenario is likely the better of two bad outcomes.


The place where we generate the header:

hdr, err := tar.FileInfoHeader(info, filepath.ToSlash(target))

Is distant from the place where we write the file:

_, err = io.Copy(tw, f)

We do nothing to prevent reading or writing to the outputs in that gap. Addressing this, however, will likely only change the symptom of the bug—hopefully pushing it back into the stack which is causing the problem, but also possibly just ending up serializing the tar process to occur after the other writer has completed its operation.


We know of at least one guaranteed source of issues here: multiple commands writing their outputs to the same path. We need strict ordering of output directories in our DAG to address that. It also sounds like there may be an issue with our logging piping content to a file. We intend to address all of the places in turbo that can trigger this as high-priority correctness bugs, but at some point we reach "you're running a dev server in a separate terminal session", and the solution there is either "don't do that", lots of (presumably) expensive file locking/unlocking, or FUSE mounts at the output directories.
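
(To make that gap concrete, here is a sketch of the header-then-copy shape, plus one conceivable way to keep the archive entry internally consistent: copy at most hdr.Size bytes even if the file has grown since it was stat'd. This is illustrative only, not turbo's implementation, and addFile is a hypothetical helper.)

package main

import (
	"archive/tar"
	"io"
	"os"
	"path/filepath"
)

// addFile shows the window described above: the size is captured by Lstat,
// but the bytes are only read later by the copy.
func addFile(tw *tar.Writer, target string) error {
	info, err := os.Lstat(target) // size recorded here, at time T
	if err != nil {
		return err
	}
	hdr, err := tar.FileInfoHeader(info, filepath.ToSlash(target))
	if err != nil {
		return err
	}
	if err := tw.WriteHeader(hdr); err != nil {
		return err
	}
	f, err := os.Open(target)
	if err != nil {
		return err
	}
	defer f.Close()
	// io.Copy(tw, f) writes whatever the file holds *now* and fails with
	// ErrWriteTooLong if that exceeds hdr.Size; CopyN instead caps the copy
	// at the size recorded in the header, so the entry stays consistent.
	_, err = io.CopyN(tw, f, hdr.Size)
	return err
}

func main() {
	tw := tar.NewWriter(io.Discard)
	defer tw.Close()
	_ = addFile(tw, "go.mod") // any existing regular file works for the demo
}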

@maschwenk
Contributor

but at some point we reach "you're running a dev server in a separate terminal session" and the solution is "don't do that"

@nathanhammond My first instinct was to blame concurrent dev-server compiles for this kind of intermittency... until I saw it happening in CI in an isolated Docker container. That makes me think the underlying issue could be the fact that we're using pnpm? pnpm relies heavily on hard links, which I've seen mentioned as a potential issue for tar-ing. At the same time, I'm unsure why that would impact building package sources, since those shouldn't contain any of the node_modules hard links.


Just to make sure I understand the timeline:

  1. Turbo sets up the tar header, setting the header size to be the size of the contents at time T
  2. (T+1) Turbo starts putting files into the tarball
  3. (T+2) Another thread or process (from Turbo or otherwise) writes into that folder, changing the size of the contents
  4. (T+3) Turbo puts another file into the tarball, Go throws an exception because now the size of the tarball exceeds what we had set in the header

I agree the best course of action from Turbo's perspective is to try to make these failures rare, but they're impossible to prevent completely. Certainly, uploading an inconsistent cache would be the more harmful outcome. However, we're trying to use turbo in CI, and having it intermittently fail makes it very hard to feel confident moving forward. I could just tell Turbo to always try to run twice, but it'd be ideal if Turbo offered an option (or a default) of re-running the tar process in these situations, or of skipping the upload without failing the build. Appreciate the detailed response and love the tool! Happy to contribute to source as well, just curious to know what you recommend 😄
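
(A sketch of the kind of retry-then-skip behavior being suggested here; this is hypothetical, not an existing turbo option, and retryArchive is an illustrative helper.)

package main

import (
	"archive/tar"
	"errors"
	"fmt"
)

// retryArchive sketches the idea above: if archiving races with a concurrent
// writer and fails with ErrWriteTooLong, retry a few times; if it still
// fails, the caller can skip the cache upload instead of failing the build.
func retryArchive(archive func() error, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = archive(); err == nil {
			return nil
		}
		if !errors.Is(err, tar.ErrWriteTooLong) {
			return err // unrelated failure; surface it immediately
		}
	}
	return fmt.Errorf("cache archive still inconsistent after %d attempts: %w", attempts, err)
}

func main() {
	// Dummy archive step that always races, just to show the control flow.
	failing := func() error { return tar.ErrWriteTooLong }
	if err := retryArchive(failing, 3); err != nil {
		fmt.Println("skipping cache upload:", err)
	}
}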

@jaredpalmer
Contributor

jaredpalmer commented May 25, 2022

This should be fixed in #1293 and turbo@1.2.14

@ThibautMarechal
Contributor

I can confirm that the problem hasn't happened since the update 💯

@maschwenk
Contributor

maschwenk commented May 30, 2022

@ThibautMarechal same, though it was pretty intermittent in the first place 🎉

thanks @jaredpalmer / @nathanhammond -- can probably close the issue

@gsoltis
Contributor

gsoltis commented Jun 3, 2022

Closing, happy to reopen if necessary.

@gsoltis gsoltis closed this as completed Jun 3, 2022
@mkotsollaris

I am getting cache misses all the time, even though I am pretty sure I have configured the CircleCI secrets correctly. Here's my repo + CircleCI logs:

Am I missing something?

@gsoltis
Contributor

gsoltis commented Jun 6, 2022

@mkotsollaris Can you file a separate issue for us to track? It looks like a separate issue from what is described here (individual http requests failing).

@mkotsollaris

Thanks #1343

sokra added a commit that referenced this issue Oct 25, 2022