
Intermittent HTTP cache GET/PUT silent failures. #487

Closed
adamdottv opened this issue Jan 3, 2022 · 19 comments
Labels
kind: bug (Something isn't working), needs: investigation (The bug / idea needs more investigation)

Comments

@adamdottv
Contributor

What version of Turborepo are you using?

1.0.24

What package manager are you using / does the bug impact?

Yarn v1

What operating system are you using?

Linux

Describe the Bug

I'm seeing intermittent cache misses that I believe are caused by a failure to PUT/GET a given cache key to/from Vercel. In most cases there is no log output indicating that the HTTP request failed (which is why I'm unsure whether the failure occurs on the PUT or the GET), but on one occasion I saw some log output in Vercel that may be helpful. See below.

On the first run this behavior is expected; the cache entry should not exist yet, so the build is executed:

2022-01-03T11:42:43.854Z [DEBUG] run.api:build: start
2022-01-03T11:42:43.854Z [DEBUG] run.api:build: task output globs: outputs=[".turbo/turbo-build.log", "cdk.out/**"]
2022-01-03T11:42:43.854Z [DEBUG] run.api:build: task hash: value=3dd8da4e3f3dd6b0
2022-01-03T11:42:43.854Z [DEBUG] run.api:build: log file: path=apps/api/.turbo/turbo-build.log
2022-01-03T11:42:43.855Z [DEBUG] run: performing request: method=GET url=https://api.vercel.com/v8/artifacts/3dd8da4e3f3dd6b0?slug=***

...

api:build: cache miss, executing 3dd8da4e3f3dd6b0

...

2022-01-03T11:43:10.928Z [DEBUG] run.api:build: caching output: outputs=[".turbo/turbo-build.log", "cdk.out/**"]
2022-01-03T11:43:10.945Z [DEBUG] run.api:build: done: status=complete duration=27.091337379s
2022-01-03T11:43:11.355Z [DEBUG] run: performing request: method=PUT url=https://api.vercel.com/v8/artifacts/3dd8da4e3f3dd6b0?slug=***

I then re-run the job and see an unexpected cache miss:

2022-01-03T12:06:04.257Z [DEBUG] run.api:build: start
2022-01-03T12:06:04.258Z [DEBUG] run.api:build: task output globs: outputs=[".turbo/turbo-build.log", "cdk.out/**"]
2022-01-03T12:06:04.258Z [DEBUG] run.api:build: task hash: value=3dd8da4e3f3dd6b0
2022-01-03T12:06:04.259Z [DEBUG] run.api:build: log file: path=apps/api/.turbo/turbo-build.log
2022-01-03T12:06:04.259Z [DEBUG] run: performing request: method=GET url=https://api.vercel.com/v8/artifacts/3dd8da4e3f3dd6b0?slug=***

...

api:build: cache miss, executing 3dd8da4e3f3dd6b0

...

2022-01-03T12:06:31.097Z [DEBUG] run.api:build: caching output: outputs=[".turbo/turbo-build.log", "cdk.out/**"]
2022-01-03T12:06:31.118Z [DEBUG] run.api:build: done: status=complete duration=26.860623477s
2022-01-03T12:06:31.644Z [DEBUG] run: performing request: method=PUT url=https://api.vercel.com/v8/artifacts/3dd8da4e3f3dd6b0?slug=***

The above output is from GitHub Actions. In one instance (not the above case), I did see the following log output in Vercel that may be a hint:

2022-01-02T12:17:21.421Z [DEBUG] run.web:build: done: status=complete duration=25.849629175s
--
06:17:21.423 | [ERROR] Error uploading artifacts to HTTP cache: archive/tar: write too long

The cache artifacts for the api:build task in the GitHub Actions case are 2.2 MB (zipped). The artifacts for the web:build task in Vercel (that had the visible write error) are 19.7 MB (zipped).

Expected Behavior

I expect that a given cache key will be successfully PUT to the Vercel remote cache and then retrieved on subsequent runs when a matching hash is calculated.

To Reproduce

As mentioned, the cache artifacts where I've seen failures are 2.2 MB (zipped) and 19.7 MB (zipped). I wouldn't expect the actual contents of the cache to be pertinent, but I can provide them if needed.

In terms of repro steps, I'm running turbo in GitHub Actions to deploy a CDK app to AWS, and in Vercel to deploy a Next.js app. I've seen these intermittent failures on both platforms.
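
(As an aside for anyone debugging this: the GET shown in the debug logs can be replayed by hand to check whether the earlier PUT actually landed. Below is a minimal Go sketch, assuming only the endpoint shape visible in the logs above plus a Vercel API token; the hash, TURBO_TOKEN, and TURBO_TEAM values are placeholders to substitute with your own.)

package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Placeholder hash taken from the debug log above; swap in your own.
	hash := "3dd8da4e3f3dd6b0"
	url := fmt.Sprintf("https://api.vercel.com/v8/artifacts/%s?slug=%s", hash, os.Getenv("TURBO_TEAM"))

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}
	// Assumes a Vercel API token is exported as TURBO_TOKEN.
	req.Header.Set("Authorization", "Bearer "+os.Getenv("TURBO_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// 200 means the artifact from the earlier PUT is present;
	// 404 suggests the PUT never landed or was silently dropped.
	fmt.Println(resp.Status)
}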

@jaredpalmer
Contributor

Thanks for the great write-up. Will investigate.

@ThibautMarechal
Contributor

It seems to be fixed by #895.
I haven't been able to reproduce it since updating to version 1.1.7-canary.5.

@ThibautMarechal
Contributor

Never mind, it has happened to me again multiple times.
But I'm on Windows, using Yarn v1 and Turborepo 1.1.9.

@mcrowder65

I also run into this intermittently, on macOS with Yarn v1 and Turborepo 1.2.5-canary.1.

I also see it on Jenkins running Linux.

@maschwenk
Contributor

It'd also be good to log the name of the package that hits the issue; to be honest, I'd like to know which particular package is the problem, but that isn't clear from the output as it stands.

kodiakhq bot pushed a commit that referenced this issue May 9, 2022
As mentioned here #487, in some situations I'd prefer to know which artifact is so large so I can track it down (most of our artifacts should not be so large).

Also, if there's a place up the stack trace where I can insert the package name, that would be ideal, but that looks like a bigger lift, so I want to make sure we actually want to do that.
@sppatel
Contributor

sppatel commented May 12, 2022

Seeing this as well locally when publishing to a custom remote cache.

@maschwenk
Contributor

maschwenk commented May 13, 2022

@sppatel we did some investigation and the custom implementation we're using with Express has a 5 MB file size limit (and it had nothing to do with our S3 backend). I'm hesitant to bump it because I want to know which package is beyond that 5 MB limit before I do. That's what #1201 was meant to address.

@sppatel
Contributor

sppatel commented May 17, 2022

Hmmm. I'm not sure this is a file size limit issue. The latest release of turbo logs the file for which this is happening, and in my case it's always the turbo log file, which for my operations (e.g. eslint, prettier) should be essentially empty. The log files that error out are a few bytes, not megabytes.

@maschwenk
Contributor

@sppatel I can confirm I had the same issue, and the file in question (now that it's being logged 🎉) was also a very short log file.

@maschwenk
Contributor

maschwenk commented May 24, 2022

@jaredpalmer / @sppatel interestingly, https://pkg.go.dev/archive/tar@go1.18.2#Writer.Write says the "write too long" error is returned when:

Write writes to the current file in the tar archive. Write returns the error ErrWriteTooLong if more than Header.Size bytes are written after WriteHeader.

I'm not sure I fully understand that error. It seems to suggest that between the time we declare the header size and the time we actually write the file, the file has become larger? Or something?


A reasonable workaround for me would be to just not upload logs to the cache, which I'd be fine with, but I'm not sure whether there's an option for that.
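
(For illustration, a minimal, self-contained Go sketch of how archive/tar produces that error when the declared Header.Size is smaller than the bytes actually written. This is not turbo's code, just the failure mode in isolation.)

package main

import (
	"archive/tar"
	"bytes"
	"fmt"
)

func main() {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)

	// The header records the file size as it was when the header was built.
	hdr := &tar.Header{
		Name: "turbo-build.log",
		Mode: 0o644,
		Size: 4, // the file was 4 bytes at "stat" time
	}
	if err := tw.WriteHeader(hdr); err != nil {
		panic(err)
	}

	// By the time the contents are copied, the file has grown; writing more
	// than Header.Size bytes fails.
	_, err := tw.Write([]byte("now much longer than four bytes"))
	fmt.Println(err) // archive/tar: write too long
}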

@nathanhammond
Contributor

nathanhammond commented May 25, 2022

A quick serialization of things we're talking about:


Encountering this issue likely means that the cache would have been bad and should not be reused. That it fails in this scenario is likely the better of two bad outcomes.


The place where we generate the header:

hdr, err := tar.FileInfoHeader(info, filepath.ToSlash(target))

Is distant from the place where we write the file:

_, err = io.Copy(tw, f)

We do nothing to prevent reading or writing to the outputs in that gap. Addressing this, however, will likely only change the symptom of the bug—hopefully pushing it back into the stack which is causing the problem, but also possibly just ending up serializing the tar process to occur after the other writer has completed its operation.


We know of at least one guaranteed source of issues here: multiple commands writing their outputs to the same path. We need strict ordering of output directories in our DAG to address that. It also sounds like there may be an issue with our logging piping content to a file. We intend to address all of the places in turbo that can trigger this as high-priority correctness bugs, but at some point we reach "you're running a dev server in a separate terminal session", and the solution there is either "don't do that", lots of (presumably) expensive file locking/unlocking, or FUSE mounts at the output directories.
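
(To make that gap concrete, here is a sketch of the header-then-copy shape, plus one conceivable way to keep the archive entry internally consistent: copy at most hdr.Size bytes even if the file has grown since it was stat'd. This is illustrative only, not turbo's implementation, and addFile is a hypothetical helper.)

package main

import (
	"archive/tar"
	"io"
	"os"
	"path/filepath"
)

// addFile shows the window described above: the size is captured by Lstat,
// but the bytes are only read later by the copy.
func addFile(tw *tar.Writer, target string) error {
	info, err := os.Lstat(target) // size recorded here, at time T
	if err != nil {
		return err
	}
	hdr, err := tar.FileInfoHeader(info, filepath.ToSlash(target))
	if err != nil {
		return err
	}
	if err := tw.WriteHeader(hdr); err != nil {
		return err
	}
	f, err := os.Open(target)
	if err != nil {
		return err
	}
	defer f.Close()
	// io.Copy(tw, f) writes whatever the file holds *now* and fails with
	// ErrWriteTooLong if that exceeds hdr.Size; CopyN instead caps the copy
	// at the size recorded in the header, so the entry stays consistent.
	_, err = io.CopyN(tw, f, hdr.Size)
	return err
}

func main() {
	tw := tar.NewWriter(io.Discard)
	defer tw.Close()
	_ = addFile(tw, "go.mod") // any existing regular file works for the demo
}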

@maschwenk
Contributor

but at some point we reach "you're running a dev server in a separate terminal session" and the solution is "don't do that"

@nathanhammond My first instinct was to blame concurrent dev-server compiles for this kind of intermittency... until I saw it happening in CI in an isolated Docker container. That makes me think the underlying issue could be the fact that we're using pnpm? pnpm relies heavily on hard links, which I've seen mentioned as a potential issue for tar-ing. At the same time, I'm unsure why that would impact building package sources, since those shouldn't contain any of the node_modules hard links.


Just to make sure I understand the timeline:

  1. Turbo sets up the tar header, setting the header size to be the size of the contents at time T
  2. (T+1) Turbo starts putting files into the tarball
  3. (T+2) Another thread or process (from Turbo or otherwise) writes into that folder, changing the size of the contents
  4. (T+3) Turbo puts another file into the tarball, Go throws an exception because now the size of the tarball exceeds what we had set in the header

I agree the best course of action from Turbo's perspective is to try to make these failures rare, but they're impossible to prevent completely. Certainly, uploading an inconsistent cache would be the more harmful outcome. However, we're trying to use turbo in CI, and having it intermittently fail makes it very hard to feel confident moving forward. I could just tell Turbo to always try to run twice, but it'd be ideal if Turbo offered an option (or a default) of re-running the tar process in these situations, or of skipping the upload without failing the build. Appreciate the detailed response and love the tool! Happy to contribute to source as well, just curious to know what you recommend 😄
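
(A sketch of the kind of retry-then-skip behavior being suggested here; this is hypothetical, not an existing turbo option, and retryArchive is an illustrative helper.)

package main

import (
	"archive/tar"
	"errors"
	"fmt"
)

// retryArchive sketches the idea above: if archiving races with a concurrent
// writer and fails with ErrWriteTooLong, retry a few times; if it still
// fails, the caller can skip the cache upload instead of failing the build.
func retryArchive(archive func() error, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = archive(); err == nil {
			return nil
		}
		if !errors.Is(err, tar.ErrWriteTooLong) {
			return err // unrelated failure; surface it immediately
		}
	}
	return fmt.Errorf("cache archive still inconsistent after %d attempts: %w", attempts, err)
}

func main() {
	// Dummy archive step that always races, just to show the control flow.
	failing := func() error { return tar.ErrWriteTooLong }
	if err := retryArchive(failing, 3); err != nil {
		fmt.Println("skipping cache upload:", err)
	}
}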

@jaredpalmer
Contributor

jaredpalmer commented May 25, 2022

This should be fixed in #1293 and turbo@1.2.14

@ThibautMarechal
Contributor

I can confirm that the problem hasn't happened since the update 💯

@maschwenk
Contributor

maschwenk commented May 30, 2022

@ThibautMarechal same, though it was pretty intermittent in the first place 🎉

thanks @jaredpalmer / @nathanhammond -- can probably close the issue

@gsoltis
Contributor

gsoltis commented Jun 3, 2022

Closing, happy to reopen if necessary.

@gsoltis gsoltis closed this as completed Jun 3, 2022
@mkotsollaris

I am getting cache misses all the time, even though I am pretty sure I have configured the CircleCI secrets correctly. Here's my repo + CircleCI logs:

Am I missing something?

@gsoltis
Contributor

gsoltis commented Jun 6, 2022

@mkotsollaris Can you file a separate issue for us to track? It looks like a separate issue from what is described here (individual http requests failing).

@mkotsollaris

Thanks #1343

sokra added a commit that referenced this issue Oct 25, 2022