cmd/compile: fixedbug/issue10958.go failures #18589
Comments
CL https://golang.org/cl/35051 mentions this issue.
Can you try upping the timeout in that test? See https://go-review.googlesource.com/35051 for an example:
I figure I get one try at not-flaky; if it fails, we yank the test.
Do I have to launch the test via …? Because if I …
In the test directory: … The data race is more or less required; the threads have to enter their actually-but-not-obviously infinite loops, and those loops cannot contain calls. The test is whether the compiler is properly inserting rescheduling checks into those loops, and whether they have the intended effect.
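For illustration, here is a minimal sketch of the failure mode that test guards against. This is an assumption-laden example, not the actual test source; the 10-second watchdog and the runtime.GC() trigger are stand-ins chosen for the sketch.

```go
// Hedged sketch only: illustrates the failure mode the test guards against,
// not the real test source.
package main

import (
	"fmt"
	"os"
	"runtime"
	"time"
)

// spin contains no calls in its loop body. Before signal-based preemption
// (added to Go well after this issue), such a loop can only be preempted if
// the compiler inserts a rescheduling check on the loop backedge.
func spin() {
	for {
	}
}

func main() {
	// Watchdog (10s is an assumption mirroring the test's timeout): fail
	// instead of hanging forever if the loop is never preempted.
	time.AfterFunc(10*time.Second, func() {
		fmt.Println("timed out: call-free loop was never preempted")
		os.Exit(1)
	})

	go spin()
	time.Sleep(10 * time.Millisecond) // let spin() enter its loop

	// runtime.GC() must stop every goroutine. It can only complete if spin()
	// reaches a preemption point, i.e. if the backedge check was inserted.
	runtime.GC()
	fmt.Println("GC completed: the loop was preemptible")
	os.Exit(0)
}
```

Note that with GOMAXPROCS=1 even the watchdog goroutine can be starved by the non-preemptible loop, which is why the hang tends to surface as the external test timeout rather than a clean in-process failure.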
That message is from an "old" compiler. There's a new phase in the one paired with that test, and the phase is present whether GOEXPERIMENT is set or not.
Yeah, sorry, I set a bad …
Still seeing failures with …
Too much concurrency in all.bash, perhaps? Thanks for checking this.
Ok. I'll let you decide what to do with this issue.
It's worth disabling. It's only for an experiment, and flaky tests are vile.
A few thoughts: You should be able to reproduce it with just … If you run with …
Thanks for the suggestion.
In the meantime, another CL was merged; the builders tested it and two failed on this test: freebsd/arm (https://build.golang.org/log/02cb9ac66d2d9a19feb6fb479917e709a50850d5). So yeah, it's really flaky.
With a really high timeout (I tried 600) it does not fail, so I guess the machine is just choking a little on that test. It has been disabled now so we're fine.
@ALTree can you say more about the machine? Is it a VM? Hardware? Software? How many cores? What is GOMAXPROCS? Are there other things running on the machine? It's possible the OS descheduled a thread for 10 seconds but that seems like a long time.
Just to confirm that the timeout was correctly set to 10, I tried again (with …):
(Note the 10.025s at the end.) No VM; it's real hardware. The machine is an old NeXtScale cluster (we'll decommission it during 2017, I believe). OS is CentOS 6.5, …
There are many nodes, and each node has … Maybe what's happening is that …
Not set.
Possibly. Since I've only requested about half a node, if another user asks for just a few cores the job manager could assign the job to the same node. The 8 cores I requested are mine though (and they're guaranteed).
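For context on the GOMAXPROCS question above: when GOMAXPROCS is unset, the runtime defaults it to the number of CPUs it can see, which a trivial program can report. This snippet is purely illustrative and is not from the thread:

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

func main() {
	fmt.Println("GOMAXPROCS env :", os.Getenv("GOMAXPROCS")) // empty string means "not set"
	fmt.Println("GOMAXPROCS used:", runtime.GOMAXPROCS(0))   // passing 0 only reads the current value
	fmt.Println("NumCPU         :", runtime.NumCPU())        // CPUs visible to the Go runtime
}
```

On a shared cluster node, whether NumCPU reports the 8 requested cores or the whole node depends on whether the job manager actually pins the process to its allotted CPUs.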
I think all we need to do here is increase the timeout, for Go 1.9.
I modified the test (and also the rescheduling code) in this CL; it may work better now: https://go-review.googlesource.com/c/36206/
Seen on our openbsd-amd64 builder: https://build.golang.org/log/86fd5a7f6d52845eeecce77a39bb5e799ebf2b6a Alex
The last four openbsd-386 builds all failed with this too. I'm going to disable the test.
CL https://golang.org/cl/40651 mentions this issue.
Updates #18589

Change-Id: I2c3bbc8257c68295051bd2e63e1e11794d0609c3
Reviewed-on: https://go-review.googlesource.com/40651
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
CL https://golang.org/cl/42431 mentions this issue.
5 shards, each of which spins up NumCPU processes, each of which is running at GOMAXPROCS=NumCPU, is too much for one machine. It makes my laptop unusable.

It might also be in part responsible for test flakes that require a moderately responsive system, like #18589 (backedge scheduling) and #19276 (locklinear).

It's possible that Go should be a better neighbor in general; that's #17969. In the meantime, fix this corner of the world.

Builders snapshot the world and run shards on different machines, so keeping sharding high for them is good.

This is a partial reversion of CL 18199.

Fixes #20141.

Change-Id: I123cf9436f4f4da3550372896265c38117b78071
Reviewed-on: https://go-review.googlesource.com/42431
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
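To make the arithmetic in that commit message concrete, here is a back-of-the-envelope sketch (not code from the CL; the shard count is taken from the message above):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	const shards = 5 // per the commit message above
	cpus := runtime.NumCPU()

	// Each shard starts NumCPU test processes, and each process runs at
	// GOMAXPROCS=NumCPU, so the worst-case number of runnable threads is:
	worst := shards * cpus * cpus
	fmt.Printf("%d cores -> up to %d runnable threads (%dx oversubscription)\n",
		cpus, worst, worst/cpus)
	// e.g. 8 cores -> up to 320 runnable threads, a 40x oversubscription;
	// easily enough load to delay a goroutine past a 10-second test timeout.
}
```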
Moved to 1.12; the test will likely change or be obsoleted when we add signal-based preemption.
Punting to 1.13; we aren't going to do anything for this cycle.
The fixedbug/issue10958.go test, introduced in 7f1ff65 (cmd/compile: insert scheduling checks on loop backedges), is flaky.

What version of Go are you using (go version)?

go version devel +7f1ff65 Mon Jan 9 21:01:29 2017 +0000 linux/amd64

What operating system and processor architecture are you using (go env)?

What did you do?

Ran all.bash 3 times on a CentOS 6.5 server.

What did you see?

It failed reliably with:
It happened on the linux/mips64 builder too:
https://build.golang.org/log/3f4be6b0c792fd179683d0046a2e6f178f8928d2