IAVL crashes with "value missing hash" under high load #240

Closed
p4u opened this issue May 8, 2020 · 12 comments

p4u commented May 8, 2020

When running a benchmark under high load, IAVL panics. We have seen this error many times already, on different servers. It looks like something is missing from the IAVL tree, but we are not sure what. After restarting the program, everything goes back to normal and IAVL works fine.

panic: Value missing for hash 5c762dd92f621bd3df4f93cdd153d5639590de63f8be117c317615faab68c0c6 corresponding to nodeKey n\v-�/b�O���S�c���c��|1v��h��

goroutine 1452118376 [running]:
github.com/tendermint/iavl.(*nodeDB).GetNode(0xc001fd0600, 0xc0724f3ea0, 0x20, 0x20, 0x0)
	/go/pkg/mod/github.com/tendermint/iavl@v0.13.3/nodedb.go:109 +0x7bf
github.com/tendermint/iavl.(*Node).getRightNode(...)
	/go/pkg/mod/github.com/tendermint/iavl@v0.13.3/node.go:417
github.com/tendermint/iavl.(*Node).get(0xc071dd45a0, 0xc0c6762b40, 0xc021c80c60, 0x81, 0x90, 0x81, 0xffffffffffffffff, 0xc0c934ad40, 0xc0475daff2)
	/go/pkg/mod/github.com/tendermint/iavl@v0.13.3/node.go:174 +0x2f2
github.com/tendermint/iavl.(*Node).get(0xc071dd4500, 0xc0c6762b40, 0xc021c80c60, 0x81, 0x90, 0x81, 0x1, 0x1, 0x30ab980)
	/go/pkg/mod/github.com/tendermint/iavl@v0.13.3/node.go:172 +0x1ac
github.com/tendermint/iavl.(*Node).get(0xc071c8bf40, 0xc0c6762b40, 0xc021c80c60, 0x81, 0x90, 0x81, 0xffffffffffffffff, 0xc017393410, 0xe8b266)
	/go/pkg/mod/github.com/tendermint/iavl@v0.13.3/node.go:175 +0x26f
github.com/tendermint/iavl.(*Node).get(0xc071f69cc0, 0xc0c6762b40, 0xc021c80c60, 0x81, 0x90, 0x81, 0x1, 0xf, 0xc0054bea00)
	/go/pkg/mod/github.com/tendermint/iavl@v0.13.3/node.go:172 +0x1ac
github.com/tendermint/iavl.(*Node).get(0xc071f69c20, 0xc0c6762b40, 0xc021c80c60, 0x81, 0x90, 0x90, 0x90, 0x81, 0xc021c80c60)
	/go/pkg/mod/github.com/tendermint/iavl@v0.13.3/node.go:175 +0x26f
github.com/tendermint/iavl.(*ImmutableTree).Get(0xc0c6762b40, 0xc021c80c60, 0x81, 0x90, 0x81, 0x90, 0xc0173934a8, 0xf6cf71)
	/go/pkg/mod/github.com/tendermint/iavl@v0.13.3/immutable_tree.go:157 +0x5a
gitlab.com/vocdoni/go-dvote/vochain.(*State).Envelope(0xc0341e7380, 0xc021c80bd0, 0x81, 0x1, 0x2, 0xc021c80bd0, 0x81)
	/src/vochain/iavl.go:477 +0xd3
gitlab.com/vocdoni/go-dvote/router.(*Router).getEnvelopeStatus(0xc0267c2690, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/src/router/voteCallbacks.go:98 +0x22c
created by gitlab.com/vocdoni/go-dvote/router.(*Router).Route
	/src/router/router.go:284 +0x429

tessr commented May 8, 2020

Hello! Thanks for giving us a heads up on this. We'll take a look ASAP.


p4u commented May 8, 2020

Thanks for the quick reply @tessr

That could be a data race on our side. We are investigating it and will report more information once available.

erikgrinaker commented May 8, 2020

Yeah, I don't believe IAVL is concurrency-safe (although I'm not 100% sure), so if the application is making concurrent accesses - in particular writes - then I believe you'd see problems like this under load. Are you using IAVL concurrently? Do you see the same problems if you put an RWMutex around it?

erikgrinaker self-assigned this May 8, 2020

p4u commented May 8, 2020

We are using an RWMutex already. However, we suspect that the fact that we use pointers, i.e.

app.State.IAppTree = app.State.AppTree.ImmutableTree

could cause a data race when app.State.AppTree changes.
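
For illustration, a minimal sketch of the aliasing problem described above (the field names are taken from the snippet; the surrounding types are hypothetical). The assignment copies a pointer, not a snapshot, so later writes through AppTree mutate the very tree that IAppTree readers are traversing, even if each field is guarded by its own lock:

// Hypothetical types mirroring the snippet above.
type State struct {
	AppTree  *iavl.MutableTree
	IAppTree *iavl.ImmutableTree
}

// IAppTree now aliases the MutableTree's embedded ImmutableTree;
// it is a shared pointer, not an independent snapshot.
app.State.IAppTree = app.State.AppTree.ImmutableTree

// Any subsequent write, e.g. app.State.AppTree.Set(key, value),
// mutates the same tree that concurrent app.State.IAppTree.Get
// calls are reading - a data race unless both paths share a lock.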

erikgrinaker commented May 8, 2020

Ah - yes, I think that might be the cause. Let us know what you find; if there is a problem with IAVL itself, a minimal reproducible case would be very helpful.
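
For what it's worth, a reproducer for the unsynchronized pattern might look like the sketch below. This is not the reporter's code, just a guess at the shape, assuming the v0.13 API (NewMutableTree returning an error, tm-db's in-memory backend). Running it with the race detector (go run -race) should surface the conflict even when it doesn't panic:

package main

import (
	"fmt"

	"github.com/tendermint/iavl"
	dbm "github.com/tendermint/tm-db"
)

func main() {
	tree, err := iavl.NewMutableTree(dbm.NewMemDB(), 1000)
	if err != nil {
		panic(err)
	}

	done := make(chan struct{})

	// Writer: mutates the tree and periodically saves versions.
	go func() {
		defer close(done)
		for i := 0; i < 100000; i++ {
			tree.Set([]byte(fmt.Sprintf("key-%d", i)), []byte("value"))
			if i%1000 == 0 {
				if _, _, err := tree.SaveVersion(); err != nil {
					panic(err)
				}
			}
		}
	}()

	// Reader: queries the embedded ImmutableTree concurrently,
	// with no synchronization against the writer above.
	for {
		select {
		case <-done:
			return
		default:
			tree.ImmutableTree.Get([]byte("key-1"))
		}
	}
}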


mvdan commented May 8, 2020

@erikgrinaker just to double check - if one is running write operations on the mutable tree, such as tree.Set, it's not safe to concurrently call read operations on its immutable tree, such as tree.ImmutableTree.Get. I assume this is the case because, from reading your code, I can see that a mutable tree operation can modify the immutable tree, such as modifying the immutable tree's root node.

If the above is correct, then it's very probably a data race on our side. We do have a mutex guarding the immutable tree, like @p4u mentioned, but the write operations on the mutable tree aren't guarded by the same mutex. So we might end up calling tree.Set and tree.ImmutableTree.Get concurrently.
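
Roughly, the structural reason (a simplified paraphrase of the v0.13 types, not the exact definitions):

// MutableTree embeds a *ImmutableTree, so "the immutable tree" is
// not a separate copy: write operations on the MutableTree replace
// fields (such as root) on that same embedded struct in place.
type ImmutableTree struct {
	root *Node
	// ...
}

type MutableTree struct {
	*ImmutableTree // the current working tree, mutated in place
	// ...
}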

erikgrinaker commented May 8, 2020

I'm not too familiar with the details here, but that would not surprise me - for example when removing orphan nodes or pruning old versions. To be safe, I would use a mutex for all operations against the entire IAVL store.

We should really specify the exact concurrency semantics for IAVL, and maybe down the line implement some form of concurrency control.
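
As a sketch of the interim workaround (the wrapper and its names are illustrative, not part of IAVL): route every operation through one lock. An RWMutex matches the earlier suggestion, but since the safety of concurrent readers is exactly what is undocumented, swapping in a plain sync.Mutex would be the conservative choice:

import "sync"

// SafeTree is a hypothetical wrapper that serializes all access
// to a single IAVL store behind one lock.
type SafeTree struct {
	mu   sync.RWMutex
	tree *iavl.MutableTree
}

func (s *SafeTree) Set(key, value []byte) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.tree.Set(key, value)
}

func (s *SafeTree) SaveVersion() ([]byte, int64, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.tree.SaveVersion()
}

func (s *SafeTree) Get(key []byte) (int64, []byte) {
	s.mu.RLock() // use mu.Lock() if concurrent reads turn out to be unsafe
	defer s.mu.RUnlock()
	return s.tree.Get(key) // delegates to the embedded ImmutableTree
}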


mvdan commented May 8, 2020

Thanks - sounds good to me. I was going to suggest that the docs be clarified a bit, for now. The immutable tree does say "Note that this tree is not thread-safe.", but we didn't understand that to also cover operations on the parent mutable tree.

erikgrinaker commented May 8, 2020

For sure, we should clarify what the intended behavior here is. It is sort of unintuitive that an immutable tree is, well, mutable.


mvdan commented May 8, 2020

Yes, that's a good point. I think that's what caused the confusion for us - the name ImmutableTree seems to imply a completely immutable, static snapshot of a tree, when it's not. It's meant to be used immutably, but it is not itself immutable.

erikgrinaker commented May 8, 2020

I'll close this for now, but do let us know if you still see problems even with a mutex around all operations.


p4u commented May 8, 2020

I can confirm that the issue was on our side, due to a data race. It is now fixed, and after running several benchmarks we have seen no panics.

Thanks for your time and help @erikgrinaker
