Unable to copy layered worldstate on 22.10.2 - Snapshots=False #4784

Closed
non-fungible-nelson opened this issue Dec 7, 2022 · 17 comments · Fixed by #4906
Labels
bug, mainnet

Comments

@non-fungible-nelson
Contributor

non-fungible-nelson commented Dec 7, 2022

Description

A Besu user is experiencing the following bug:

2022-12-07 08:38:20.491-07:00 | vert.x-worker-thread-0 | INFO | MainnetBlockValidator | Optional[Unable to copy Layered Worldstate for 0x3c2a249675dca59a4477c9a3ffdcc51e203617266f1a0f6434fa1a1dc5fec3f0]. Block 16130345 (0xa8b549ea57d83df60aec71c6fb40ab8391b0f640d7a2d2a013fce3a296800ad0), caused by org.hyperledger.besu.plugin.services.exception.StorageException: Unable to copy Layered Worldstate for 0x3c2a249675dca59a4477c9a3ffdcc51e203617266f1a0f6434fa1a1dc5fec3f0...

This occurs directly after a restart, and the node is stuck; a further restart does not fix it. Initially the user experienced the error below, then, upon restart, the error above.

2022-12-07 08:09:44.904-07:00 | vert.x-worker-thread-0 | WARN | EngineNewPayload | Invalid new payload: number: 16130345, hash: 0xa8b549ea57d83df60aec71c6fb40ab8391b0f640d7a2d2a013fce3a296800ad0, parentHash: 0x3c2a249675dca59a4477c9a3ffdcc51e203617266f1a0f6434fa1a1dc5fec3f0, latestValidHash: 0x3c2a249675dca59a4477c9a3ffdcc51e203617266f1a0f6434fa1a1dc5fec3f0, status: INVALID, validationError: Unable to process block because parent world state 0x3b34cb5ab01b1b590a73ac002acbf60790360952ed1679bd6d9f9b608856b29a is not available

User does not have snapshots enabled. User is on 22.10.2.

Discord context

Acceptance Criteria

  • Bonsai is resilient to this concurrency bug

Steps to Reproduce (Bug)

  1. Run Besu
  2. Halt or restart
  3. Experience error (specific DB state)

Expected behavior:
Bonsai resumes as normal and the node continues processing blocks.

Actual behavior:
The node is stuck, repeatedly logging "Unable to copy Layered Worldstate" StorageExceptions; restarting does not recover it.

Frequency:
Infrequent

Versions (Add all that apply)

  • Software version: 22.10.2

Additional Information (Add any of the following or anything else that may be relevant)

  • Besu setup info - snapshots false (see the flag sketch after this list)
  • System info - memory, CPU
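For reference, a minimal sketch of the setup described above as CLI flags. The Bonsai snapshots option was experimental at the time, so the spelling of --Xbonsai-use-snapshots is an assumption here, and --data-path is a placeholder; verify flag names against besu --help for your version.

# Sketch of a 22.10.x-era Bonsai node with snapshots disabled.
# --Xbonsai-use-snapshots is assumed to be the experimental flag name;
# --data-path is a placeholder for your own data directory.
besu \
  --data-path=/var/lib/besu \
  --data-storage-format=BONSAI \
  --Xbonsai-use-snapshots=false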
@yorickdowne

yorickdowne commented Dec 20, 2022

Yep, ran into this. Restart doesn't fix it; no Bonsai snapshots. Running Bonsai tries with snap sync.

eth-main-execution-1  | 2022-12-20 13:41:14.414+00:00 | vert.x-worker-thread-0 | INFO  | MainnetBlockValidator | Optional[Unable to copy Layered Worldstate for 0x1551f6d040298ec23c52f0fcf6bab2ee80cff9b958e1bdc27c11ba98b612b1d4]. Block 16224004 (0xe6f16e7d0103a45734a8d744de0b1c57c143fac4e750ebd9afd6e0ee708e94fc), caused by org.hyperledger.besu.plugin.services.exception.StorageException: Unable to copy Layered Worldstate for 0x1551f6d040298ec23c52f0fcf6bab2ee80cff9b958e1bdc27c11ba98b612b1d4
eth-main-execution-1  | 2022-12-20 13:41:19.451+00:00 | vert.x-worker-thread-0 | INFO  | MainnetBlockValidator | Optional[Unable to copy Layered Worldstate for 0x1551f6d040298ec23c52f0fcf6bab2ee80cff9b958e1bdc27c11ba98b612b1d4]. Block 16224004 (0xe6f16e7d0103a45734a8d744de0b1c57c143fac4e750ebd9afd6e0ee708e94fc), caused by org.hyperledger.besu.plugin.services.exception.StorageException: Unable to copy Layered Worldstate for 0x1551f6d040298ec23c52f0fcf6bab2ee80cff9b958e1bdc27c11ba98b612b1d4

Frequency: Reasonably frequent. Saw it in 2 out of 4 fresh syncs so far.

@bryson-m

I ran into this today as well. Same hashes and block number as @yorickdowne

@AegeanDad

AegeanDad commented Dec 22, 2022

I am having this exact issue following a proper shutdown and reboot. Prior to this, I had deleted the Besu DB and completed a fresh sync. Following the sync, everything was normal until the reboot. Running 22.10.3.

Dec 22 15:31:26 GHBeelink2 besu[4272]: 2022-12-22 15:31:26.971-08:00 | vert.x-worker-thread-0 | INFO | MainnetBlockValidator | Optional[Unable to copy Layered Worldstate for 0x3bc82334b7902026f2e9ed64122ccf224899f014dc25823a8b30568e588f20cd]. Block 16243159 (0xd25d6b550954f0b0358297170861224b6136e17d54180fb3ddfa0b04c7964f06), caused by org.hyperledger.besu.plugin.services.exception.StorageException: Unable to copy Layered Worldstate for 0x3bc82334b7902026f2e9ed64122ccf224899f014dc25823a8b30568e588f20cd

@0xcd2c6

0xcd2c6 commented Dec 22, 2022

Experiencing the same issue after having to transfer validators to a different machine earlier this week because of an unrelated issue. Running 22.10.3 combined with Teku. Have deleted db and trying to resync currently.

@garyschulte
Contributor

This issue is mitigated in the 23.1.0-beta release coming on the 28th. The underlying data problem that is the cause of this issue persists, but 23.1.0 should be able to recover from these kinds of errors via backward sync of subsequent blocks.

@timjrobinson

timjrobinson commented Jan 1, 2023

I started getting this issue and tried the 23.1.0 release candidate docker image (23.1.0-RC1-SNAPSHOT-openjdk-latest) to mitigate it and am getting these errors instead:

rocketpool_eth1  | 2023-01-01 02:26:03.390+00:00 | vert.x-eventloop-thread-17 | ERROR | ExecutionEngineJsonRpcMethod | failed to exec consensus method engine_newPayloadV1
rocketpool_eth1  | java.lang.RuntimeException: java.lang.RuntimeException: org.hyperledger.besu.plugin.services.exception.StorageException: Unable to copy Layered Worldstate for 0x334a634a192f87ae76c23c1b8a1f40619382d00825877d07ebef1947d777f637
rocketpool_eth1  | 	at org.hyperledger.besu.ethereum.MainnetBlockValidator.validateAndProcessBlock(MainnetBlockValidator.java:175)
rocketpool_eth1  | 	at org.hyperledger.besu.consensus.merge.blockcreation.MergeCoordinator.validateBlock(MergeCoordinator.java:441)
rocketpool_eth1  | 	at org.hyperledger.besu.consensus.merge.blockcreation.MergeCoordinator.rememberBlock(MergeCoordinator.java:455)
rocketpool_eth1  | 	at org.hyperledger.besu.consensus.merge.blockcreation.TransitionCoordinator.rememberBlock(TransitionCoordinator.java:142)
rocketpool_eth1  | 	at org.hyperledger.besu.ethereum.api.jsonrpc.internal.methods.engine.EngineNewPayload.syncResponse(EngineNewPayload.java:219)
rocketpool_eth1  | 	at org.hyperledger.besu.ethereum.api.jsonrpc.internal.methods.ExecutionEngineJsonRpcMethod.lambda$response$0(ExecutionEngineJsonRpcMethod.java:73)
rocketpool_eth1  | 	at io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:159)
rocketpool_eth1  | 	at io.vertx.core.impl.AbstractContext.dispatch(AbstractContext.java:100)
rocketpool_eth1  | 	at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$1(ContextImpl.java:157)
rocketpool_eth1  | 	at io.vertx.core.impl.TaskQueue.run(TaskQueue.java:76)
rocketpool_eth1  | 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
rocketpool_eth1  | 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
rocketpool_eth1  | 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
rocketpool_eth1  | 	at java.base/java.lang.Thread.run(Thread.java:833)
rocketpool_eth1  | Caused by: java.lang.RuntimeException: org.hyperledger.besu.plugin.services.exception.StorageException: Unable to copy Layered Worldstate for 0x334a634a192f87ae76c23c1b8a1f40619382d00825877d07ebef1947d777f637
rocketpool_eth1  | 	at org.hyperledger.besu.ethereum.bonsai.BonsaiLayeredWorldState.copy(BonsaiLayeredWorldState.java:275)
rocketpool_eth1  | 	at org.hyperledger.besu.ethereum.MainnetBlockValidator.lambda$validateAndProcessBlock$0(MainnetBlockValidator.java:120)
rocketpool_eth1  | 	at java.base/java.util.Optional.map(Optional.java:260)
rocketpool_eth1  | 	at org.hyperledger.besu.ethereum.MainnetBlockValidator.validateAndProcessBlock(MainnetBlockValidator.java:117)
rocketpool_eth1  | 	... 13 more
rocketpool_eth1  | Caused by: org.hyperledger.besu.plugin.services.exception.StorageException: Unable to copy Layered Worldstate for 0x334a634a192f87ae76c23c1b8a1f40619382d00825877d07ebef1947d777f637
rocketpool_eth1  | 	at org.hyperledger.besu.ethereum.bonsai.BonsaiLayeredWorldState.lambda$copy$4(BonsaiLayeredWorldState.java:272)
rocketpool_eth1  | 	at java.base/java.util.Optional.orElseThrow(Optional.java:403)
rocketpool_eth1  | 	at org.hyperledger.besu.ethereum.bonsai.BonsaiLayeredWorldState.copy(BonsaiLayeredWorldState.java:269)
rocketpool_eth1  | 	... 16 more

@non-fungible-nelson
Contributor Author

@timjrobinson - can you provide us your config?
@garyschulte - see above and let's discuss tomorrow.

@garyschulte
Contributor

@timjrobinson, 23.1.0 mitigates the issue by preventing database corruption when this error occurs. If you ran into it on 22.10.x, the database is already in an inconsistent state. I am working on a standalone CLI tool that can repair the state, if you still have a Besu database in that condition (i.e. you have not tried to resync).

@timjrobinson

@non-fungible-nelson the config is all rocketpool defaults but with snap sync enabled.

I deleted my besu data to switch back to geth so unfortunately I don't have that database any more.
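(For context, a minimal sketch of the snap sync setting mentioned above. Snap sync was still experimental in 22.10.x-era builds, so the X_SNAP spelling is an assumption; later releases accept SNAP.)

# Assumed experimental spelling of the snap sync mode at the time;
# check besu --help for your version before relying on it.
besu --sync-mode=X_SNAP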

@j4cko

j4cko commented Jan 4, 2023

I ran into the same issue a few weeks back and admittedly switched to geth at the time. But I still have the corrupted database. If there is something I can try, let me know (could also share).

@felix-halim

I also ran into the same issue today:

2023-01-12 11:22:18.852-08:00 | vert.x-worker-thread-0 | INFO | MainnetBlockValidator | Optional[Unable to copy Layered Worldstate for 0x8a8d1fbd1f40bea01fd7a4ec7e7c50661f3e5f30baf16de5db99fc660c53f882]. Block 16387938 (0x8c0993a6ee49b3e8f4726400117d377c1c358037b28701c4955ad58a303f2dcd), caused by org.hyperledger.besu.plugin.services.exception.StorageException: Unable to copy Layered Worldstate for 0x8a8d1fbd1f40bea01fd7a4ec7e7c50661f3e5f30baf16de5db99fc660c53f882

I am using 22.10.3 with Bonsai and checkpoint sync.

Is there a way to fix this using command line yet?
Or should I delete everything and sync from the start?
This will happen again, right?

Thanks!

@non-fungible-nelson
Contributor Author

Hi folks - this bug is a "symptom" of three underlying issues we have patched:
#4786 Bugfix snapshot transaction segfaults after storage truncation
#4862 Bugfix potential chain head and worldstate inconsistency
#4906 Bugfix for selfdestruct and bonsai during heal step.

We will be testing these over the weekend for a release next week. The upcoming release will also include a boot-time check for these types of errors that corrects them or resyncs the worldstate (much shorter than a full resync). But if your node is experiencing this on 22.10.3, a resync is the best way to address it for now. The trigger for the majority of these bugs is very rare (fixed in #4906), and a resync should fix it. If you happen to be unlucky and complete your snap sync under the narrow conditions for this bug, it won't manifest immediately, and should be rectified when we release 22.10.4. Thanks for the patience.
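For anyone asking how to resync: a minimal sketch, assuming a systemd-managed node; the service name besu.service and the path /var/lib/besu are placeholders for your own setup.

# Stop the node, remove the database under the data path, and restart;
# Besu rebuilds it by syncing from scratch. Service name and path are
# assumptions - substitute your own values.
sudo systemctl stop besu.service
rm -rf /var/lib/besu/database
sudo systemctl start besu.service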

@AegeanDad

AegeanDad commented Jan 12, 2023 via email

@non-fungible-nelson
Contributor Author

Definitely leave your node running. The problem can actually begin during the sync phase (i.e. one of the bugs we patched was in the "heal" step of sync, which improperly filled the database, leading to these worldstate errors). If your node is running fine for now, definitely let it run while we get the release ready for next week.

@b-m-f

b-m-f commented Jan 29, 2023

This happened to me twice in one week.

The first time it was triggered by a full SSD. The second time there was enough storage, and it occurred 2 days after a successful resync.

Bonsai with snapshots.

Edit:

I have updated to the 23.1-SNAPSHOT image from Docker Hub and started a resync. Hope there is no problem with this.

@non-fungible-nelson
Contributor Author

non-fungible-nelson commented Jan 29, 2023

@b-m-f make sure to disable snapshots if they are enabled. At this time, they are exacerbating these issues; we are tracking this in #4768. These bugs all manifest a bit differently, but share the same root cause.

@non-fungible-nelson
Contributor Author

Tracking in multiple other locations. Closing here.
