feat(op-node/op-batcher/op-proposer): add fallbackClient #18
Conversation
* copy over develop chains.go
* stage rollup config changes
* final rollup config values

Update to upstream optimism (v1.1.0) release

… and handleReceipt access state concurrently (#5)

* chore: update readme, add testnet assets
* doc: clarify readme

* ci: add the CI code used to package and release docker images (#7)
* fix: add latest tag for docker image (#9)
* try to use cache for docker build (#10)

Co-authored-by: Welkin <welkin.b@nodereal.com>
The current implementation only handles the scenario where the first RPC endpoint is down for a while. With unstable providers, the error rate may still be high. To increase the success rate, we could try every endpoint in the URL list in turn when a call fails. However, if some of the endpoints are down, this would increase overall latency, since we would keep retrying the dead services. To avoid that, we could add logic such as thresholds, health checks, and auto-recovery: for example, mark a URL as 'unavailable' once it produces too many errors within a time window, and run a background goroutine that probes the failed URLs, marking them 'healthy' again once they respond.

Trying multiple endpoints within a single call may further aggravate discrepancies in block height, so we have decided to set this strategy aside for the moment.
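For reference, the deferred health-check idea could look roughly like the sketch below. This is a minimal illustration; all identifiers (`endpointPool`, `recordError`, `watch`, the probe function) are hypothetical and none of this code is in the PR.

```go
package fallback

import (
	"context"
	"sync"
	"time"
)

// endpointState tracks errors for a single RPC URL.
type endpointState struct {
	url       string
	errCount  int64
	unhealthy bool
}

// endpointPool marks endpoints unavailable past an error threshold
// and recovers them via background probing.
type endpointPool struct {
	mu        sync.Mutex
	endpoints []*endpointState
	threshold int64
}

// recordError marks an endpoint 'unavailable' once its error count
// exceeds the threshold within the current sampling window.
func (p *endpointPool) recordError(e *endpointState) {
	p.mu.Lock()
	defer p.mu.Unlock()
	e.errCount++
	if e.errCount > p.threshold {
		e.unhealthy = true
	}
}

// watch probes failed endpoints in the background and marks them
// 'healthy' again once they respond.
func (p *endpointPool) watch(ctx context.Context, probe func(url string) error) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			p.mu.Lock()
			for _, e := range p.endpoints {
				if e.unhealthy && probe(e.url) == nil {
					e.unhealthy = false
					e.errCount = 0
				}
			}
			p.mu.Unlock()
		}
	}
}
```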
The code adds metrics so that we can immediately detect a fallback event when it occurs.
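As an illustration of how such a metric could be wired up with the standard Prometheus client, here is a minimal sketch; the metric name, namespace, and helper below are hypothetical and may differ from what the PR actually registers.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// fallbackSwitches counts fallback events, labeled by the URL switched to.
// The name and namespace here are illustrative only.
var fallbackSwitches = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "op_node",
		Name:      "fallback_client_switches_total",
		Help:      "Count of FallbackClient endpoint switches.",
	},
	[]string{"url"},
)

func init() {
	prometheus.MustRegister(fallbackSwitches)
}

// RecordFallback would be called whenever the client switches endpoints.
func RecordFallback(newURL string) {
	fallbackSwitches.WithLabelValues(newURL).Inc()
}
```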
const BatcherFallbackThreshold int64 = 10
const ProposerFallbackThreshold int64 = 3
const TxmgrFallbackThreshold int64 = 3
too small to tolerate network jitter in my opinion
Op-batcher and op-proposer do not have metrics like op_node_default_rpc_client_responses_total for reference, but when I tested locally I found that 10 sometimes fails to trigger the fallback, especially for Txmgr, because it submits transactions periodically and the request frequency is very low. Do you have any better suggested values?
Added an e2e case, TestL1FallbackClient_SwitchUrl, so that we can test the FallbackClient's URL switching locally.
This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.
# Conflicts:
#	assets/testnet/genesis.json
#	op-batcher/metrics/metrics.go
#	op-node/metrics/metrics.go
move to #55
The FallbackClient is designed to automatically switch to an alternative L1 endpoint when the current one encounters issues, and to revert to the primary endpoint once it is functional again.

The core logic works as follows: when an error occurs, it is recorded until either the error count surpasses a predetermined threshold or the count is reset by a timer that fires once per minute. If the threshold is exceeded within a one-minute span, the L1 endpoint is considered unreliable, and the system selects the next address in the L1 URL list and creates a new RPC client to replace the current one. Meanwhile, a separate goroutine monitors the original endpoint's health and, once it is deemed stable, switches back to it.
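To make the error-counting and switch-over mechanics concrete, here is a minimal sketch of that logic under the description above; the struct layout and method names are illustrative, not the PR's actual code.

```go
package fallback

import (
	"sync"
	"sync/atomic"
	"time"
)

// FallbackClient holds the ordered L1 URL list and an error counter.
type FallbackClient struct {
	mu        sync.Mutex
	urls      []string
	current   int
	errCount  atomic.Int64
	threshold int64
}

// startCounterReset clears the error count once per minute, so only
// errors concentrated within a one-minute window can trip the threshold.
func (f *FallbackClient) startCounterReset() {
	go func() {
		ticker := time.NewTicker(time.Minute)
		defer ticker.Stop()
		for range ticker.C {
			f.errCount.Store(0)
		}
	}()
}

// handleErr records an error; once the threshold is exceeded it rotates
// to the next URL in the list.
func (f *FallbackClient) handleErr(err error) {
	if err == nil {
		return
	}
	if f.errCount.Add(1) >= f.threshold {
		f.switchToNextURL()
	}
}

func (f *FallbackClient) switchToNextURL() {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.current = (f.current + 1) % len(f.urls)
	f.errCount.Store(0)
	// The real client would dial f.urls[f.current] here and swap the
	// underlying RPC client; a separate goroutine would probe the
	// primary endpoint and switch back once it is healthy again.
}
```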
Because the clients used in op-node and op-batcher/op-proposer differ significantly, two distinct FallbackClients have been implemented. The op-node version adds features such as subscription management and RPC validity checks, resulting in a more complex implementation.
Example of the l1 flag after the code change:
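The original example was not captured here; assuming the flag accepts a comma-separated URL list as described above (first entry primary, the rest fallbacks), it would look something like the following, with placeholder URLs:

```
--l1="https://l1-provider-a.example.com,https://l1-provider-b.example.com"
```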