Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(op-node/op-batcher/op-proposer): add fallbackClient #18

Closed
wants to merge 26 commits into from

Conversation

welkin22
Copy link
Contributor

@welkin22 welkin22 commented Jul 7, 2023

The FallbackClient is designed to automatically switch to an alternative L1 endpoint when the current one encounters issues, and revert back once the primary endpoint is functional again.

The core logic operates as follows: When an error occurs, it is recorded until either the error count surpasses a predetermined threshold or the count is reset by a timer that executes once per minute. If the threshold is exceeded within a one-minute span, it suggests the L1 endpoint has become unreliable, prompting the system to select the next address in the L1 URL. A new RPC client is then created to replace the current one. Simultaneously, a separate Goroutine monitors the initial endpoint's health and, when deemed stable, reverts back to its use.

As the clients employed in op-node and op-batcher/op-proposer differ significantly, two distinct fallbackClients have been implemented. The op-node version incorporates additional features such as subscription management and RPC legality checks, resulting in a more complex implementation.

Example of l1 flag after code modification:

--l1=https://data-seed-prebsc-1-s1.binance.org:8545,https://data-seed-prebsc-2-s2.binance.org:8545,https://data-seed-prebsc-2-s3.binance.org:8545

trianglesphere and others added 17 commits June 1, 2023 16:53
* copy over develop chainsgo

* stage rollup config changes

* final rollup config values
Update to upstream optimism(v1.1.0) release
… and handleReceipt access state concurrently (#5)
* chore: update readme, add testnet assets

* doc: clarify readme
* ci: add the ci code used to package and release docker images (#7)

* ci: add the ci code used to package and release docker images

Co-authored-by: Welkin <welkin.b@nodereal.com>

* fix: add latest tag for docker image (#9)

Co-authored-by: Welkin <welkin.b@nodereal.com>

* try to use cache for docker build (#10)

Co-authored-by: Welkin <welkin.b@nodereal.com>

---------

Co-authored-by: Welkin <welkin.b@nodereal.com>
@owen-reorg
Copy link
Collaborator

The current implementation only deals with the scenario that the first RPC endpoint is down for a while.

For unstable providers, the error rate may be still high.

To increase the success rate, we can try all endpoints in the URL list one by one if it fails.

However, if some of the endpoints are down, it could increase the overall latency as we would still try the down services. To avoid this, we can retain logic such as thresholds, health checks, and auto recovery. For example, we can mark the URL as 'unavailable' if it gets too many errors in a time range, and we can have a background goroutine to check the failed URLs. If they are alive again, we can mark them as 'healthy'.

@welkin22
Copy link
Contributor Author

The current implementation only deals with the scenario that the first RPC endpoint is down for a while.

For unstable providers, the error rate may be still high.

To increase the success rate, we can try all endpoints in the URL list one by one if it fails.

However, if some of the endpoints are down, it could increase the overall latency as we would still try the down services. To avoid this, we can retain logic such as thresholds, health checks, and auto recovery. For example, we can mark the URL as 'unavailable' if it gets too many errors in a time range, and we can have a background goroutine to check the failed URLs. If they are alive again, we can mark them as 'healthy'.

Trying multiple endpoints within a single call may further aggravate the discrepancy in block height. Thus, we have decided to set aside this strategy for the moment.

@welkin22
Copy link
Contributor Author

The code adds metrics, allowing us to immediately perceive a fallback event when it occurs

@welkin22 welkin22 requested review from owen-reorg and bnoieh July 13, 2023 02:26
Comment on lines +25 to +27
const BatcherFallbackThreshold int64 = 10
const ProposerFallbackThreshold int64 = 3
const TxmgrFallbackThreshold int64 = 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too small to tolerate network jitter in my opinion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Op-batcher and op-proposer do not have metrics like op_node_default_rpc_client_responses_total for reference, but when I tested locally, I found that 10 sometimes cannot trigger the fallback well, especially Txmgr, because it periodically submits transactions, and the request frequency is very low. Do you have any better suggested values?

@welkin22
Copy link
Contributor Author

Added an e2e case: TestL1FallbackClient_SwitchUrl, so that we can test the effect of FallbackClient switching url locally

@github-actions
Copy link
Contributor

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jul 29, 2023
@welkin22 welkin22 removed the Stale label Jul 31, 2023
@github-actions
Copy link
Contributor

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions
Copy link
Contributor

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Aug 30, 2023
@github-actions github-actions bot closed this Sep 4, 2023
@welkin22 welkin22 reopened this Sep 19, 2023
@welkin22 welkin22 removed the Stale label Sep 19, 2023
@owen-reorg owen-reorg changed the base branch from release/testnet to develop September 20, 2023 17:12
# Conflicts:
#	assets/testnet/genesis.json
#	op-batcher/metrics/metrics.go
#	op-node/metrics/metrics.go
@welkin22
Copy link
Contributor Author

move to #55

@welkin22 welkin22 closed this Sep 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants