feat(op-node/op-batcher/op-proposer): add fallbackClient #18

welkin22 · 2023-07-07T10:28:06Z

The FallbackClient is designed to switch automatically to an alternative L1 endpoint when the current one encounters issues, and to revert to the primary endpoint once it is functional again.

The core logic is as follows. Each error is recorded in a counter, and a timer resets that counter once per minute. If the count exceeds a predetermined threshold before the reset, that is, within a one-minute span, the L1 endpoint is considered unreliable and the client selects the next address in the L1 URL list, creating a new RPC client to replace the current one. Meanwhile, a separate goroutine monitors the health of the initial endpoint and switches back to it once it is deemed stable.
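The counting and reset behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `errorWindow`, `recordError`, `startResetLoop`, and the example URLs are hypothetical names, the `>=` comparison follows the later commit titled "should be >= threshold", and wrapping around to the start of the URL list is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// errorWindow counts errors between resets and reports when the count
// reaches the threshold. Hypothetical type, for illustration only.
type errorWindow struct {
	mu        sync.Mutex
	count     int64
	threshold int64
}

// recordError increments the counter and reports whether a switch is due.
func (w *errorWindow) recordError() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.count++
	return w.count >= w.threshold
}

// reset clears the counter.
func (w *errorWindow) reset() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.count = 0
}

// startResetLoop resets the counter on every tick until stop is closed.
func (w *errorWindow) startResetLoop(interval time.Duration, stop <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				w.reset()
			case <-stop:
				return
			}
		}
	}()
}

func main() {
	w := &errorWindow{threshold: 3}
	stop := make(chan struct{})
	defer close(stop)
	w.startResetLoop(time.Minute, stop)

	// Placeholder endpoints; index 0 plays the role of the primary.
	urls := []string{"https://l1-a.example", "https://l1-b.example"}
	current := 0

	// Simulate three consecutive errors inside one window.
	for i := 0; i < 3; i++ {
		if w.recordError() {
			w.reset()
			current = (current + 1) % len(urls) // assumption: wrap around
			fmt.Println("switching to", urls[current])
		}
	}
}
```

In a real client the switch would also create a new RPC client for the selected URL and start the goroutine that watches the initial endpoint; both are omitted here to keep the sketch short.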

As the clients employed in op-node and op-batcher/op-proposer differ significantly, two distinct fallbackClients have been implemented. The op-node version incorporates additional features such as subscription management and RPC legality checks, resulting in a more complex implementation.

Example of the --l1 flag after this change:

--l1=https://data-seed-prebsc-1-s1.binance.org:8545,https://data-seed-prebsc-2-s2.binance.org:8545,https://data-seed-prebsc-2-s3.binance.org:8545
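A comma-separated value like this has to be split into individual endpoints before any fallback logic can run. The sketch below shows one way to do that; `splitL1URLs` is a hypothetical helper, not a function from this PR, and treating the first entry as the primary endpoint is an assumption based on the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// splitL1URLs splits a comma-separated flag value into individual URLs,
// trimming whitespace and dropping empty entries. Hypothetical helper.
func splitL1URLs(value string) []string {
	var urls []string
	for _, part := range strings.Split(value, ",") {
		if u := strings.TrimSpace(part); u != "" {
			urls = append(urls, u)
		}
	}
	return urls
}

func main() {
	value := "https://data-seed-prebsc-1-s1.binance.org:8545," +
		"https://data-seed-prebsc-2-s2.binance.org:8545"
	for i, u := range splitL1URLs(value) {
		// Assumption: index 0 is the primary endpoint, the rest are fallbacks.
		fmt.Printf("%d: %s\n", i, u)
	}
}
```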




owen-reorg · 2023-07-11T07:48:34Z

The current implementation only handles the scenario in which the first RPC endpoint is down for a while.

For unstable providers, the error rate may still be high.

To increase the success rate, we could try every endpoint in the URL list, one by one, whenever a request fails.

However, if some of the endpoints are down, this could increase overall latency, because we would still be trying the dead services. To avoid that, we can retain the thresholds, health checks, and auto recovery. For example, we can mark a URL as 'unavailable' if it returns too many errors within a time range, and have a background goroutine check the failed URLs. Once they are alive again, we can mark them as 'healthy'.


welkin22 · 2023-07-12T13:09:53Z

> The current implementation only handles the scenario in which the first RPC endpoint is down for a while.
>
> For unstable providers, the error rate may still be high.
>
> To increase the success rate, we could try every endpoint in the URL list, one by one, whenever a request fails.
>
> However, if some of the endpoints are down, this could increase overall latency, because we would still be trying the dead services. To avoid that, we can retain the thresholds, health checks, and auto recovery. For example, we can mark a URL as 'unavailable' if it returns too many errors within a time range, and have a background goroutine check the failed URLs. Once they are alive again, we can mark them as 'healthy'.

Trying multiple endpoints within a single call may further aggravate block-height discrepancies, since different endpoints can be at different heights. We have therefore decided to set this strategy aside for the moment.

welkin22 · 2023-07-12T13:12:10Z

The code adds metrics, so that a fallback event can be noticed as soon as it occurs.


bnoieh · 2023-07-13T10:41:48Z

op-service/client/ethclient.go

+const BatcherFallbackThreshold int64 = 10
+const ProposerFallbackThreshold int64 = 3
+const TxmgrFallbackThreshold int64 = 3


Too small to tolerate network jitter, in my opinion.

welkin22 · 2023-07-14

op-batcher and op-proposer have no metric like op_node_default_rpc_client_responses_total to use as a reference. When I tested locally, I found that a threshold of 10 sometimes fails to trigger the fallback, especially for Txmgr, which only submits transactions periodically and therefore makes very few requests. Do you have better values to suggest?
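The tension between the two comments can be made concrete with a little arithmetic. In the sketch below, the constants are the ones from the diff; `canReachThreshold` and the request rates are hypothetical, and it assumes the error counter is reset once per window and compared with `>=`, as in the later commit titled "should be >= threshold".

```go
package main

import "fmt"

// Thresholds from the diff under review.
const (
	BatcherFallbackThreshold  int64 = 10
	ProposerFallbackThreshold int64 = 3
	TxmgrFallbackThreshold    int64 = 3
)

// canReachThreshold reports whether a client that issues requestsPerWindow
// requests between two counter resets could reach the threshold even if
// every one of those requests failed. Hypothetical helper.
func canReachThreshold(requestsPerWindow, threshold int64) bool {
	return requestsPerWindow >= threshold
}

func main() {
	// Hypothetical request rates per reset window.
	fmt.Println(canReachThreshold(2, TxmgrFallbackThreshold))    // low-frequency client
	fmt.Println(canReachThreshold(30, BatcherFallbackThreshold)) // busier client
}
```

A higher threshold tolerates more jitter, but a client that rarely talks to L1 may never accumulate enough errors in one window to trigger the fallback at all.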

welkin22 · 2023-07-14T03:02:19Z

Added an e2e case, TestL1FallbackClient_SwitchUrl, so that FallbackClient URL switching can be tested locally.

github-actions · 2023-07-29T01:50:21Z

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions · 2023-08-15T01:47:24Z

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions · 2023-08-30T01:48:00Z

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.


welkin22 · 2023-09-21T06:32:54Z

move to #55

trianglesphere and others added 17 commits June 1, 2023 16:53

op-batcher: Add metrics for pending L2 transaction data size (#5797)

70c10eb

feat(op-node): Finalize Mainnet Rollup Config [release branch] (#5905)

d826cb0

* copy over develop chainsgo * stage rollup config changes * final rollup config values

Merge branch 'op-v1.1.0' into update-upstream-v1.1.0

8c167a4

Merge pull request #3 from bnb-chain/update-upstream-v1.1.0

f489750

Update to upstream optimism(v1.1.0) release

fix(op-batcher): solve race condition of BatchSubmitter publishTxToL1…

3156060

… and handleReceipt access state concurrently (#5)

chore: update readme, add testnet assets (#9)

1d70e51

* chore: update readme, add testnet assets * doc: clarify readme

FallbackClient impl

b0fdefa

double check fail count

b590e29

RegisterSubscribeFunc

2206444

FallbackClient for op-batcher,op-proposer

e292543

miss currentClient

d067fc9

add log and change order

8118284

fallback client add fallbackThreshold

fca50a3

add validateRpc

075ac9b

add document

c67636e

add document

0b46fbf

owen-reorg reviewed Jul 11, 2023

op-node/sources/fallback_client.go (resolved)

owen-reorg reviewed Jul 11, 2023

op-node/sources/fallback_client.go (outdated, resolved)

owen-reorg reviewed Jul 11, 2023

op-node/sources/fallback_client.go (outdated, resolved)

owen-reorg reviewed Jul 11, 2023

op-node/sources/fallback_client.go (outdated, resolved)

bnoieh reviewed Jul 11, 2023

op-node/node/client.go (outdated, resolved)

op-node/sources/fallback_client.go (outdated, resolved)

op-node/sources/fallback_client.go (resolved)

op-node/node/node.go (outdated, resolved)

bnoieh reviewed Jul 11, 2023

op-node/sources/fallback_client.go (outdated, resolved)

op-node/sources/fallback_client.go (outdated, resolved)

Welkin added 2 commits July 12, 2023 16:35

Put the switching logic into goroutine and modify the code according …

94046e2

…to the comments

add metrics and don't switch url when error is Rpc.Error

6c028e8

welkin22 requested review from owen-reorg and bnoieh July 13, 2023 02:26

Welkin added 2 commits July 13, 2023 10:32

use const to remove magic number

217de47

fix NoopTxMetrics

d4db1b8

bnoieh reviewed Jul 13, 2023


Welkin added 4 commits July 13, 2023 21:39

add TestL1FallbackClient_SwitchUrl e2e case

503c061

should be >= threshold

7cc8f0a

change threshold to 20

443aeb3

log->logT

f3e9110

github-actions bot added the Stale label Jul 29, 2023

welkin22 removed the Stale label Jul 31, 2023

github-actions bot added the Stale label Aug 15, 2023

owen-reorg removed the Stale label Aug 15, 2023

github-actions bot added the Stale label Aug 30, 2023

github-actions bot closed this Sep 4, 2023

welkin22 reopened this Sep 19, 2023

welkin22 removed the Stale label Sep 19, 2023

owen-reorg changed the base branch from release/testnet to develop September 20, 2023 17:12

Merge branch 'develop' into l1-fallback-client

327f202

# Conflicts: # assets/testnet/genesis.json # op-batcher/metrics/metrics.go # op-node/metrics/metrics.go

welkin22 closed this Sep 21, 2023
