
[Backport 17.06] NetworkDB qlen optimization #2379

Merged: 1 commit into moby:bump_17.06 on Jun 25, 2019

Conversation

trapier

@trapier trapier commented May 17, 2019

Backport of #2216 to 17.06. This should be considered a starting point: it behaves as expected, but needs further review from a logic standpoint.

```
git checkout -b 17.06-netdb-qlen-issue origin/bump_17.06
git cherry-pick -sx 5ed38221164e04c78d20f17839aab52fafd7fe88
```

Cherry-Pick Conflicts

| Location | Conflict Reason | Resolution |
| --- | --- | --- |
| networkdb/broadcast.go:113, networkdb/broadcast.go:171 | `node` was removed by #2216. `nDB.config.NodeName` was changed to `nDB.config.NodeID` via #1936 and was then removed in #2216. | "ours". Keeping this message in, since I don't know when and why it became unnecessary. See 5ed3822#r200441648. This decision is amenable to further research into why #2216, which made networkdb more efficient, removed it. |
| networkdb/cluster.go:30 | Order of variables in the patch context does not match, following #2169, which backported #2134. | "both". Keep `rejoinClusterDuration` and `rejoinInterval` as in 17.06 and bring in `maxQueueLenBroadcastOnSync` from #2216. |
| networkdb/delegate.go:277 | `reapEntryInterval` moved to `networkdb.go:Config` in #1957. | Stay with the 17.06 const convention for `reapEntryInterval`. |

Logical Conflicts

| Location | Conflict Reason | Resolution |
| --- | --- | --- |
| networkdb/delegate.go:179 | #2040 is not in 17.06; taking this line would cause `fatal error: sync: Unlock of unlocked RWMutex`. | "ours". There is no need to unlock a lock that 17.06 does not take. Do not take this line from #2216. |
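
For context on that resolution: in Go, calling `Unlock` on a `sync.RWMutex` that is not locked is an unrecoverable fatal runtime error, which is exactly the crash quoted above. The sketch below shows the invariant the resolution preserves; the type and handler names are illustrative, not the actual libnetwork code.

```go
package main

import (
	"fmt"
	"sync"
)

// nDBSketch stands in for NetworkDB's embedded RWMutex. The rule the
// conflict resolution preserves: every Unlock must pair with exactly one
// prior Lock on the same code path. Taking #2216's Unlock into 17.06,
// which never takes the corresponding Lock at delegate.go:179, would
// trigger "fatal error: sync: Unlock of unlocked RWMutex" at runtime.
type nDBSketch struct {
	sync.RWMutex
	entries map[string]string
}

// handleTableEvent shows the balanced lock/unlock pattern.
func (db *nDBSketch) handleTableEvent(key, value string) {
	db.Lock()
	defer db.Unlock() // paired with the Lock above
	db.entries[key] = value
}

func main() {
	db := &nDBSketch{entries: map[string]string{}}
	db.handleTableEvent("ep1", "10.0.0.2")
	fmt.Println(db.entries["ep1"])
}
```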

How I tested

  1. Vendor this PR onto github.com/docker/docker-ee tag v17.06.2-ee-21 and change the networkdb StatsPrintPeriod from 5 minutes to 15 seconds to improve observability in the test.

  2. In a separate build, backport the networkdb diagnostic API to libnetwork 9a47f8d and docker-ee cd8d740be0c048f156438d21975af753ac1b5225 (v17.06.2-ee-20). Use this build to stress networkdb in steps 6 and 8.

  3. Bring up a 31-node dind cluster (one manager, 30 workers) with the build from step 1.

  4. Join one worker on the 17.06 build from step 2, which has the networkdb diagnostic API.

  5. Create an overlay network and deploy a global-mode service to it.

  6. Use the diagnostic API to introduce a "static load" of 1400 additional entries, half in endpoint_table, and half in overlay_peer_table on the test network.

  7. Wait for test network qLen to reach 0 on all nodes according to NetworkDB stats in dockerd logs.

  8. Use the diagnostic node to introduce a "dynamic load" by adding and 0.5s later removing an endpoint_table entry, doing this three times a second continuously. This seems to be about the max rate the cluster can sustain under otherwise quiet conditions before qLen begins to grow without bound.

  9. Restart one of the non-diagnostic workers, and observe its qLen. Prior to this PR, qLen on the joining node initially rises to the total number of entries, then steadily increases from there over minutes. With this PR qLen on the joining node matches that of other nodes on the first sample after rejoining (within 15s StatsPrintPeriod).

Test tooling

Scripts developed for testing this PR are provided below for reference. They probably won't work flawlessly in your environment.

test.sh.txt brings up the dind cluster. Provide the argument `down` to remove the nodes.

stressndb.sh.txt requires an argument: either `static` to introduce the 1400 entries or `dynamic` to produce the load noted in steps 6 and 8 above, respectively.

Added some optimizations to reduce the messages in the queue:
1) On joining a network, the node executes a TCP sync with all the nodes
that it is aware are part of that network. Previously, the node
redistributed all the entries during this time. This meant that if the
network had 10K entries, the queue of the joining node would jump to 10K.
The fix adds a flag on the network that avoids inserting any entry into
the queue until the sync happens. Note that right now the flag is set in
a best-effort way; there is no real check that at least one of the nodes
succeeded.
2) Limit the number of messages to redistribute coming from a TCP sync.
Introduced a threshold that limits the number of messages that are
propagated; this disables the optimization in case of heavy load.

Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>
(cherry picked from commit 5ed3822)
Signed-off-by: Trapier Marshall <trapier.marshall@docker.com>
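
The two optimizations in the commit message can be sketched as follows. This is a simplified illustration, not the actual libnetwork code: the `inSync` and `queue` fields and the flow around them are my own, with only `maxQueueLenBroadcastOnSync` taken from the PR's vocabulary.

```go
package main

import "fmt"

// maxQueueLenBroadcastOnSync caps how many entries may sit in the queue
// before entries learned through a TCP bulk sync stop being re-queued for
// gossip (optimization 2); the exact threshold value here is assumed.
const maxQueueLenBroadcastOnSync = 500

// networkSketch models the per-network state the fix touches.
type networkSketch struct {
	inSync bool     // set (best effort) while the joining node's bulk sync runs
	queue  []string // stand-in for the memberlist broadcast queue
}

// enqueueBroadcast shows optimization 1: while the initial sync is in
// progress, entries learned from it are not re-broadcast, so a 10K-entry
// network no longer pushes the joining node's queue to 10K.
func (n *networkSketch) enqueueBroadcast(entry string) {
	if n.inSync {
		return // peers already have this entry; skip gossip
	}
	if len(n.queue) >= maxQueueLenBroadcastOnSync {
		return // optimization 2: threshold reached, stop re-propagating
	}
	n.queue = append(n.queue, entry)
}

func main() {
	n := &networkSketch{inSync: true}
	n.enqueueBroadcast("entry-from-sync") // dropped: sync in progress
	n.inSync = false
	n.enqueueBroadcast("entry-after-sync") // queued normally
	fmt.Println(len(n.queue))
}
```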
@thaJeztah
Member

Haven't checked all the details, but would it be better to include some of the additional commits that caused the conflicts (e.g. include #1936 so that the cherry-pick will be clean)?

@trapier
Author

trapier commented May 24, 2019

Don't have enough experience with this code base to know what octopuses lurk behind the intervening PRs. I suppose it would make sense to start down that path and see what kind of hair starts growing. Should I do that on a separate PR so we can track progress and compare the results?

@thaJeztah
Member

😂 yes, if you could give it a quick go and push it separately, that'd be useful.

I do understand the concerns, and yes, we should check carefully that there's nothing in the cherry-picks that we don't want; it's just that having fewer modifications in the branch could help future cherry-picks / backports.

Contributor

@arkodg arkodg left a comment


LGTM

@arkodg
Contributor

arkodg commented Jun 21, 2019

PTAL @euanh

@euanh
Contributor

euanh commented Jun 24, 2019

@thaJeztah, @trapier Did you come to any conclusions about whether it's worthwhile to backport the intermediate commits to make future merges easier?

@trapier
Author

trapier commented Jun 24, 2019

I have not followed through on my suggestion to take another branch and investigate the cost of taking the interim PRs. Other accountabilities currently preclude my proceeding with this investigation.

For whatever it's worth, my sense is that there isn't currently, and before its EOL won't be, great demand for additional backports to 17.06. EE Engine 17.06 is currently scheduled for support EOL on 2020-04-16.

@euanh euanh merged commit c4e81a0 into moby:bump_17.06 Jun 25, 2019