NetworkDB qlen optimization #2216
Conversation
Allow writing and deleting X number of entries. Allow querying the queue length. Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>
Added some optimizations to reduce the messages in the queue:

1) On network join, the node executes a TCP sync with all the nodes that it is aware are part of the specific network. Before this change, the node was redistributing all the entries during that time, which meant that if the network had 10K entries, the queue of the joining node would jump to 10K. The fix adds a flag on the network that avoids inserting any entry into the queue until the sync happens. Note that right now the flag is set in a best-effort way; there is no real check that at least one of the nodes succeeded.

2) Limit the number of messages to redistribute coming from a TCP sync. Introduced a threshold that limits the number of messages that are propagated; this will disable the re-propagation in case of heavy load.

Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>
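As a minimal sketch of the second point, assuming a hypothetical threshold constant and a `queueEntry` callback (neither name is taken from the actual libnetwork code), the cap could look roughly like this:

```go
package sketch

// maxQueuedFromBulkSync is a hypothetical threshold, not the actual
// constant used by NetworkDB.
const maxQueuedFromBulkSync = 500

// requeueFromBulkSync re-broadcasts at most maxQueuedFromBulkSync of the
// entries received via the TCP sync; when the sync delivers more than the
// threshold (heavy load), the remaining entries are not re-propagated and
// are left to the bulk/periodic sync machinery.
func requeueFromBulkSync(entries [][]byte, queueEntry func([]byte)) {
	for i, e := range entries {
		if i >= maxQueuedFromBulkSync {
			break
		}
		queueEntry(e)
	}
}
```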
ping @ctelfer @dani-docker PTAL
case "lt": | ||
if node.result > size { | ||
log.Fatalf("Expected queue size from %s to be %d < %d", node.id, node.result, size) | ||
} |
I understand what this is doing, but not why. Why is an assertion necessary for what looks like a query? This command computes an average queue size. Why should it fail if the size of one of the nodes' queues is larger or smaller than some threshold, and moreover, why is it necessary that such a test be present?
Took the comments offline. The short answer is that this is to understand if any of the nodes has a bigger queue on the testing infra.
Ah, I understand now that this is a test, not an information gathering function despite its debug output at the end. Please disregard this comment and the one below.
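For context, here is a rough, purely illustrative version of the check being discussed: each node reports its queue length, the test prints the average, and the run fails if any single node violates the requested comparison. The struct and helper names are hypothetical, not the actual test code.

```go
package dbtest

import "log"

// queueQueryResult mirrors the shape of the per-node result used in the
// snippet above; the struct and field names here are illustrative.
type queueQueryResult struct {
	id     string
	result int // queue length reported by the node
}

// checkQueueSizes sums the reported queue lengths to print an average, and
// aborts the run if any single node violates the requested comparison.
func checkQueueSizes(nodes []queueQueryResult, comparison string, size int) {
	total := 0
	for _, node := range nodes {
		total += node.result
		switch comparison {
		case "lt": // every node's queue must stay below the threshold
			if node.result > size {
				log.Fatalf("Expected queue size from %s to be %d < %d", node.id, node.result, size)
			}
		case "gt": // every node's queue must exceed the threshold
			if node.result < size {
				log.Fatalf("Expected queue size from %s to be %d > %d", node.id, node.result, size)
			}
		}
	}
	if len(nodes) > 0 {
		log.Printf("Average queue size: %d", total/len(nodes))
	}
}
```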
// Enable watch of tables from clients
for i := 0; i < parallelWriters; i++ {
	go clientWatchTable(ips[i], servicePort, networkName, tableName, doneCh)
}
Something seems wrong here. What happens if the # of parallel writers is less than the number of IPs specified in the command? Seems like some nodes would go "unwatched". More importantly, what happens if the number of parallel watchers is greater than the number of IPs? Seems like it will panic!
Ok. So @fcrisciani explained that it was intended to allow writing with fewer than the full # of nodes in order to ensure that the writes get propagated. Having more parallel writers than nodes would, of course, cause a panic. However, it was designed for use in more constrained (testing) environments.
key := "key-" + strconv.Itoa(i) + "-" | ||
logrus.Infof("Spawn worker: %d on IP:%s", i, ips[i]) | ||
go deleteKeysNumber(ips[i], servicePort, networkName, tableName, key, numberOfKeys, doneCh) | ||
} |
Ditto here. I'm probably not understanding how this is supposed to be invoked, but from the code it seems like parallelWriters should just be derived from len(ips).
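A small sketch of the kind of guard being suggested here (purely illustrative, not the actual test code): refuse a worker count larger than the available IPs, which would otherwise panic on `ips[i]`, and default to one worker per node.

```go
package dbtest

import "fmt"

// validateWorkerCount is an illustrative guard for the spawning loops above:
// it rejects a worker count that exceeds the available IPs (which would
// otherwise panic with an index out of range) and defaults to one worker
// per node when no count is given.
func validateWorkerCount(parallelWriters int, ips []string) (int, error) {
	if parallelWriters > len(ips) {
		return 0, fmt.Errorf("parallelWriters (%d) exceeds available IPs (%d)", parallelWriters, len(ips))
	}
	if parallelWriters <= 0 {
		parallelWriters = len(ips)
	}
	return parallelWriters, nil
}
```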
@@ -168,7 +167,6 @@ func (nDB *NetworkDB) sendTableEvent(event TableEvent_Type, nid string, tname st
 	id:    nid,
 	tname: tname,
 	key:   key,
-	node:  nDB.config.NodeID,
Just to confirm: the node attribute is just redundant, I don't see us reading it anywhere. Is this the reason for removing it? And slightly reducing the message size?
Yes, correct: today it has no use, so I just removed it. This header is also not sent out, so this saves memory locally, not in the messages sent.
👍
LGTM
// note this is a best effort, we are not checking the result of the bulk sync
nDB.Lock()
n.inSync = true
nDB.Unlock()
So what will happen if we get a false positive here, in other words one node in `bulkSync` failed and the others succeeded, causing the `err` to be nil? Do we just count on the periodic sync? Would this affect convergence time?
This flag is mainly there to reduce the number of messages sent around during the network join. The current logic tries to improve convergence by also re-advertising the messages that come from the TCP sync. Before this change, if the table had 10K entries the node started with 10K messages in the queue, because it was trying to propagate all the entries it came to know about during the join-network TCP sync. This change instead makes sure that during the first TCP sync no entries are resent around. This way a node joins with close to 0 entries in the queue and its full state synced.

Now, to answer your question: if the bulkSync succeeds with at least 1 node, that will be enough to have the internal state "enough in sync" in terms of entries, because in theory all the nodes on the same network hold the same content. The difference will only be in the entries still "in flight", but their number should be negligible compared with the number of entries coming from the bulk sync.

This flag does not really affect the convergence logic, because this node anyway starts with an empty table and an empty queue when it joins, so it was already not helping propagate the messages around.
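To make the mechanism concrete, here is a rough sketch of how such a gate could look. The type, field, and function names are illustrative, not the exact NetworkDB implementation: table events for a network are only queued for gossip once the network is marked in sync, and the marking itself is best effort after the join-time bulk sync, as in the snippet shown earlier in this thread.

```go
package sketch

import "sync"

// netState is a stand-in for NetworkDB's per-network state; only the field
// relevant to the gating discussed above is shown.
type netState struct {
	inSync bool // set (best effort) once the join-time TCP sync has run
}

type netDB struct {
	sync.Mutex
	networks map[string]*netState
}

// queueTableEvent enqueues a gossip message only once the network has been
// marked in sync; entries learned before that point are expected to reach
// peers through the bulk sync itself or the periodic sync.
func (db *netDB) queueTableEvent(nid string, msg []byte, broadcast func([]byte)) {
	db.Lock()
	n, ok := db.networks[nid]
	inSync := ok && n.inSync
	db.Unlock()
	if !inSync {
		return
	}
	broadcast(msg)
}

// markInSync is called after the join-time bulk sync without checking its
// result (best effort).
func (db *netDB) markInSync(nid string) {
	db.Lock()
	if n, ok := db.networks[nid]; ok {
		n.inSync = true
	}
	db.Unlock()
}
```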
let me know if I addressed the question properly
yes, that explains it!
LGTM 👍
Added some optimizations to reduce the messages in the queue. The other commit is mainly testing infrastructure to expose the queue length and to allow writing a fixed number of entries.