-
Notifications
You must be signed in to change notification settings - Fork 880
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix overlay vxlan races #2146
Fix overlay vxlan races #2146
Conversation
CI Failed |
Codecov Report
@@ Coverage Diff @@
## master #2146 +/- ##
=========================================
Coverage ? 40.48%
=========================================
Files ? 139
Lines ? 22491
Branches ? 0
=========================================
Hits ? 9105
Misses ? 12048
Partials ? 1338
Continue to review full report at Codecov.
|
drivers/overlay/ov_network.go
Outdated
n.joinCnt++ | ||
} | ||
func (n *network) joinSandbox(s *subnet, restore bool, incJoinCnt bool) error { | ||
networkOnce.Do(networkOnceInit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if 2 joinSanbox are happening, the first one will trigger this networkOnceInit
, while the second one will continue forward acquiring the lock. Is it ok for the second one to go ahead also before the first networkOnceInit
is completed?
If I look at the code I guess this will create a potential race with the n.initSandbox(restore)
that actually if restore is false
will try to delete the VNI
Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked on this very early on... once.Do() holds a mutex while doing the initialization. If two goroutines execute the same once.Do() concurrently, the second one will block until the first one completes.
}) | ||
return s.initErr | ||
subnetErr := s.initErr | ||
if !s.sboxInit { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this dead code?
if at line 304 was false, will be set to true
if it was already true then won't enter here anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is s.sboxInit
not n.sboxInit
... So no, this is different. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh...
drivers/overlay/ov_network.go
Outdated
// failure of vxlan device creation if the vni is assigned to some other | ||
// network. | ||
if deleteErr := deleteInterface(vxlanName); deleteErr != nil { | ||
logrus.Warnf("could not delete vxlan interface, %s, error %v, after config error, %v\n", vxlanName, deleteErr, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can remove the trailing \n
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, yes, will do.
s.sboxInit = true | ||
} | ||
} | ||
if subnetErr != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We declare subnetErr inside if !s.sboxInit { . and again using it for outside the if scope. Can we declare outside of if and make the scope to function level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. That's definitely a bug. It is not supposed to be redeclared within the 'if'.
0026dc2
to
5b8a15a
Compare
Pushed an update with a much simplified locking scheme. There is now only a single lock for each network and all its subnets. I originally put in the second lock because of a deadlock (which I had hit during testing before the initial PR) described in the commit log. This new version breaks the deadlock by posting the notification to initialize the peerDB on a join using a fresh goroutine. This prevents the 'join' from deadlocking waiting on the channel of the peerDB goroutine while the peerdb goroutine is waiting for the 'join' to release the network lock. This version also addresses the comments by @fcrisciani (newlines) and @selansen (redeclaration of subnetErr). |
drivers/overlay/ov_network.go
Outdated
@@ -296,29 +294,39 @@ func (d *driver) RevokeExternalConnectivity(nid, eid string) error { | |||
return nil | |||
} | |||
|
|||
func (n *network) incEndpointCount() { | |||
func (n *network) joinSandbox(s *subnet, restore bool, incJoinCount bool) error { | |||
networkOnce.Do(networkOnceInit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you keep this comment:
// If there is a race between two go routines here only one will win
// the other will wait.
s.initErr = n.initSubnetSandbox(s, restore) | ||
}) | ||
return s.initErr | ||
subnetErr := s.initErr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can simply avoid using this subnetErr, and just leverage the s.initErr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also for line 321
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, there is a very specific reason why it isn't. We want to return:
s.initErr
iff it was previously set- otherwise we want to return the result of
n.initSubnetSandbox(s, restore)
regardless of whether that value gets stored ins.initErr
. So, on line 321 we do not want to just returns.initErr
. It may not have been set if there was a failure but we are not in arestore
case.
This is the semantics from the #1800 and I've preserved them. The logic basically implies that if an error is recoverable, return it so it is flagged appropriately, but to make it persistent. But if the error is not recoverable (restore case), make it persistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we expect that the error failure will be different? on multiple retry?
drivers/overlay/ov_network.go
Outdated
@@ -470,7 +477,7 @@ func (n *network) generateVxlanName(s *subnet) string { | |||
id = n.id[:5] | |||
} | |||
|
|||
return "vx-" + fmt.Sprintf("%06x", n.vxlanID(s)) + "-" + id | |||
return "vx-" + fmt.Sprintf("%06x", s.vni) + "-" + id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
considering that we are touching this line, this will be more efficient:
fmt.Sprintf("vx-%06x-%v", s.vni,id)
drivers/overlay/ov_network.go
Outdated
@@ -483,7 +490,7 @@ func (n *network) generateBridgeName(s *subnet) string { | |||
} | |||
|
|||
func (n *network) getBridgeNamePrefix(s *subnet) string { | |||
return "ov-" + fmt.Sprintf("%06x", n.vxlanID(s)) | |||
return "ov-" + fmt.Sprintf("%06x", s.vni) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here just with a single Sprintf
|
||
n.setVxlanID(s, 0) | ||
for _, vni := range vnis { | ||
n.driver.vxlanIdm.Release(uint64(vni)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need extra for loop instead of using previous logic where its all done inside one place. What are we trying to achieve by doing this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair question. The vxlanIdm.Release() operation can do locking and write to store while other overlay networks are coming and going as well. So, I was thinking that this prevents holding the network lock through all those other potentially blocking operations. (i.e. collect the work list using the lock and then go through the process of releasing them) But since this function only gets called while the driver is holding the driver lock, that may be a superfluous optimization.
} | ||
|
||
return vnis, nil | ||
} | ||
|
||
func (n *network) obtainVxlanID(s *subnet) error { | ||
//return if the subnet already has a vxlan id assigned | ||
if s.vni != 0 { | ||
if n.vxlanID(s) != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other places you replaced n.vxlanID(s) with s.vni or n.vni. Here we are using n.vxlanID(s) instead of directly accessing with s.vni . Trying to understand why this change is required ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also fair. The revamped locking holds the network lock through network.join() and network.leave() operations. As such, calling n.vxlanID(s)
within those methods (or their sub-functions) would be a double lock. That's the reasons for the changes above.
However, the changes above also remove the subnet locks and place subnet locking all within the domain of the network's lock. As such, network
methods need to acquire the network lock before accessing any of the subnet fields. This driver calls obtainVxlanID() outside of the context of network.join()
in several places. Hence accessing the subnet members requires doing so through the network lock.
The lack of locking in obtainVxlanId() was, in principle, a race condition before that just wasn't addressed.
@@ -1059,7 +1067,7 @@ func (n *network) obtainVxlanID(s *subnet) error { | |||
return fmt.Errorf("getting network %q from datastore failed %v", n.id, err) | |||
} | |||
|
|||
if s.vni == 0 { | |||
if n.vxlanID(s) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
drivers/overlay/ov_network.go
Outdated
return s.vni | ||
} | ||
|
||
func (n *network) setVxlanID(s *subnet, vni uint32) { | ||
n.Lock() | ||
defer n.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the defer is just slower, in this case for a 3 lines methods, I would keep it as before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few minor comments, rest looks good
LGTM |
5b8a15a
to
26212fe
Compare
Addressed the comment, defer and sprintf comments above and have re-pushed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@ctelfer are we good with merging this? |
Signed-off-by: Santhosh Manohar <santhosh@docker.com> Signed-off-by: Chris Telfer <ctelfer@docker.com>
Instead of using "sync.Once" to determine whether to initialize a network sandbox or subnet sandbox, we use a traditional mutex + initialization boolean. This is because the initialization state isn't truly a once-and-done condition. Rather, libnetwork destroys network and subnet sandboxes when the last endpoint leaves them. The use of sync.Once in this kind of scenario requires, therefore, re-initializing the Once which is impoissible. So the approach that libnetwork currently takes is to use a pointer to a Once and redirect that pointer to a new Once on reset. This leads to nasty race conditions. In addition to refactoring the locking, this patch merges the functions joinSandbox(), and joinSubnetSandbox(). This makes the code both cleaner and it also holds the network and subnet locks through the series of read-modify-writes avoiding further potential races. This does reduce the potential parallelism which could be applied should there be many joins coming in on many different subnets in the same overlay network. However, this should be an extremely minor performance hit for a very obscure case. One important pattern in this commit is that it is crucial to avoid sending peerDB messages while holding a driver or network lock. The changes herein defer such (asynchronous) notifications until after release of such locks. This prevents deadlocks where the peerDB blocks acquiring said locks while the network method blocks trying to send to the peerDB's channel. Signed-off-by: Chris Telfer <ctelfer@docker.com>
26212fe
to
5d113d1
Compare
Rebased to head of master, updated a few comments and re-tested a few times to make sure this looks good. Everything has been clean. I'll admit that I'm still a bit nervous after #2143 / #2180. But from a rational standpoint, what this PR does is remove the unsafe re-onceing from the overlay code, make joins and leaves to the overlay sandboxes atomic, and merge in #1800 error handling code. It should be safe to merge. |
@fcrisciani do you know which Docker version will include this merge (vendors this libnetwork hash)? We are running 18.03 and are regularly experiencing the issue that the vxlan file is already existing.
|
This function takes the patch offered in #1800 and attempts to address the issues found in #1765. It removes a race condition documented in the issue where "re-once"ing the creation of an overlay sandbox can race with incoming join requests to cause a leak/collision with the vxlan interface. This PR addresses said issue by removing the "re-once" pattern and replacing it with a traditional mutex+boolean initializer pattern. This also allows removing some restore error handling by having the sandbox join/leave perform proper cleanup on error. More information is available in the commit logs of the PR.