Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix panic in drivers/overlay/encryption.go #2462

Merged
merged 1 commit into from
Nov 15, 2019

Conversation

arkodg
Copy link
Contributor

@arkodg arkodg commented Oct 8, 2019

Issue - "index out of range" panic in drivers/overlay/encryption.go:539
due to a mismatch in indices between curKeys and spis due to
case where updateKeys might bail out due to an error and
not update the spis

Context - This issue can be hit when a worker node gets disconnected from the manager, for more than 1 key rotation interval (12h), that in itself its something to worry about, and even though the workloads get shifted to a new node, once the worker node reconnects, due to the issue mentioned above, causes dockerd to panic

Fix - Reconfigure keys when there is a key update failure

Signed-off-by: Arko Dasgupta arko.dasgupta@docker.com

@arkodg
Copy link
Contributor Author

arkodg commented Oct 8, 2019

@dani-docker this is part 1 of the solution.
We need to eventually send a KeyConfiguration Notification similar to
https://github.com/docker/libnetwork/blob/d8d4c8cf03d7d036a76d5470553cd8753e522a99/agent.go#L329
when the agent reconnects so that the decryption issue can be resolved in 1 key rotation interval and not 3 :)

rIP := net.ParseIP(rIPs)
return updateNodeKey(lIP, aIP, rIP, spis, d.keys, newIdx, priIdx, delIdx), false
})

if (newKey != nil && newIdx == -1) ||
(primary != nil && priIdx == -1) ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If my understanding is correct, there is mismatch between driver.key and driver.secMap.node[].spi when an error happens here. The driver.key is already appended in line 466, and that is only change to the driver state.

I wonder if it is better to revert driver.key state when there is error here? That way driver.key and driver.secMap.node[[].spi are kept consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@suwang48404 if we revert the append we will not be able to find the key in the slice when we get an update to delete the key, resulting in a similar error :)

Issue - "index out of range" panic in drivers/overlay/encryption.go:539
due to a mismatch in indices between curKeys and spis due to
case where updateKeys might bail out due to an error and
not update the spis

Fix - Reconfigure keys when there is a key update failure

Signed-off-by: Arko Dasgupta <arko.dasgupta@docker.com>
@arkodg arkodg force-pushed the fix-key-spi-panic branch from 98623d5 to 4420ee9 Compare October 31, 2019 20:25
@arkodg
Copy link
Contributor Author

arkodg commented Oct 31, 2019

@suwang48404 took your advice on reconfiguring keys in case of an update failure reducing the down time by 24hr (2 x 12h rotation intervals )

@arkodg
Copy link
Contributor Author

arkodg commented Oct 31, 2019

Set KeyRotationInterval to 20s
Debug Logs

DEBU[2019-10-31T18:19:19.488287400Z] Adding key 5a857                             
DEBU[2019-10-31T18:19:19.488378800Z] Primary Key f6140                            
DEBU[2019-10-31T18:19:19.488410400Z] Remove Key e0599                             
DEBU[2019-10-31T18:19:19.488448600Z] Updating Keys. New: (key: 780b0, tag: 0xaf42), Primary: (key: 42428, tag: 0xaf40), Pruned: (key: 20839, tag: 0xaf3c) 
DEBU[2019-10-31T18:19:19.488518200Z] Current: [(key: 41474, tag: 0xaf38) (key: 57349, tag: 0xaf36) (key: 42428, tag: 0xaf40)] 
WARN[2019-10-31T18:19:19.488691700Z] Failed to update datapath keys in driver overlay: cannot find proper key indices while processing key update:(newIdx,priIdx,delIdx):(3, 2, -1) 
WARN[2019-10-31T18:19:19.488766000Z] Reconfiguring datapath keys for  overlay     
DEBU[2019-10-31T18:19:19.488930100Z] Initial encryption keys: [(key: 42428, tag: 0xaf40) (key: 70624, tag: 0xaf3e) (key: 780b0, tag: 0xaf42)] 

Subsequent Logs

DEBU[2019-10-31T18:19:39.454300900Z] Adding key 61b2c                             
DEBU[2019-10-31T18:19:39.454371400Z] Primary Key 5a857                            
DEBU[2019-10-31T18:19:39.454716600Z] Remove Key c2839                             
DEBU[2019-10-31T18:19:39.454791000Z] Updating Keys. New: (key: 2aa25, tag: 0xaf44), Primary: (key: 780b0, tag: 0xaf42), Pruned: (key: 70624, tag: 0xaf3e) 
DEBU[2019-10-31T18:19:39.454874200Z] Current: [(key: 42428, tag: 0xaf40) (key: 70624, tag: 0xaf3e) (key: 780b0, tag: 0xaf42)] 
DEBU[2019-10-31T18:19:39.455003400Z] Updated: [(key: 780b0, tag: 0xaf42) (key: 42428, tag: 0xaf40) (key: 2aa25, tag: 0xaf44)] 

Copy link
Contributor

@suwang48404 suwang48404 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@arkodg
Copy link
Contributor Author

arkodg commented Oct 31, 2019

PTAL @dani-docker @selansen @euanh

@dani-docker
Copy link
Contributor

dani-docker commented Nov 1, 2019

@arkodg
So what happened to the SPIs when you use the newly configured key as "current" when updateNodeKey is called?

I can see how this takes care of the panic, but what about functionality?

@arkodg
Copy link
Contributor Author

arkodg commented Nov 1, 2019

@dani-docker updateNodeKey is called which fails since it cannot find proper key indices . This triggers the new code to be executed and the driver receive this event https://github.com/docker/libnetwork/blob/4420ee92f5b3b951f98a36b2bc8144a19b560a22/drivers/overlay/overlay.go#L340 and calls setKeys which resets the spis https://github.com/docker/libnetwork/blob/1f28166bb386cf9223d2d00a28382b0e474be314/drivers/overlay/encryption.go#L440 and checkEncryption will rebuild the spis if there is a new service that lands on that node

@dani-docker
Copy link
Contributor

dani-docker commented Nov 4, 2019

@arkodg
I am not referring to updateKeys which is the one that triggers your new code to reconfigure and set the current keys, I am referring to updateNodeKey which will be called as part of updateKeys on the next key rotation after your change;
my concern is around programSA and other methods within updateKeys that uses the current keys which no longer matches from updateNodeKey POV...

My other concern, although we got rid of the error, but the encryption will fail for that node until the next rotation, ie: 12 hours, (assuming updateNodeKey I mentioned previously is of no concerns)

@dani-docker
Copy link
Contributor

@arkodg as we discussed offline, I made a build to verify the fix and my concerns.
The good news, there was no panic!
The bad news, we still need to wait for an additional key rotation for things to get back to normal (in addition to the connection resume).
The reason for this is:
1- gossip keys are not fully reset, https://github.com/dani-docker/libnetwork/blob/c7125bd9bdcbcde60f41bde37fd8d99bc679f55d/agent.go#L140-L180

Here are the logs:
Last key exchange before the worker went offline, the networkdb keys are set to:
Note the upcoming key 6fee5

Nov  5 20:02:22 ucpworker dockerd: time="2019-11-05T20:02:22.883279305Z" level=debug msg="Adding key 6fee5"
Nov  5 20:02:22 ucpworker dockerd: time="2019-11-05T20:02:22.883322624Z" level=debug msg="Primary Key c7ba4"
Nov  5 20:02:22 ucpworker dockerd: time="2019-11-05T20:02:22.883338622Z" level=debug msg="Remove Key d0db2"
Nov  5 20:02:22 ucpworker dockerd: time="2019-11-05T20:02:22.883369816Z" level=debug msg="Updating Keys. New: (key: 34af1, tag: 0x8186), Primary: (key: 78ee2, tag: 0x8183), Pruned: (key: 4b6d2, tag: 0x8180)"

First key exchange when the connection resumes:
Note: the upcoming key from the last exchange became primary and the real primary key was not added to the ring (the real pk as you can see from the next exchange is 1f9ab)

Nov  5 20:48:33 ucpworker dockerd: time="2019-11-05T20:48:33.331347736Z" level=debug msg="Adding key eecb5"
Nov  5 20:48:33 ucpworker dockerd: time="2019-11-05T20:48:33.331377063Z" level=debug msg="Primary Key 6fee5"
Nov  5 20:48:33 ucpworker dockerd: time="2019-11-05T20:48:33.331395620Z" level=debug msg="Remove Key 368d7"
Nov  5 20:48:33 ucpworker dockerd: time="2019-11-05T20:48:33.331426101Z" level=debug msg="Updating Keys. New: (key: 5e540, tag: 0x8194), Primary: (key: 4cbbb, tag: 0x818f), Pruned: (key: 78ee2, tag: 0x8183)"
Nov  5 20:48:33 ucpworker dockerd: time="2019-11-05T20:48:33.331378040Z" level=debug msg="(*session).listen" module=node/agent node.id=kh2ki4dnh18521nehq5uktr3y session.id=ninp2zpstnpggwj0cbu493m5i
Nov  5 20:48:33 ucpworker dockerd: time="2019-11-05T20:48:33.331358188Z" level=debug msg="(*session).heartbeat" module=node/agent node.id=kh2ki4dnh18521nehq5uktr3y session.id=ninp2zpstnpggwj0cbu493m5i
Nov  5 20:48:33 ucpworker dockerd: time="2019-11-05T20:48:33.331451713Z" level=debug msg="Current: [(key: 78ee2, tag: 0x8183) (key: 998b7, tag: 0x8182) (key: 34af1, tag: 0x8186)]"
Nov  5 20:48:33 ucpworker dockerd: time="2019-11-05T20:48:33.331556203Z" level=warning msg="Failed to update datapath keys in driver overlay: cannot find proper key indices while processing key update:(newIdx,priIdx,delIdx):(3, -1, 0)"
Nov  5 20:48:33 ucpworker dockerd: time="2019-11-05T20:48:33.331572784Z" level=warning msg="Reconfiguring datapath keys for  overlay"

This causes memberlist decryption errors, example:

Nov  5 20:50:22 ucpworker dockerd: time="2019-11-05T20:50:22.152056354Z" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=10.10.13.43:58728"
Nov  5 20:50:22 ucpworker dockerd: time="2019-11-05T20:50:22.187031029Z" level=debug msg="Calling GET /v1.38/version"
Nov  5 20:50:22 ucpworker dockerd: time="2019-11-05T20:50:22.678311206Z" level=debug msg="memberlist: Initiating push/pull sync with: 10.10.13.43:7946"
Nov  5 20:50:22 ucpworker dockerd: time="2019-11-05T20:50:22.678750982Z" level=debug msg="memberlist: Failed to join 10.10.13.43: No installed keys could decrypt the message"

Finally on the next key exchange after the connection resume, the newly added key became the primary and things get happy

Nov  5 20:50:22 ucpworker dockerd: time="2019-11-05T20:50:22.883350621Z" level=debug msg="Adding key 46f2d"
Nov  5 20:50:22 ucpworker dockerd: time="2019-11-05T20:50:22.883378828Z" level=debug msg="Primary Key eecb5"
Nov  5 20:50:22 ucpworker dockerd: time="2019-11-05T20:50:22.883390212Z" level=debug msg="Remove Key 1f9ab"
Nov  5 20:50:22 ucpworker dockerd: time="2019-11-05T20:50:22.883409505Z" level=debug msg="Updating Keys. New: (key: 5563a, tag: 0x8196), Primary: (key: 5e540, tag: 0x8194), Pruned: (key: 4cbbb, tag: 0x818f)"
Nov  5 20:50:22 ucpworker dockerd: time="2019-11-05T20:50:22.883424131Z" level=debug msg="Current: [(key: 4cbbb, tag: 0x818f) (key: 34af1, tag: 0x8186) (key: 3c095, tag: 0x8191) (key: 5e540, tag: 0x8194)]"

The gossip fix should be simple, just reset the whole networkDB key ring (similar to the approach done to the driver keys)

Finally, I went through the code and I don't see where we reprogram IPSec after the calling clearEncryptionStates , could be when we rejoin the network? but couldn't verify until we fix gossip.

@dani-docker
Copy link
Contributor

Had a chat with @arkodg and we decided to limit the scope of this PR to avoid the panic and eventually (after 1-2 key rotations, ie 12-14 hours) the encryption will resume correctly.

In a different PR we will address the following use cases:

Approving this PR for now

@arkodg
Copy link
Contributor Author

arkodg commented Nov 15, 2019

@dani-docker thanks !

@thaJeztah
Copy link
Member

ping @euanh @suwang48404 PTAL

@suwang48404
Copy link
Contributor

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants