sds: keep warming when dynamically inserted cluster can't extract secret entity #13894
Conversation
@mattklein123 this is ready for review. @rgs1 friendly ping
A few comments/questions to get started. Thank you!
/wait
ENVOY_LOG(warn, "Failed to activate {}", cluster.info()->name());
return;
How does the cluster ever activate? I think this function never gets called again?
But how can the cluster ever leave warming? It needs to get updated again by the management server?
Right, until #13777 (that's the TODO in the previous comment, but I pasted the wrong one).
But stepping back, what does this change give us? If the cluster stays in warming and the server starts up, we will still get 503s? Or if a cluster gets stuck in warming even if the user sets a timeout, it seems like it would be extremely difficult to debug this?
What is the exact scenario we are trying to protect against? I assume the server came up and SDS is not ready during a CDS update? Because I don't think this would block initial server startup?
OK that's fine, but I don't really understand what the implications of this are and whether it's going to cause more confusing behavior that is also hard to debug. Can you add more comments and we can discuss? Do we need stats for this also? I just don't know what we are dealing with here honestly.
IMO this won't cause more confusing behavior. I agree the current behavior is still confusing, because a cluster whose SSL isn't ready will fail when it gets traffic. What this PR fixes is that Envoy won't advertise itself as ready in that case.
The runtime flag is off by default and will be removed when we have a long-term fix. I don't think we need a stat for this, given that we have a warn-level log and the flag is off by default.
OK can you add more comments and we can go from there? I think we are going in circles as I'm unclear on what the old/new behavior is. :)
@mattklein123 reworded the comments.
The other issue, which I am not sure if it is addressed here or another PR, is that occasionally we see that Envoy fails to actually send an SDS request to the control plane, so the cluster never actually gets ready. This is exacerbated by the previous issue, because not only will we never be able to handle (tls) traffic, we will also mark ourselves as ready.
@howardjohn I have seen such an issue as well. Was this root caused to a bug in envoy?
FYI tested this internally, no issues thus far. cc: @lizan
@rgs1 thanks for confirming. I think the previous commit didn't guard the change correctly (it checked transport socket factory readiness regardless of the runtime flag).
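(A minimal, self-contained sketch of the guarding shape being described; the flag name and helper functions are illustrative stand-ins, not the actual Envoy code in this PR.)

#include <iostream>
#include <string>

// Stand-ins for the real Envoy pieces; both names here are illustrative assumptions.
static bool runtimeFeatureEnabled(const std::string& /*flag*/) { return false; } // off by default
static bool transportSocketSecretsReady(const std::string& /*cluster*/) { return false; }

// Intended shape: the runtime flag gates the whole readiness check, so with the
// flag off (the default) the cluster activates exactly as before.
bool shouldKeepWarming(const std::string& cluster) {
  return runtimeFeatureEnabled("envoy.reloadable_features.<hypothetical_flag>") &&
         !transportSocketSecretsReady(cluster);
}

int main() {
  // Prints "false": with the flag off, readiness is never consulted and behavior is unchanged.
  std::cout << std::boolalpha << shouldKeepWarming("example_cluster") << std::endl;
}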
Thanks, one question.
/wait
// TODO(lizan): #13777/#13952 In the long term we want to fix this behavior with the init manager
// to keep clusters in the warming state until Envoy gets the SDS response.
ENVOY_LOG(warn, "Failed to activate {} due to no secret entity", cluster.info()->name());
return;
I still don't understand how this blocks initialization. Unless I am missing something, the ClusterManagerInitHelper will still complete initialization with this early return. This will cause workers to start, etc., leading to 503s. If this is not the case can you update the comments?
This will block warming -> active in a subsequent CDS update, which is marginally better, but I don't think it fixes the problem of server init?
Adding the additional comment from the offline discussion:
Server init decides whether all clusters are initialized by the number of elements in secondary_init_clusters_
and primary_init_clusters_.
This early return doesn't skip removing the cluster from those two lists.
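(A small, self-contained sketch of the bookkeeping described above; it is not Envoy code, and the names only loosely mirror ClusterManagerInitHelper. It shows why the early return does not hold up server init: the cluster is removed from the pending lists either way, and init is considered complete once both lists are empty.)

#include <iostream>
#include <list>
#include <string>

// Toy model of the init bookkeeping; names are illustrative, not Envoy's.
struct InitHelper {
  std::list<std::string> primary_init_clusters_;
  std::list<std::string> secondary_init_clusters_;

  void onClusterInit(const std::string& name, bool has_secret) {
    // The cluster leaves the pending lists in either case...
    primary_init_clusters_.remove(name);
    secondary_init_clusters_.remove(name);
    if (!has_secret) {
      // ...so this early return only skips the warming -> active promotion;
      // it does not keep server init waiting on the cluster.
      std::cout << "keep " << name << " warming (no secret entity)" << std::endl;
      return;
    }
    std::cout << "activate " << name << std::endl;
  }

  bool allClustersInitialized() const {
    return primary_init_clusters_.empty() && secondary_init_clusters_.empty();
  }
};

int main() {
  InitHelper helper;
  helper.secondary_init_clusters_ = {"sds_backed_cluster"};
  helper.onClusterInit("sds_backed_cluster", /*has_secret=*/false);
  // Prints "initialized: true": init completes even though the cluster never became active.
  std::cout << std::boolalpha << "initialized: " << helper.allClustersInitialized() << std::endl;
}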
Right. That's what I thought.
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
Commit Message: fix #11120. This PR depends heavily on #12783. Changed to keep clusters warming if dynamically inserted clusters (when initialization hasn't finished) fail to extract the TLS certificate and certificate validation context. They shouldn't be indicated as ACTIVE clusters.
Additional Description:
Risk Level: Mid
Testing: Unit
Docs Changes:
Release Notes: