cds: fix blocking a revert of a warming cluster #15269

Merged 2 commits on Mar 5, 2021
1 change: 1 addition & 0 deletions docs/root/version_history/current.rst
@@ -52,6 +52,7 @@ Bug Fixes

* active http health checks: properly handles HTTP/2 GOAWAY frames from the upstream. Previously, a GOAWAY frame sent during a graceful listener drain could cause spurious health-check failures because streams were refused by the upstream on a connection that was going away. To revert to the old GOAWAY handling behavior, set the runtime feature `envoy.reloadable_features.health_check.graceful_goaway_handling` to false.
* buffer: tighten network connection read and write buffer high watermarks in preparation for more careful enforcement of read limits. The buffer high watermark is now set to the exact configured value; previously it was set to the configured value + 1.
* cds: fix blocking the update for a warming cluster when the update is the same as the active version.
* fault injection: stop counting a fault as active once its delay has elapsed. Previously, the fault injection filter continued to count an injected delay as an active fault even after the delay had elapsed. This produced incorrect output statistics and affected the maximum number of consecutive faults allowed (e.g., for long-lived streams). This change decrements the active fault count when the delay fault has finished and is the only active fault.
* filter_chain: fix filter chain matching to compare the server name case-insensitively.
* grpc-web: fix local reply and non-proto-encoded gRPC response handling for small response bodies. This fix can be temporarily reverted by setting `envoy.reloadable_features.grpc_web_fix_non_proto_encoded_response_handling` to false.
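Several of the entries above describe runtime-feature reverts. As a sketch of how such a revert is applied (assuming Envoy's standard `layered_runtime` bootstrap mechanism; the flag name is taken verbatim from the health-check entry above, and the layer name is arbitrary):

```yaml
# Bootstrap fragment: a static runtime layer that reverts the GOAWAY
# handling change to the pre-fix behavior.
layered_runtime:
  layers:
  - name: static_layer_0
    static_layer:
      envoy.reloadable_features.health_check.graceful_goaway_handling: false
```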
12 changes: 8 additions & 4 deletions source/common/upstream/cluster_manager_impl.cc
@@ -617,10 +617,14 @@ bool ClusterManagerImpl::addOrUpdateCluster(const envoy::config::cluster::v3::Cl
const auto existing_active_cluster = active_clusters_.find(cluster_name);
const auto existing_warming_cluster = warming_clusters_.find(cluster_name);
const uint64_t new_hash = MessageUtil::hash(cluster);
-  if ((existing_active_cluster != active_clusters_.end() &&
-       existing_active_cluster->second->blockUpdate(new_hash)) ||
-      (existing_warming_cluster != warming_clusters_.end() &&
-       existing_warming_cluster->second->blockUpdate(new_hash))) {
+  if (existing_warming_cluster != warming_clusters_.end()) {
Member:

@mattklein123 @tbarrella do you think the same issue may exist in LDS, at `// The listener should be updated back to its original state and the warming listener should be`?

It's hard for me to tell, since the logic here is subtle and orders are different.

Contributor Author:

That's from #12645 which looks like it was meant to fix the same issue with LDS. I guess the approach is different; I didn't do it that way because it seemed simpler to take the newest config as a typical update rather than introduce cancellation

Member:

Would it make sense to factor out this pattern? Divergence seems scary :)

Contributor Author:

I hear what you mean in terms of divergence. For this do you mean actual code refactoring (I'm not sure about the ROI here) or just making sure the logic is the same? What are your thoughts on the LDS (cancel the warming listener and block the current update) vs. CDS approach (accept the current update)?

Member:

I don't feel strongly on how we do it as long as we make them consistent and keep them that way (code comments explaining when to update what? Structural is always nicer if it's not too much complexity).

I think whatever is simplest here makes sense, but CC @adisuissa @dmitri-d

+    if (existing_warming_cluster->second->blockUpdate(new_hash)) {
+      return false;
+    }
+    // NB: https://github.com/envoyproxy/envoy/issues/14598
+    // Always proceed if the cluster is different from the existing warming cluster.
+  } else if (existing_active_cluster != active_clusters_.end() &&
+             existing_active_cluster->second->blockUpdate(new_hash)) {
return false;
}
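The reordered check can be read as a pure decision function. Here is a minimal standalone sketch of the rule this hunk implements (the function name and signature are hypothetical, not Envoy's actual API; each hash stands in for `MessageUtil::hash()` of a cluster config):

```cpp
#include <cstdint>
#include <optional>

// Should a proposed cluster config (new_hash) be ignored, given the hashes of
// the currently active and currently warming versions (if any)?
bool shouldBlockUpdate(std::optional<uint64_t> active_hash,
                       std::optional<uint64_t> warming_hash, uint64_t new_hash) {
  if (warming_hash.has_value()) {
    // A warming cluster exists: compare only against it. A revert that matches
    // the *active* version must still proceed (issue #14598).
    return *warming_hash == new_hash;
  }
  // No warming cluster: an update identical to the active version is a no-op.
  return active_hash.has_value() && *active_hash == new_hash;
}
```

Under the pre-fix logic, `shouldBlockUpdate(active, warming, active)` would have returned true, blocking a revert to the active version while a different config was warming; the fix makes the warming hash the only comparison point whenever a warming cluster exists.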

81 changes: 81 additions & 0 deletions test/common/upstream/cluster_manager_impl_test.cc
@@ -1446,6 +1446,87 @@ TEST_F(ClusterManagerImplTest, ModifyWarmingCluster) {
EXPECT_TRUE(Mock::VerifyAndClearExpectations(cluster2.get()));
}

// Regression test for https://github.com/envoyproxy/envoy/issues/14598.
// Make sure the revert isn't blocked due to being the same as the active version.
TEST_F(ClusterManagerImplTest, TestRevertWarmingCluster) {
time_system_.setSystemTime(std::chrono::milliseconds(1234567891234));
create(defaultConfig());

InSequence s;
ReadyWatcher initialized;
EXPECT_CALL(initialized, ready());
cluster_manager_->setInitializedCb([&]() -> void { initialized.ready(); });

const std::string cluster_json1 = defaultStaticClusterJson("cds_cluster");
const std::string cluster_json2 = fmt::sprintf(kDefaultStaticClusterTmpl, "cds_cluster",
R"EOF(
"socket_address": {
"address": "127.0.0.1",
"port_value": 11002
})EOF");

std::shared_ptr<MockClusterMockPrioritySet> cluster1(new NiceMock<MockClusterMockPrioritySet>());
std::shared_ptr<MockClusterMockPrioritySet> cluster2(new NiceMock<MockClusterMockPrioritySet>());
std::shared_ptr<MockClusterMockPrioritySet> cluster3(new NiceMock<MockClusterMockPrioritySet>());
cluster1->info_->name_ = "cds_cluster";
cluster2->info_->name_ = "cds_cluster";
cluster3->info_->name_ = "cds_cluster";

// Initialize version1.
EXPECT_CALL(factory_, clusterFromProto_(_, _, _, _))
.WillOnce(Return(std::make_pair(cluster1, nullptr)));
EXPECT_CALL(*cluster1, initialize(_));
checkStats(0 /*added*/, 0 /*modified*/, 0 /*removed*/, 0 /*active*/, 0 /*warming*/);

cluster_manager_->addOrUpdateCluster(parseClusterFromV3Json(cluster_json1), "version1");
checkStats(1 /*added*/, 0 /*modified*/, 0 /*removed*/, 0 /*active*/, 1 /*warming*/);

cluster1->initialize_callback_();
checkStats(1 /*added*/, 0 /*modified*/, 0 /*removed*/, 1 /*active*/, 0 /*warming*/);

// Start warming version2.
EXPECT_CALL(factory_, clusterFromProto_(_, _, _, _))
.WillOnce(Return(std::make_pair(cluster2, nullptr)));
EXPECT_CALL(*cluster2, initialize(_));
cluster_manager_->addOrUpdateCluster(parseClusterFromV3Json(cluster_json2), "version2");
checkStats(1 /*added*/, 1 /*modified*/, 0 /*removed*/, 1 /*active*/, 1 /*warming*/);

// Start warming version3 instead, which is the same as version1.
EXPECT_CALL(factory_, clusterFromProto_(_, _, _, _))
.WillOnce(Return(std::make_pair(cluster3, nullptr)));
EXPECT_CALL(*cluster3, initialize(_));
cluster_manager_->addOrUpdateCluster(parseClusterFromV3Json(cluster_json1), "version3");
checkStats(1 /*added*/, 2 /*modified*/, 0 /*removed*/, 1 /*active*/, 1 /*warming*/);

// Finish warming version3.
cluster3->initialize_callback_();
checkStats(1 /*added*/, 2 /*modified*/, 0 /*removed*/, 1 /*active*/, 0 /*warming*/);
checkConfigDump(R"EOF(
dynamic_active_clusters:
- version_info: "version3"
cluster:
"@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
name: "cds_cluster"
type: STATIC
connect_timeout: 0.25s
load_assignment:
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: 127.0.0.1
port_value: 11001
last_updated:
seconds: 1234567891
nanos: 234000000
)EOF");

EXPECT_TRUE(Mock::VerifyAndClearExpectations(cluster1.get()));
EXPECT_TRUE(Mock::VerifyAndClearExpectations(cluster2.get()));
EXPECT_TRUE(Mock::VerifyAndClearExpectations(cluster3.get()));
}

// Verify that shutting down the cluster manager destroys warming clusters.
TEST_F(ClusterManagerImplTest, ShutdownWithWarming) {
create(defaultConfig());