Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dns: configuring a basic key value store to persist to disk #17745

Merged
merged 17 commits into from
Aug 25, 2021

Conversation

alyssawilk
Copy link
Contributor

@alyssawilk alyssawilk commented Aug 17, 2021

Risk Level: low (config guarded)
Testing: new unit, integration tests
Docs Changes: n/a (hidden)
Release Notes: n/a
Part of envoyproxy/envoy-mobile#1587

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
envoyproxy/api-shepherds assignee is @htuch
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #17745 was opened by alyssawilk.

see: more, trace.

@alyssawilk
Copy link
Contributor Author

/wait on CI

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this overall feature make sense as part of the DNS extensibility being worked on in #17479?

I.e. we add a caching DNS resolver extension (that maybe wraps another DNS resolver like c-ares or platform specific)?

@mattklein123
Copy link
Member

I.e. we add a caching DNS resolver extension (that maybe wraps another DNS resolver like c-ares or platform specific)?

I think this would potentially be a nice solution to think about it. FWIW we had discussed something similar today for wrapping DNS in order to get DNS specific analytic events. cc @snowp

@alyssawilk
Copy link
Contributor Author

I'm not sure what the ask is for this PR. I don't want this tied to DNS but I think it'd be reasonable to make sure the DNS refactor allows for all the platform resolvers to use the cache.

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@htuch
Copy link
Member

htuch commented Aug 18, 2021

I'm suggesting the DNS resolver itself is a cache.

@htuch
Copy link
Member

htuch commented Aug 18, 2021

i.e. that we build DNS caching inside the DNS resolver framework. I'm not sure why we want to reuse the existing data plane caching framework for this.

Copy link
Contributor

@RyanTheOptimist RyanTheOptimist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small questions to get started...

//
// All keys and values are flushed to a single file as
// [length]\n[key][length]\n[value]
class FileBasedKeyValueStore : public KeyValueStoreBase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a bit surprised to see that the interface here is Envoy::KeyValueStore but the concrete class is Envoy::Extensions::Cache::KeyValueCache. I had expected them to both live in the same namespace, roughly. But maybe this is a typical way to lay things out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, per offline conversation with harvey I moved all the things to a location he prefered, so now it's KeyValue and then KeyValueStore which is at least a substring?

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Copy link
Contributor Author

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI there's an outstanding issue with both clang tidy and docs build, but API should be ready for review. PTAL.

//
// All keys and values are flushed to a single file as
// [length]\n[key][length]\n[value]
class FileBasedKeyValueStore : public KeyValueStoreBase {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, per offline conversation with harvey I moved all the things to a location he prefered, so now it's KeyValue and then KeyValueStore which is at least a substring?

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API looks good modulo comments.


// [#not-implemented-hide:]
// Configuration to flush the DNS cache to long term storage.
key_value.v3.KeyValueStoreConfig persistent_cache_config = 13;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit on naming: given this is referencing an extension point, do we know that there will always be persistence guaranteed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I can't think of a use case where we'd add a key-value store and not persist it. We already have a DNS cache so the point of this is to persist. That said, just because I can't think of a use case it certainly doesn't hurt to future-proof :-)


// The interval at which the key value store should be flushed to long term
// storage.
google.protobuf.Duration flush_interval = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this gets to my earlier question, do we know that all implementations will flush? Or will flushing be a property of a specific key-value story, e.g. the FileBasedKeyValueStoreConfig?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto - current uses all involve persisting but can't hurt to move it to file based and just dup the single value for mobile based.

Copy link
Contributor

@RyanTheOptimist RyanTheOptimist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Copy link
Contributor Author

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at docs pr - must have failed a string somewhere


// [#not-implemented-hide:]
// Configuration to flush the DNS cache to long term storage.
key_value.v3.KeyValueStoreConfig persistent_cache_config = 13;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I can't think of a use case where we'd add a key-value store and not persist it. We already have a DNS cache so the point of this is to persist. That said, just because I can't think of a use case it certainly doesn't hurt to future-proof :-)


// The interval at which the key value store should be flushed to long term
// storage.
google.protobuf.Duration flush_interval = 2;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto - current uses all involve persisting but can't hurt to move it to file based and just dup the single value for mobile based.

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@@ -0,0 +1,27 @@
syntax = "proto3";

package envoy.extensions.common.key_value.file_based.v3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this should be in api/envoy/extensions/key_value/filed_based/v3/config.proto and same package namespace.

The rationale is that the KeyValueConfig is a common extension point declaration reused across protos. The FileBasedKkeyValueStoreConfig is a key_value extension, and all extension classes have a root like api/envoy/extensions/<extension class>.

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm api

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@alyssawilk
Copy link
Contributor Author

argh sorry, force pushed format fixes. Bad @alyssawilk no biscuit

Copy link
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks pretty good, just a few nits and a question on the interaction between cache load and DNS resolution

virtual KeyValueStorePtr createStore(const Protobuf::Message& config,
ProtobufMessage::ValidationVisitor& validation_visitor,
Event::Dispatcher& dispatcher,
Filesystem::Instance& file_system) PURE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just pass ServerFactoryContext instead of specifically passing the file system? Might future proof against some other use cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately we can't - the DNS store is created both from ServerFactoryContext and from the cluster's TransportSocketFactoryContextImpl which have wide overlap but no common base class.
I have that filed in away in my clean up list but I think resolving it is out of scope for this PR

// TypedFactory
ProtobufTypes::MessagePtr createEmptyConfigProto() override {
return ProtobufTypes::MessagePtr{
new envoy::extensions::cache::key_value_cache::v3::FileBasedKeyValueCacheConfig()};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make_unique

// TODO(alyssawilk) change finishResolve to actually use the TTL rather than
// putting 0 here, return the remaining TTL or indicate the result is stale.
response.emplace_back(Network::DnsResponse(address, std::chrono::seconds(0) /* ttl */));
startCacheLoad(key, address->ip()->port());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not that familiar with this code, but it looks like this will kick off a DNS query? And then we immediately resolve it?

If so I wonder if this will cause unforeseen consequences, like having entries being resolved twice (once from cache and once from DNS) and confusing stats (e.g. stat increment for DNS resolution and cache load for same entry)

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it works today (see integration test) but I agree it's not great, hence the TODO :-)

What I want to have happen is be able to differentiate stale cached results from fresh. I do want to immediately kick off a resolve, but also have stale data in there, config for how long we wait for fresh data before we use stale data. I plan to do that in the follow up

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding, would a DNS resolution then overwrite this value once it arrives?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, fresh resolve always overwrites older data (if it's different)
in this case it'd be the same so not trigger any changes.

Comment on lines +1029 to +1041
MockKeyValueStoreFactory factory;
EXPECT_CALL(factory, createEmptyConfigProto()).WillRepeatedly(Invoke([]() {
return std::make_unique<
envoy::extensions::key_value::file_based::v3::FileBasedKeyValueStoreConfig>();
}));
MockKeyValueStore* store{};
EXPECT_CALL(factory, createStore(_, _, _, _)).WillOnce(Invoke([&store]() {
auto ret = std::make_unique<NiceMock<MockKeyValueStore>>();
store = ret.get();
// Make sure there's an attempt to load from the key value store.
EXPECT_CALL(*store, iterate);
return ret;
}));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be implemented on the MockKeyValueStoreFactory itself? Having them in the test case here is a bit distracting from what the test is actually testing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe? Honestly I regularly get bit by expectations in mock timer where I don't understand what's going on, so I'm hesitant to add complexity like that to the mock base classes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure not a big deal, just what I usually see done with mocks. Agree that MockTimer is especially confusing :)

InSequence s;
ASSERT(store != nullptr);

// Make sure the store gets the first insert.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit hard to follow how the below code block verifies anything with the store, maybe make that clearer with comments? (I think it's L1067 but there is a lot of boiler plate here)

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@alyssawilk
Copy link
Contributor Author

@davinci26 @wrowe any ideas on the windows failure? I've commented out with TODO to unblock this but I'm wondering if you have ideas what went wrong. For example are there known gotchas with file paths on windows CI?

snowp
snowp previously approved these changes Aug 24, 2021
Copy link
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG! We definitely need to figure out the Windows thing, but I'm okay with doing that in a follow up

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@davinci26
Copy link
Member

@alyssawilk I am not sure, I will have to debug locally but it is super weird because the same test passes on windows with clang.

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@alyssawilk
Copy link
Contributor Author

coverage flake unrelated: from a combination of quic refactor and setec patches
I've got #17846 to hep so I'm going to merge over it.

@alyssawilk alyssawilk merged commit 16879db into envoyproxy:main Aug 25, 2021
mpuncel added a commit to mpuncel/envoy that referenced this pull request Aug 25, 2021
* main:
  config: fix dfp config validation (envoyproxy#17835)
  docs: updating where meetings are uploaded (envoyproxy#17832)
  h2: moving a comment (envoyproxy#17846)
  quiche: early fail listener config with both quic and connection_balencer (envoyproxy#17834)
  dns: configuring a basic key value store to persist to disk (envoyproxy#17745)
  quic: fix receiving STOP_SENDING (envoyproxy#17815)
  tooling: Add Github release manager (envoyproxy#17741)
  tooling: Use upstream pytest-patches (envoyproxy#17809)
  Remove `hidden_envoy_deprecated_use_http2` (envoyproxy#17805)
  kafka: produce request for mesh-filter (envoyproxy#17818)

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
jmarantz pushed a commit that referenced this pull request Sep 14, 2021
Clean up inspired by #17745

Risk Level: low (interface refactor)
Testing: n/a
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@alyssawilk alyssawilk deleted the dns_cache branch August 4, 2022 01:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants