
Batch repository writes, improve repository scalability, and implement paged items retrieval #225

Merged 10 commits into main from bmurri/batch-repository-writes on Jul 7, 2023

Conversation

@BMurri (Collaborator) commented May 12, 2023

Addresses #138

@BMurri requested a review from MattMcL4475 on May 12, 2023 20:21
@MattMcL4475 (Collaborator) commented May 12, 2023

Wow, massive PR (functionality-wise), thanks for doing this. This does result in two different caching mechanisms, though. Why did we need a distributed cache if TES is never currently run with multiple concurrent instances? What are your thoughts about having two separate caching mechanisms that don't coordinate on memory usage?

Also, can you run the Common Workflows integration test on this, as well as run a 100 workflow mutect integration test, and check the logs for any exceptions? With such extensive changes to the data layer, it will be good to put this through extensive testing and log analysis.

Also, it seems that TES task creates from the API are no longer atomic and could result in loss of data, since the writes are added to a queue, right? Are there other new loss of data risks? What can be done to eliminate or reduce those possibilities?
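To make the atomicity concern concrete, here is a minimal sketch of queue-based write batching of the kind this PR is discussed as introducing. The `BatchedRepositoryWriter` type and its members are assumptions for illustration, not the PR's actual code; the point is that the create path completes once the item is enqueued, so the data-loss window is the gap between enqueueing and the batch flush.

```csharp
// Minimal sketch, assuming a Channel-based write queue; names are illustrative,
// not this PR's actual types. EnqueueWriteAsync completes once the item is
// queued, so the API can report success before the batch reaches the database.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed class BatchedRepositoryWriter<T>
{
    private readonly Channel<T> _pendingWrites = Channel.CreateUnbounded<T>();
    private readonly Func<IReadOnlyList<T>, Task> _flushBatchAsync;

    public BatchedRepositoryWriter(Func<IReadOnlyList<T>, Task> flushBatchAsync)
        => _flushBatchAsync = flushBatchAsync;

    // Returns as soon as the item is queued, not when it is persisted.
    public ValueTask EnqueueWriteAsync(T item, CancellationToken cancellationToken)
        => _pendingWrites.Writer.WriteAsync(item, cancellationToken);

    // Background loop: drain whatever has accumulated and persist it in one batch.
    public async Task FlushLoopAsync(CancellationToken cancellationToken)
    {
        while (await _pendingWrites.Reader.WaitToReadAsync(cancellationToken))
        {
            var batch = new List<T>();

            while (_pendingWrites.Reader.TryRead(out var item))
            {
                batch.Add(item);
            }

            await _flushBatchAsync(batch);
        }
    }
}
```

Flushing many writes per round trip is what improves repository scalability; the trade-off is the durability gap raised above.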

@BMurri (Collaborator, Author) commented May 12, 2023

> Why did we need a distributed cache if TES is never currently run with multiple concurrent instances?

Our TES has, for a very long time, supported standing up multiple instances pointing to the same repository to provide API services to clients, with all but one configured with batchScheduling.disable (aka DisableBatchScheduling) set to true. Since every shared instance must be able to write to the repository (in order to service cancellations), if we cache at all, that cache needs to be distributed. If that's something we really don't want to support, we should remove that configuration option.
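As a rough illustration of that deployment shape (the option and type names below are assumptions, not the actual TES code): only the instance without batchScheduling.disable set runs the scheduling loop, while every instance keeps serving API calls against the shared repository.

```csharp
// Illustrative sketch only; option and type names are assumptions, not TES code.
// With several instances sharing one repository, every instance serves API
// requests (including cancellations), but only the instance that does not have
// batchScheduling.disable set runs the scheduling loop.
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class BatchSchedulingOptions
{
    public bool Disable { get; set; } // corresponds to batchScheduling.disable / DisableBatchScheduling
}

public sealed class BatchScheduler
{
    private readonly BatchSchedulingOptions _options;

    public BatchScheduler(BatchSchedulingOptions options) => _options = options;

    public async Task RunAsync(CancellationToken cancellationToken)
    {
        if (_options.Disable)
        {
            // API-only instance: never schedules, but still reads from and
            // writes to (e.g. cancels tasks in) the shared repository.
            return;
        }

        while (!cancellationToken.IsCancellationRequested)
        {
            // Pick up queued tasks from the shared repository and schedule them.
            await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken);
        }
    }
}
```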

> What are your thoughts about having two separate caching mechanisms that don't coordinate on memory usage?

Since one is distributed and the other is always local to the process, I'm not sure how they could ever coordinate. There are knobs we can twist on the instances and things we can do to monitor them. So far, I've never seen more than ~19% memory usage (and that's during spikes).

> Also, can you run the Common Workflows integration test on this, as well as run a 100 workflow mutect integration test, and check the logs for any exceptions?

Absolutely. I've run the 100 mutect integration test quite a few times with this code in the context of the parallelization solution, but I will run it separately for this PR.

@MattMcL4475 (Collaborator) commented May 12, 2023

> Our TES has, for a very long time, supported standing up multiple instances pointing to the same repository to provide API services to clients, with all but one configured with batchScheduling.disable (aka DisableBatchScheduling) set to true. Since every shared instance must be able to write to the repository (in order to service cancellations), if we cache at all, that cache needs to be distributed. If that's something we really don't want to support, we should remove that configuration option.

Not sure why this got added in the first place though, and it's not a scenario we are currently supporting to my knowledge.

> Since one is distributed and the other is always local to the process, I'm not sure how they could ever coordinate. There are knobs we can twist on the instances and things we can do to monitor them. So far, I've never seen more than ~19% memory usage (and that's during spikes).

Wouldn't you agree that it doesn't make sense to have two cache implementations? It seems that by adding the distributed cache but not implementing it for everything else, it's only half implemented. Also, the distributed cache will use local memory, right? You don't plan to add a Redis cache to the deployer (cost), right? Agreed, I've never seen memory be an issue in TES. It's more a question of consistency and simplicity, and of what is requiring these changes.

@MattMcL4475 (Collaborator) commented

Also, what are your thoughts on this?

> Also, it seems that TES task creates from the API are no longer atomic and could result in loss of data, since the writes are added to a queue, right? Are there other new loss of data risks? What can be done to eliminate or reduce those possibilities?

Maybe the API controller shouldn't use the cached writes.

@BMurri (Collaborator, Author) commented May 12, 2023

> Also, it seems that TES task creates from the API are no longer atomic and could result in loss of data, since the writes are added to a queue, right? Are there other new loss of data risks? What can be done to eliminate or reduce those possibilities? Maybe the API controller shouldn't use the cached writes.

If we remove the batchScheduling.disable support, we eliminate the need for a distributed cache and can share the in-process cache, which simplifies the task of ensuring atomic transactions. I'll add that to this PR.

Thinking about this, writes should actually update the cache before writing to the DB (instead of after), making the cache the one true source for all objects it contains. We currently add/update the cache with every item we read, so that would give us an opportunity to check whether the cache and the DB are out of sync. That leaves two parallel operations (one or more from the controller and one from the scheduler) that might be in flight at the same time. Right now, we don't have an atomic way of knowing that we're queueing the same task more than once, but if we accept a little locking, we could discover that fact and deal with it then.
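A minimal sketch of that cache-first approach, with illustrative names rather than this PR's implementation, might look like the following: the cache entry is written before the database write is queued, and an atomic check on the set of queued ids stands in for the "little locking" that detects queueing the same task twice.

```csharp
// Minimal sketch of the cache-first idea, assuming an in-process cache; names
// are illustrative, not this PR's implementation. The cache is updated before
// the DB write is queued, and an atomic TryAdd on the set of queued ids plays
// the role of the "little locking" that detects queueing the same task twice.
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class CacheFirstRepository<T> where T : class
{
    private readonly ConcurrentDictionary<string, T> _cache = new();
    private readonly ConcurrentDictionary<string, byte> _queuedIds = new();
    private readonly Func<T, Task> _queueDatabaseWriteAsync;

    public CacheFirstRepository(Func<T, Task> queueDatabaseWriteAsync)
        => _queueDatabaseWriteAsync = queueDatabaseWriteAsync;

    public async Task UpsertAsync(string id, T item)
    {
        // 1. The cache becomes the source of truth before anything hits the DB.
        _cache[id] = item;

        // 2. TryAdd is atomic, so a concurrent caller (controller vs. scheduler)
        //    discovers that the same id is already queued and can deal with it.
        if (!_queuedIds.TryAdd(id, 0))
        {
            throw new InvalidOperationException($"Task {id} is already queued for writing.");
        }

        try
        {
            await _queueDatabaseWriteAsync(item);
        }
        finally
        {
            _queuedIds.TryRemove(id, out _);
        }
    }

    // Reads are served from the cache; each read could also be compared with
    // the DB row to spot cache/DB divergence, as described above.
    public T? TryGet(string id) => _cache.TryGetValue(id, out var item) ? item : null;
}
```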

@BMurri (Collaborator, Author) commented May 12, 2023

> Not sure why this got added in the first place though, and it's not a scenario we are currently supporting to my knowledge.

It was present before I joined the team, but I'd be happy to remove it. Removing it would get rid of any and all reason to have a distributed cache in the first place. I looked for other options for this cache before realizing it needed to be distributed, but by removing this feature, that need vanishes entirely.

Tony knew about it. I discussed its intended purpose with him when I discovered it in the code, not long after I started.

@BMurri marked this pull request as draft on May 18, 2023 21:06
@BMurri marked this pull request as ready for review on May 18, 2023 21:07
@BMurri merged commit 10a75b5 into main on Jul 7, 2023
@BMurri deleted the bmurri/batch-repository-writes branch on July 7, 2023 00:38
@MattMcL4475 restored the bmurri/batch-repository-writes branch on July 17, 2023 18:11
@MattMcL4475 deleted the bmurri/batch-repository-writes branch on October 23, 2023 21:06