
Batch repository writes, improve repository scalability, and implement paged items retrieval #225

Merged 10 commits into main from bmurri/batch-repository-writes on Jul 7, 2023

Conversation

@BMurri (Collaborator) commented May 12, 2023

Addresses #138

@BMurri requested a review from MattMcL4475 on May 12, 2023 20:21
@MattMcL4475 (Collaborator) commented May 12, 2023

Wow, massive PR (functionality-wise), thanks for doing this. This does result in two different caching mechanisms, though. Why did we need a distributed cache if TES is never currently run with multiple concurrent instances? What are your thoughts about having two separate caching mechanisms that don't coordinate on memory usage?

Also, can you run the Common Workflows integration test on this, as well as run a 100 workflow mutect integration test, and check the logs for any exceptions? With such extensive changes to the data layer, it will be good to put this through extensive testing and log analysis.

Also, it seems that TES task creates from the API are no longer atomic and could result in loss of data, since the writes are added to a queue, right? Are there other new loss of data risks? What can be done to eliminate or reduce those possibilities?
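To make the atomicity concern concrete, here is a minimal sketch of queue-based write batching of the kind this PR is discussed as introducing. The `BatchedRepositoryWriter` type and its members are assumptions for illustration, not the PR's actual code; the point is that the create path completes once the item is enqueued, so the data-loss window is the gap between enqueueing and the batch flush.

```csharp
// Minimal sketch, assuming a Channel-based write queue; names are illustrative,
// not this PR's actual types. EnqueueWriteAsync completes once the item is
// queued, so the API can report success before the batch reaches the database.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed class BatchedRepositoryWriter<T>
{
    private readonly Channel<T> _pendingWrites = Channel.CreateUnbounded<T>();
    private readonly Func<IReadOnlyList<T>, Task> _flushBatchAsync;

    public BatchedRepositoryWriter(Func<IReadOnlyList<T>, Task> flushBatchAsync)
        => _flushBatchAsync = flushBatchAsync;

    // Returns as soon as the item is queued, not when it is persisted.
    public ValueTask EnqueueWriteAsync(T item, CancellationToken cancellationToken)
        => _pendingWrites.Writer.WriteAsync(item, cancellationToken);

    // Background loop: drain whatever has accumulated and persist it in one batch.
    public async Task FlushLoopAsync(CancellationToken cancellationToken)
    {
        while (await _pendingWrites.Reader.WaitToReadAsync(cancellationToken))
        {
            var batch = new List<T>();

            while (_pendingWrites.Reader.TryRead(out var item))
            {
                batch.Add(item);
            }

            await _flushBatchAsync(batch);
        }
    }
}
```

Flushing many writes per round trip is what improves repository scalability; the trade-off is the durability gap raised above.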

@BMurri (Collaborator, Author) commented May 12, 2023

> Why did we need a distributed cache if TES is never currently run with multiple concurrent instances?

Our TES has, for a very long time, supported standing up multiple instances pointing to the same repository to provide API services to clients, with all but one configured with batchScheduling.disable (aka DisableBatchScheduling) set to true. Since every shared instance must be able to write to the repository (in order to service cancellations), if we cache at all, that cache needs to be distributed. If that's something we really don't want to support, we should remove that configuration option.
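As a rough illustration of that deployment shape (the option and type names below are assumptions, not the actual TES code): only the instance without batchScheduling.disable set runs the scheduling loop, while every instance keeps serving API calls against the shared repository.

```csharp
// Illustrative sketch only; option and type names are assumptions, not TES code.
// With several instances sharing one repository, every instance serves API
// requests (including cancellations), but only the instance that does not have
// batchScheduling.disable set runs the scheduling loop.
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class BatchSchedulingOptions
{
    public bool Disable { get; set; } // corresponds to batchScheduling.disable / DisableBatchScheduling
}

public sealed class BatchScheduler
{
    private readonly BatchSchedulingOptions _options;

    public BatchScheduler(BatchSchedulingOptions options) => _options = options;

    public async Task RunAsync(CancellationToken cancellationToken)
    {
        if (_options.Disable)
        {
            // API-only instance: never schedules, but still reads from and
            // writes to (e.g. cancels tasks in) the shared repository.
            return;
        }

        while (!cancellationToken.IsCancellationRequested)
        {
            // Pick up queued tasks from the shared repository and schedule them.
            await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken);
        }
    }
}
```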

> What are your thoughts about having two separate caching mechanisms that don't coordinate on memory usage?

Since one is distributed and the other is always local to the process, I'm not sure how they could ever coordinate. There are knobs we can twist on the instances and things we can do to monitor them. So far, I've never seen more than ~19% memory usage (and that's during spikes).

> Also, can you run the Common Workflows integration test on this, as well as run a 100 workflow mutect integration test, and check the logs for any exceptions?

Absolutely. I've run the 100 mutect integration test quite a few times with this code in the context of the parallelization solution, but I will run it separately for this PR.

@MattMcL4475 (Collaborator) commented May 12, 2023

> Our TES has, for a very long time, supported standing up multiple instances pointing to the same repository to provide API services to clients, with all but one configured with batchScheduling.disable (aka DisableBatchScheduling) set to true. Since every shared instance must be able to write to the repository (in order to service cancellations), if we cache at all, that cache needs to be distributed. If that's something we really don't want to support, we should remove that configuration option.

Not sure why this got added in the first place though, and it's not a scenario we are currently supporting to my knowledge.

> Since one is distributed and the other is always local to the process, I'm not sure how they could ever coordinate. There are knobs we can twist on the instances and things we can do to monitor them. So far, I've never seen more than ~19% memory usage (and that's during spikes).

Wouldn't you agree that it doesn't make sense to have two cache implementations? It seems that by adding the distributed cache but not implementing it for everything else, it's only half implemented. Also, the distributed cache will use local memory, right? You don't plan to add a Redis cache to the deployer (cost), right? Agreed, I've never seen memory be an issue in TES. It's more a question of consistency and simplicity, and of what is requiring these changes.

@MattMcL4475 (Collaborator) commented

Also, what are your thoughts on this?

> Also, it seems that TES task creates from the API are no longer atomic and could result in loss of data, since the writes are added to a queue, right? Are there other new loss of data risks? What can be done to eliminate or reduce those possibilities?

Maybe the API controller shouldn't use the cached writes.

@BMurri (Collaborator, Author) commented May 12, 2023

> Also, it seems that TES task creates from the API are no longer atomic and could result in loss of data, since the writes are added to a queue, right? Are there other new loss of data risks? What can be done to eliminate or reduce those possibilities? Maybe the API controller shouldn't use the cached writes.

If we remove the batchScheduling.disable support, we eliminate the need for a distributed cache and can share the in-process cache, which simplifies the task of ensuring atomic transactions. I'll add that to this PR.

Thinking about this, writes should actually update the cache before writing to the DB (instead of after), making the cache the one true source for all objects it contains. We currently add/update the cache with every item we read, so that would give us an opportunity to check whether the cache and the DB are out of sync. That leaves two parallel operations (one or more from the controller and one from the scheduler) that might be in flight at the same time. Right now, we don't have an atomic way of knowing that we're queueing the same task more than once, but if we accept a little locking, we could discover that fact and deal with it then.
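A minimal sketch of that cache-first approach, with illustrative names rather than this PR's implementation, might look like the following: the cache entry is written before the database write is queued, and an atomic check on the set of queued ids stands in for the "little locking" that detects queueing the same task twice.

```csharp
// Minimal sketch of the cache-first idea, assuming an in-process cache; names
// are illustrative, not this PR's implementation. The cache is updated before
// the DB write is queued, and an atomic TryAdd on the set of queued ids plays
// the role of the "little locking" that detects queueing the same task twice.
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class CacheFirstRepository<T> where T : class
{
    private readonly ConcurrentDictionary<string, T> _cache = new();
    private readonly ConcurrentDictionary<string, byte> _queuedIds = new();
    private readonly Func<T, Task> _queueDatabaseWriteAsync;

    public CacheFirstRepository(Func<T, Task> queueDatabaseWriteAsync)
        => _queueDatabaseWriteAsync = queueDatabaseWriteAsync;

    public async Task UpsertAsync(string id, T item)
    {
        // 1. The cache becomes the source of truth before anything hits the DB.
        _cache[id] = item;

        // 2. TryAdd is atomic, so a concurrent caller (controller vs. scheduler)
        //    discovers that the same id is already queued and can deal with it.
        if (!_queuedIds.TryAdd(id, 0))
        {
            throw new InvalidOperationException($"Task {id} is already queued for writing.");
        }

        try
        {
            await _queueDatabaseWriteAsync(item);
        }
        finally
        {
            _queuedIds.TryRemove(id, out _);
        }
    }

    // Reads are served from the cache; each read could also be compared with
    // the DB row to spot cache/DB divergence, as described above.
    public T? TryGet(string id) => _cache.TryGetValue(id, out var item) ? item : null;
}
```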

@BMurri (Collaborator, Author) commented May 12, 2023

> Not sure why this got added in the first place though, and it's not a scenario we are currently supporting to my knowledge.

It was present before I joined the team, but I'd be happy to remove it. Removing it would get rid of any and all reason to have a distributed cache in the first place. I looked for other options for this cache before realizing it needed to be distributed, but by removing this feature, that need vanishes entirely.

Tony knew about it. I discussed its intended purpose with him when I discovered it in the code, not long after I started.

@BMurri marked this pull request as draft on May 18, 2023 21:06
@BMurri marked this pull request as ready for review on May 18, 2023 21:07
@BMurri merged commit 10a75b5 into main on Jul 7, 2023
@BMurri deleted the bmurri/batch-repository-writes branch on July 7, 2023 00:38
@MattMcL4475 restored the bmurri/batch-repository-writes branch on July 17, 2023 18:11
@MattMcL4475 deleted the bmurri/batch-repository-writes branch on October 23, 2023 21:06