Batch repository writes, improve repository scalability, and implement paged items retrieval #225
Conversation
Wow, massive PR (functionality-wise), thanks for doing this. It does result in two different caching mechanisms, though. Why do we need a distributed cache if TES is never currently run with multiple concurrent instances? What are your thoughts on having two separate caching mechanisms that don't coordinate on memory usage? Also, can you run the Common Workflows integration test on this, as well as a 100-workflow mutect integration test, and check the logs for any exceptions? With such extensive changes to the data layer, it will be good to put this through extensive testing and log analysis. Also, it seems that TES task creates from the API are no longer atomic and could result in data loss, since the writes are added to a queue, right? Are there other new data-loss risks? What can be done to eliminate or reduce those possibilities?
Our TES has, for a very long time, supported standing up multiple instances pointing to the same repository, providing API services to clients, with all but one configured with
Since one cache is distributed and the other is always local to the process, I'm not sure how they could ever coordinate. There are knobs we can turn on the instances and things we can do to monitor them. So far, I've never seen more than ~19% memory usage (and that's during spikes).
Absolutely. I've run the 100-workflow mutect integration test quite a few times with this code in the context of the parallelization solution, but I will run it separately via this PR.
I'm not sure why this got added in the first place, though, and it's not a scenario we are currently supporting, to my knowledge.
Wouldn't you agree that it doesn't make sense to have two cache implementations? By adding the distributed cache but not using it for everything else, it seems only half implemented. Also, the distributed cache will use local memory, right? You don't plan to add a Redis cache to the deployer (cost), right? Agreed, I've never seen memory be an issue in TES; it's more a question of consistency and simplicity, and of what is requiring these changes.
Also, what are your thoughts on this?
By removing the

Thinking about this: writes should actually update the cache before writing to the DB (instead of after), making the cache the one true source for all objects it contains. We currently add/update the cache with every item we read, so that would give us an opportunity to check whether the cache and DB are out of sync. That leaves two parallel operations (one or more from the controller and one from the scheduler) that might be in flight at the same time. Right now we don't have an atomic way of knowing that we're queueing the same task more than once, but if we accept a little locking, we could discover that fact and deal with it then (see the sketch below).
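For illustration, here's a minimal sketch of that write-through-plus-locking idea, in C# on the assumption that that matches the repository's language. Every name in it (`TesTask`, `WriteThroughCache`, `TryQueueWrite`, `CompleteWrite`) is a hypothetical placeholder, not the PR's actual code:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical stand-in for TES's task entity (not the repo's actual type).
public sealed record TesTask(string Id, string State);

public sealed class WriteThroughCache
{
    // Updated *before* the DB write is queued, so the cache is the
    // one true source for every object it contains.
    private readonly ConcurrentDictionary<string, TesTask> cache = new();

    // Writes destined for the repository, drained by a background writer.
    private readonly ConcurrentQueue<TesTask> pendingWrites = new();

    // Ids of writes currently in flight, checked under a lock so that a
    // second queueing of the same task (controller vs. scheduler) is noticed.
    private readonly HashSet<string> inFlight = new();
    private readonly object gate = new();

    // Returns false when the same task is already queued, letting the
    // caller discover that fact and deal with it then.
    public bool TryQueueWrite(TesTask task)
    {
        lock (gate)
        {
            if (!inFlight.Add(task.Id))
            {
                return false; // duplicate in-flight write detected
            }

            cache[task.Id] = task;       // cache first...
            pendingWrites.Enqueue(task); // ...then queue the DB write
            return true;
        }
    }

    // Reads go through the cache; comparing a fresh DB read against this
    // entry is the opportunity to detect cache/DB drift mentioned above.
    public bool TryGet(string id, out TesTask? task) => cache.TryGetValue(id, out task);

    // Called by the background writer once the DB write has committed.
    public void CompleteWrite(string id)
    {
        lock (gate)
        {
            inFlight.Remove(id);
        }
    }
}
```

Note the lock only guards the cache update and the queueing bookkeeping, not the DB write itself, so the critical section stays small; whether even that little locking is acceptable is exactly the trade-off being weighed here.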
It was present before I joined the team, but I'd be happy to remove it. Removing it would get rid of any and all reason to have a distributed cache in the first place. I looked for other options for this cache before realizing it needed to be distributed, but by removing this feature that need vanishes entirely. Tony knew about it; I discussed its intended purpose with him when I discovered it in the code, not long after I started.
Addresses #138