pd doesn't behave well under heavy sync load #2867
Comments
Maybe https://github.com/koute/bytehound would be useful here? I think a good goal would be that [...]
I don't think that the use of an explicit Box is the problem.

On the assumption that each connection is using the same amount of memory (a reasonable prior, though it could certainly not be the case), it might be sufficient to collect a heap profile of what happens when a single client connection is held open. If there's significant per-connection memory overhead, that's a bug, and it could show up without having to load many connections.
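To make the single-connection experiment concrete, here is a minimal sketch of a client that opens one CompactBlockRange stream against a local pd and simply holds it open, so a heap profiler attached to pd can measure per-connection overhead. The module path, request fields, and port are assumptions based on the RPC names mentioned in this issue, not a verbatim copy of the penumbra_proto API.

```rust
// Hypothetical single-connection load: open one CompactBlockRange stream and
// hold it open while profiling pd's heap. Names below are assumptions.
use penumbra_proto::client::v1alpha1::{
    oblivious_query_service_client::ObliviousQueryServiceClient, CompactBlockRangeRequest,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Assumed gRPC address for a locally running pd.
    let mut client = ObliviousQueryServiceClient::connect("http://127.0.0.1:8080").await?;

    let request = CompactBlockRangeRequest {
        chain_id: "penumbra-testnet".to_string(), // placeholder chain id
        start_height: 1,                          // placeholder range bounds
        end_height: 0,
        // keep_alive keeps the stream open after catching up, so the
        // connection stays live while we inspect the server's memory.
        keep_alive: true,
    };

    let mut stream = client.compact_block_range(request).await?.into_inner();
    while let Some(block) = stream.message().await? {
        // Drain slowly and discard payloads so the client itself stays small.
        drop(block);
        tokio::time::sleep(std::time::Duration::from_millis(50)).await;
    }
    Ok(())
}
```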
Used this script to generate loadtests against a local pd instance, for memory profiling efforts. Steps I used to test:

pd testnet unsafe-reset-all
pd testnet join https://rpc.testnet-preview.penumbra.zone
# start pd & tendermint, wait for genesis to complete
./deployments/scripts/pcli-client-test -n 1000 -p 100

Refs #2867.
It seems we do have a memory leak in pd. I can fairly easily get pd to consume a few gigs of memory if I bombard it with sync requests from multiple clients. Using bytehound to profile, as recommended above, we see some never-freed allocations, and bytehound itself flags these as leaks.

Unfortunately I don't yet have a root cause, but I will spend more time with the stack traces and try to piece together a clearer story. The bytehound guide gives a walkthrough of how to perform this kind of investigation. I pushed the testing script I used (a very simple loop over pcli invocations).
Nice digging! Not sure if it will be helpful, but one thought about isolating a cause could be: [...]

This way, we might be able to get information about what memory is used by a single long-lived connection.
Ahoy, thar she blows! Reading through the stack trace associated with that leak, I see a lot of rocksdb references, so I'm guessing that we're not dropping a db handle in a service worker somewhere.

(full stack trace for graphed leak)

I also saw a smaller leak that may be related to tracing instrumentation, or else I'm mistaken in reading the backtrace. Here's a PDF of the full report I generated this morning, mostly for posterity in reproducing these steps in future debugging sessions; the formatting's a bit wonky. Separately I'll paste in the console code I adapted from the bytehound guide, since that'll be fairly easy to copy/paste in the future.

(bytehound memleak console scripting)
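To illustrate the "not dropping a db handle in a service worker" hypothesis, here is a hedged sketch of the suspected pattern and the corresponding fix. The types are made-up stand-ins, not pd's actual storage or service code.

```rust
use tokio::sync::mpsc;

// Hypothetical stand-ins for pd's storage and block types; not the real API.
struct Snapshot {
    // In the real code this would hold a rocksdb/storage handle.
    _handle: Vec<u8>,
}
struct CompactBlock(Vec<u8>);

impl Snapshot {
    fn compact_blocks(&self, start: u64, end: u64) -> Vec<CompactBlock> {
        // Imagine this reads the requested block range out of storage.
        let _ = (start, end);
        Vec::new()
    }
}

// The suspected shape of the bug: the worker keeps `snapshot` (and whatever
// backing memory it pins) alive for the entire life of the client stream,
// because it is only dropped when the whole future completes.
async fn stream_blocks(snapshot: Snapshot, tx: mpsc::Sender<CompactBlock>, start: u64, end: u64) {
    let blocks = snapshot.compact_blocks(start, end);

    // Explicitly drop the handle before the (potentially very long) send loop,
    // instead of holding it until the client disconnects.
    drop(snapshot);

    for block in blocks {
        if tx.send(block).await.is_err() {
            // The client went away; stop early.
            break;
        }
    }
}
```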
Initial drops in the linked PR seem to help, but aren't sufficient. Encountered a new leak (stack trace attached).

Currently pairing with @erwanor.
The second leak is more mysterious to me, since the allocation is happening inside the Tonic stack, and I'm not sure why it would be growing a [...]
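One way a buffer inside the Tonic stack can grow is when the server produces responses faster than a slow client drains them. Below is a generic sketch of the usual countermeasure, a bounded channel wrapped in a ReceiverStream so the producer waits for capacity instead of buffering without limit; the response type is a placeholder and this is not pd's actual handler.

```rust
use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;
use tonic::Status;

// Placeholder standing in for the generated protobuf response message.
struct CompactBlockRangeResponse {
    height: u64,
}

type BlockStream = ReceiverStream<Result<CompactBlockRangeResponse, Status>>;

fn spawn_block_stream(heights: std::ops::Range<u64>) -> BlockStream {
    // A small bound means a slow client exerts backpressure on the producer
    // task: `send` waits for capacity instead of letting responses pile up
    // in an ever-growing in-memory buffer.
    let (tx, rx) = mpsc::channel(16);

    tokio::spawn(async move {
        for height in heights {
            if tx.send(Ok(CompactBlockRangeResponse { height })).await.is_err() {
                // The receiver (and thus the client connection) is gone.
                break;
            }
        }
    });

    ReceiverStream::new(rx)
}
```

The channel capacity trades a little latency for a bound on how many responses can queue per connection before the producer has to wait.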
Trying to resolve a memory leak in pd. These manual drops are a first pass, and appear to reduce memory consumption, but there's still a leak, according to bytehound. We'll continue to investigate. Includes a missed `to_proto` that's a nice-to-have, but likely doesn't constitute a fix for our problem. Refs #2867.

Co-Authored-By: Henry de Valence <hdevalence@penumbralabs.xyz>
Recent related changes: [...]

We shipped point releases as 0.56.1 and 0.57.1 to evaluate performance improvements. At least one more PR should land in time for 0.58.0 (#2888).
Moving this issue back to [...]
Closing as completed since we addressed the memory leaks that were causing the original problem. While there is more work to do, it can be tracked in later issues.
Today on Testnet 56 we observed a large spike in client traffic to the pd endpoint at https://grpc.testnet.penumbra.zone. As for the provenance of the traffic, let's assume it's organic interest, in the form of many people downloading the web extension and synchronizing blocks for the first time. After about an hour or two, memory consumption (in the pd container specifically) balloons to the point that OOMkiller kicks in and terminates the pod. An example of resource consumption shortly before the kill: [...]

According to the logs, pd is serving a lot of two types of requests, CompactBlockRange and ValidatorInfoStream: [...]

Intriguingly, both of those are Boxed return values in our RPCs. Also intriguing is this comment in penumbra/crates/bin/pd/src/info/oblivious.rs (lines 332 to 336 in bfda3a8): [...]

We need to understand why pd consumes large amounts of memory when handling these types of concurrent requests. For now, I'm assuming the traffic is well-formed: honest clients trying to synchronize.
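For reference, here is a minimal sketch of what a Boxed return value for a server-streaming RPC typically looks like in tonic; the type names are placeholders rather than the generated pd API.

```rust
use std::pin::Pin;
use futures::Stream;
use tonic::{Response, Status};

// Placeholder standing in for the generated protobuf response message.
struct CompactBlockRangeResponse;

// Server-streaming tonic methods declare an associated stream type; boxing it
// erases the concrete stream behind a trait object. The Box itself is one
// small allocation per call, so on its own it is unlikely to explain
// multi-gigabyte growth; what matters is what the boxed stream captures
// (buffers, db handles) and how long it lives.
type CompactBlockRangeStream =
    Pin<Box<dyn Stream<Item = Result<CompactBlockRangeResponse, Status>> + Send>>;

fn boxed_response(
    inner: impl Stream<Item = Result<CompactBlockRangeResponse, Status>> + Send + 'static,
) -> Response<CompactBlockRangeStream> {
    Response::new(Box::pin(inner))
}
```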