pd doesn't behave well under heavy sync load #2867

Closed
conorsch opened this issue Jul 24, 2023 · 10 comments

@conorsch
Contributor

Today on Testnet 56 we observed a large spike in client traffic to the pd endpoint at https://grpc.testnet.penumbra.zone.

[graph: active-client-conns-2]

As for provenance of the traffic, let's assume it's organic interest, in the form of many people downloading the web extension and synchronizing blocks for the first time. After about an hour or two, memory consumption—in the pd container specifically—balloons to the point that OOMkiller kicks in and terminates the pod. An example of resource consumption shortly before kill:

❯ kubectl top pod penumbra-testnet-fn-0-54d9f68d76-kl9gm --containers
POD                                      NAME   CPU(cores)   MEMORY(bytes)   
penumbra-testnet-fn-0-54d9f68d76-kl9gm   pd     3055m        11249Mi         
penumbra-testnet-fn-0-54d9f68d76-kl9gm   tm     47m          162Mi 

According to the logs, pd is serving a large volume of two types of requests, CompactBlockRange and ValidatorInfoStream:

❯ wc -l logs-penumbra-testnet-fn-0-54d9f68d76-kl9gm.txt
8329 logs-penumbra-testnet-fn-0-54d9f68d76-kl9gm.txt
❯ rg validator_info logs-penumbra-testnet-fn-0-54d9f68d76-kl9gm.txt -c
4639
❯ rg -i compact_block_range logs-penumbra-testnet-fn-0-54d9f68d76-kl9gm.txt -c
2107

Intriguingly, both of those types are Boxed return values in our RPCs. Also intriguing is this comment:

// TODO: eventually, we may want to register joinhandles or something
// and be able to track how many open connections we have, drop them to
// manage load, etc.
//
// for now, assume that we can do c10k or whatever and don't worry about it.

We need to understand why pd consumes large amounts of memory when handling these types of concurrent requests. For now, I'm assuming the traffic is well-formed, honest clients trying to synchronize.
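
For reference, the kind of connection tracking that TODO gestures at could be as small as an owned semaphore permit held for the lifetime of each response stream. The sketch below is hypothetical (the `StreamLimiter` type and the limit are made up, not anything pd implements today):

use std::sync::Arc;

use tokio::sync::{OwnedSemaphorePermit, Semaphore};
use tonic::Status;

/// Hypothetical cap on concurrently open streaming RPCs; pd does not do this today.
#[derive(Clone)]
struct StreamLimiter {
    slots: Arc<Semaphore>,
}

impl StreamLimiter {
    fn new(max_streams: usize) -> Self {
        Self {
            slots: Arc::new(Semaphore::new(max_streams)),
        }
    }

    /// Called by each streaming handler before it builds its response stream;
    /// the permit is moved into the stream so the slot is held until the
    /// stream is dropped (client disconnect or completion).
    fn acquire(&self) -> Result<OwnedSemaphorePermit, Status> {
        self.slots
            .clone()
            .try_acquire_owned()
            .map_err(|_| Status::resource_exhausted("too many open sync streams"))
    }
}

Holding the permit inside the returned stream means the slot is released only when the client goes away or the stream completes, and `Semaphore::available_permits` gives a live count of open connections.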

@conorsch conorsch self-assigned this Jul 24, 2023
@conorsch conorsch moved this to In Progress (Already claimed) in Testnets Jul 24, 2023
@hdevalence
Member

Maybe https://github.com/koute/bytehound would be useful here? I think a good goal would be that pd can maintain many active connections without using excess memory.

@hdevalence
Member

I don't think that the use of an explicit Box is a meaningful indicator of what might be consuming significant per-connection memory. We should investigate using a heap profiler.

On the assumption that each connection is using the same amount of memory (a reasonable prior, though it could certainly not be the case), it might be sufficient to collect a heap profile of what happens when a single client connection is held open. If there's significant per-connection memory overhead, that's a bug, and it could show up without having to load many connections.

conorsch added a commit that referenced this issue Jul 25, 2023
Used this script to generate loadtests against a local pd instance, for
memory profiling efforts. Steps I used to test:

  pd testnet unsafe-reset-all
  pd testnet join https://rpc.testnet-preview.penumbra.zone
  # start pd & tendermint, wait for genesis to complete
  ./deployments/scripts/pcli-client-test -n 1000 -p 100

Refs #2867.
@conorsch
Contributor Author

It seems we do have a memory leak in pd. I can fairly easily get pd to consume a few gigs of memory if I bombard it with sync requests from multiple clients. Using bytehound to profile as recommended above, we see some never-freed allocations:

[bytehound graph: pd-mem-leak-bytehound-1]

We can also show that bytehound believes these are leaks:

[bytehound graph: pd-mem-leak-bytehound-2]

Unfortunately I don't yet have a root cause, but will spend more time with the stack traces and try to piece together a clearer story. The bytehound guide gives a walkthrough about how to perform this kind of investigation. I pushed the testing script I used (very simple loop over pcli operations) for visibility.

@hdevalence
Member

Nice digging!

Not sure if it will be helpful, but one thought about isolating a cause could be:

  • run pd + tendermint
  • shut off tendermint so pd is not processing any consensus messages or doing anything other than serving rpc requests
  • sync pclientd against the stalled pd until chain tip, then turn it off
  • start measuring pd
  • start pclientd, causing it to open a single long-lived connection
  • stop measuring pd

This way, we might be able to get information about what memory is used by a single long-lived connection.

@conorsch
Contributor Author

This way, we might be able to get information about what memory is used by a single long-lived connection.

Ahoy, thar she blows:

[bytehound graph: pd-pclientd-single-conn-leak]

Reading through the stack trace associated with that leak, I see a lot of rocksdb references, so I'm guessing that we're not dropping a db handle in a service worker somewhere.

full stack trace for graphed leak
#00 [libc.so.6] __clone3
#01 [libc.so.6] start_thread
#02 [libstdc++.so.6.0.30] 7f57e02dbc42
#03 [pd] std::_Function_handler<void (), rocksdb::VersionBuilder::Rep::LoadTableHandlers(rocksdb::InternalStats*, int, bool, bool, std::shared_ptr<rocksdb::SliceTransform const> const&, unsigned long)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
#04 [pd] rocksdb::TableCache::FindTable(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::BasicTypedCacheInterface<rocksdb::TableReader, (rocksdb::CacheEntryRole)13, rocksdb::Cache*>::TypedHandle**, std::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, rocksdb::HistogramImpl*, bool, int, bool, unsigned long, rocksdb::Temperature)
#05 [pd] rocksdb::TableCache::GetTableReader(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, bool, bool, rocksdb::HistogramImpl*, std::unique_ptr<rocksdb::TableReader, std::default_delete<rocksdb::TableReader>>*, std::shared_ptr<rocksdb::SliceTransform const> const&, bool, int, bool, unsigned long, rocksdb::Temperature)
#06 [pd] rocksdb::BlockBasedTableFactory::NewTableReader(rocksdb::ReadOptions const&, rocksdb::TableReaderOptions const&, std::unique_ptr<rocksdb::RandomAccessFileReader, std::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::unique_ptr<rocksdb::TableReader, std::default_delete<rocksdb::TableReader>>*, bool) const [clone .localalias]
#07 [pd] rocksdb::BlockBasedTable::Open(rocksdb::ReadOptions const&, rocksdb::ImmutableOptions const&, rocksdb::EnvOptions const&, rocksdb::BlockBasedTableOptions const&, rocksdb::InternalKeyComparator const&, std::unique_ptr<rocksdb::RandomAccessFileReader, std::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::unique_ptr<rocksdb::TableReader, std::default_delete<rocksdb::TableReader>>*, std::shared_ptr<rocksdb::CacheReservationManager>, std::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, int, bool, unsigned long, bool, rocksdb::TailPrefetchStats*, rocksdb::BlockCacheTracer*, unsigned long, std::string const&, unsigned long, std::array<unsigned long, (unsigned long)2>)
#08 [pd] rocksdb::BlockBasedTable::PrefetchIndexAndFilterBlocks(rocksdb::ReadOptions const&, rocksdb::FilePrefetchBuffer*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, rocksdb::BlockBasedTable*, bool, rocksdb::BlockBasedTableOptions const&, int, unsigned long, unsigned long, rocksdb::BlockCacheLookupContext*)
#09 [pd] rocksdb::BlockBasedTable::CreateIndexReader(rocksdb::ReadOptions const&, rocksdb::FilePrefetchBuffer*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, bool, bool, bool, rocksdb::BlockCacheLookupContext*, std::unique_ptr<rocksdb::BlockBasedTable::IndexReader, std::default_delete<rocksdb::BlockBasedTable::IndexReader>>*)
#10 [pd] rocksdb::BinarySearchIndexReader::Create(rocksdb::BlockBasedTable const*, rocksdb::ReadOptions const&, rocksdb::FilePrefetchBuffer*, bool, bool, bool, rocksdb::BlockCacheLookupContext*, std::unique_ptr<rocksdb::BlockBasedTable::IndexReader, std::default_delete<rocksdb::BlockBasedTable::IndexReader>>*)
#11 [pd] rocksdb::BlockBasedTable::IndexReaderCommon::ReadIndexBlock(rocksdb::BlockBasedTable const*, rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<rocksdb::Block>*)
#12 [pd] std::enable_if<((rocksdb::Block_kIndex::kCacheEntryRole)==((rocksdb::CacheEntryRole)13))||(true), rocksdb::Status>::type rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::Block_kIndex>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::Block_kIndex>*, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool, bool, bool) const
#13 [pd] rocksdb::Status rocksdb::(anonymous namespace)::ReadAndParseBlockFromFile<rocksdb::Block_kIndex>(rocksdb::RandomAccessFileReader*, rocksdb::FilePrefetchBuffer*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, std::unique_ptr<rocksdb::Block_kIndex, std::default_delete<rocksdb::Block_kIndex>>*, rocksdb::ImmutableOptions const&, rocksdb::BlockCreateContext&, bool, rocksdb::UncompressionDict const&, rocksdb::PersistentCacheOptions const&, rocksdb::MemoryAllocator*, bool, bool)
#14 [pd] rocksdb::BlockFetcher::ReadBlockContents()
#15 [pd] rocksdb::UncompressSerializedBlock(rocksdb::UncompressionInfo const&, char const*, unsigned long, rocksdb::BlockContents*, unsigned int, rocksdb::ImmutableOptions const&, rocksdb::MemoryAllocator*)
#16 [pd] rocksdb::UncompressBlockData(rocksdb::UncompressionInfo const&, char const*, unsigned long, rocksdb::BlockContents*, unsigned int, rocksdb::ImmutableOptions const&, rocksdb::MemoryAllocator*)
#17 [libstdc++.so.6.0.30] operator new(unsigned long)
#18 [libbytehound.so] malloc [api.rs:294]
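
To make that suspicion concrete, a fix would look roughly like the sketch below: take a snapshot, read what's needed, and release it before awaiting the send, so the handle isn't pinned for however long the client takes to consume the stream. The `Storage`/`Snapshot` types here are stand-ins, not penumbra's real storage API:

use tokio::sync::mpsc;

// Stand-in types, only to make the shape of the change concrete; the real
// types live in penumbra's storage crate and wrap rocksdb resources.
struct Snapshot;
impl Snapshot {
    fn compact_block(&self, _height: u64) -> Vec<u8> {
        Vec::new()
    }
}
struct Storage;
impl Storage {
    fn latest_snapshot(&self) -> Snapshot {
        Snapshot
    }
}

// Read from the snapshot, then drop it *before* awaiting the send, so the
// db handle isn't held open across the (possibly slow) delivery to the client.
async fn send_block(
    storage: &Storage,
    height: u64,
    tx: &mpsc::Sender<Vec<u8>>,
) -> Result<(), mpsc::error::SendError<Vec<u8>>> {
    let snapshot = storage.latest_snapshot();
    let block = snapshot.compact_block(height);
    drop(snapshot); // release the handle eagerly
    tx.send(block).await
}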

I also saw a smaller leak that may be related to tracing instrumentation, or else I'm mistaken in reading the backtrace. Here's a PDF of the full report I generated this morning, mostly for posterity, so these steps can be reproduced in future debugging sessions:
pd-memory-profiling-testnet-56-report-1.pdf

Formatting's a bit wonky. Separately I'll paste in the console code I adapted from the bytehound guide, since that'll be easy to copy/paste in the future.

bytehound memleak console scripting
println("Temporary vs leaked allocations:");
graph()
    .add("Leaked", allocations().only_leaked())
    .add("Temporary", allocations())
    .save();

println("Sort leaks by size:");   
let groups = allocations()
    .only_leaked()
    .group_by_backtrace()
        .sort_by_size();
graph().add(groups).save();


// Function to group leaks based on common backtraces
fn analyze_group(list) {
    let list_all = allocations().only_matching_backtraces(list);
    graph()
        .add("Leaked", list_all.only_leaked())
        .add("Temporary", list_all)
        .save();
    println("Total: {}", list_all.len());
    println("Leaked: {}", list_all.only_leaked().len());
    println();
    println("Backtrace:");
    println(list_all[0].backtrace().strip());
}

println("Analyzing most common leak by backtrace:");
analyze_group(groups[0]);
println("Analyzing second most common leak by backtrace:");
analyze_group(groups[1]);
println("Analyzing third most common leak by backtrace:");
analyze_group(groups[2]);

@conorsch
Contributor Author

conorsch commented Jul 25, 2023

Initial drops in the linked PR seem to help, but aren't sufficient. Encountered a new leak:

[bytehound graph: pd-pclientd-leak-partially-fixed-1]

Stack trace:

#00 [libc.so.6] __clone3
#01 [libc.so.6] start_thread
#02 [pd] std::sys::unix::thread::Thread::new::thread_start [thread.rs:108]
#03 [pd] <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once [boxed.rs:1985]
#04 [pd] <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once [boxed.rs:1985]
#05 [pd] core::ops::function::FnOnce::call_once{{vtable.shim}}
#07 [pd] tokio::runtime::blocking::pool::Inner::run
#08 [pd] tokio::runtime::task::harness::Harness<T,S>::poll
#09 [pd] tokio::runtime::task::core::Core<T,S>::poll
#10 [pd] tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
#11 [pd] <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
#12 [pd] tokio::runtime::scheduler::multi_thread::worker::run
#13 [pd] tokio::runtime::context::runtime::enter_runtime
#14 [pd] tokio::runtime::context::scoped::Scoped<T>::set
#15 [pd] tokio::runtime::scheduler::multi_thread::worker::Context::run
#16 [pd] tokio::runtime::scheduler::multi_thread::worker::Context::run_task
#17 [pd] tokio::runtime::task::harness::Harness<T,S>::poll
#18 [pd] tokio::runtime::task::core::Core<T,S>::poll
#19 [pd] tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
#20 [pd] <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
#21 [pd] <hyper::proto::h2::server::H2Stream<F,B> as core::future::future::Future>::poll
#22 [pd] <hyper::proto::h2::PipeToSendStream<S> as core::future::future::Future>::poll
#23 [pd] <http_body::combinators::map_err::MapErr<B,F> as http_body::Body>::poll_data
#24 [pd] <http_body::combinators::map_err::MapErr<B,F> as http_body::Body>::poll_data
#25 [pd] <http_body::combinators::map_err::MapErr<B,F> as http_body::Body>::poll_data
#26 [pd] <tonic::codec::encode::EncodeBody<S> as http_body::Body>::poll_data
#27 [pd] <T as futures_util::fns::FnMut1<A>>::call_mut
#28 [pd] prost::message::Message::encode
#29 [pd] <penumbra_proto::penumbra::core::chain::v1alpha1::CompactBlock as prost::message::Message>::encode_raw
#30 [pd] prost::encoding::message::encode
#31 [pd] prost::encoding::<impl prost::encoding::sealed::BytesAdapter for alloc::vec::Vec<u8>>::append_to
#32 [pd] bytes::bytes_mut::BytesMut::reserve_inner
#33 [pd] alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
#34 [pd] alloc::raw_vec::finish_grow
#35 [libbytehound.so] realloc [api.rs:378]

Currently pairing with @erwanor.

@hdevalence
Member

The second leak is more mysterious to me, since the allocation is happening inside the Tonic stack, and I'm not sure why it would be growing a BytesMut and then keeping it around. I think it's worth merging a fix for the part of the problem that's clearly in our part of the stack.
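
Independently of the root cause, one hedge worth considering is to make sure streamed responses are produced behind a small bounded buffer, so a slow consumer applies backpressure instead of letting encoded responses accumulate. A minimal sketch with placeholder payloads, not pd's actual handler:

use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;
use tonic::Status;

// Serve a server-streaming RPC from a bounded channel so that a slow client
// parks the producer task instead of letting buffered responses pile up.
fn bounded_block_stream() -> ReceiverStream<Result<Vec<u8>, Status>> {
    // Small buffer: send().await waits once 16 responses are in flight.
    let (tx, rx) = mpsc::channel(16);
    tokio::spawn(async move {
        for height in 0u64..1_000 {
            // Placeholder payload standing in for an encoded CompactBlock.
            let block = height.to_le_bytes().to_vec();
            if tx.send(Ok(block)).await.is_err() {
                break; // client went away; stop producing
            }
        }
    });
    ReceiverStream::new(rx)
}

With a buffer of 16, the producer parks until the client drains the channel, which bounds how much response data can sit in memory per stream.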

conorsch added a commit that referenced this issue Jul 25, 2023
Trying to resolve a memory leak in pd. These manual drops are a first
pass, and appear to reduce memory consumption, but there's still a leak,
according to bytehound. We'll continue to investigate.

Includes a missed `to_proto` that's a nice to have, but likely doesn't-
constitute a fix for our problem.

Refs #2867.

Co-Authored-By: Henry de Valence <hdevalence@penumbralabs.xyz>
conorsch added a commit that referenced this issue Jul 25, 2023 (cherry picked from commit 4b7c386)
@conorsch
Contributor Author

conorsch commented Aug 4, 2023

Recent related changes:

We shipped point releases 0.56.1 and 0.57.1 to evaluate performance improvements. At least one more PR should land in time for 0.58.0 (#2888).

conorsch added a commit that referenced this issue Aug 7, 2023
Prepared by @erwan while working on #2867.
Exported from the preview grafana instance and stored in version control, so it's part of deployments going forward.
@erwanor
Member

erwanor commented Aug 24, 2023

Moving this issue back to Future since we are prioritizing other items. Over the medium term, we should aim for pd to completely isolate the consensus module from inbound load on its RPCs. Although the current performance pattern is far from optimal, it is probably sufficient when combined with network-level rate limiters and load balancers.
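
As a stopgap below the network layer, tonic's server builder also exposes a few in-process knobs for capping how much concurrent work a single connection can create. The values below are illustrative only, not what pd ships:

use std::time::Duration;
use tonic::transport::Server;

// Illustrative values only; pd would still want rate limiting and load
// balancing in front of the node.
fn grpc_server_with_limits() -> Server {
    Server::builder()
        .concurrency_limit_per_connection(32) // cap in-flight requests per connection
        .timeout(Duration::from_secs(60))     // fail requests that exceed a deadline
        .max_concurrent_streams(256u32)       // bound HTTP/2 streams per connection
}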

@erwanor erwanor moved this from In Progress (Already claimed) to Future in Testnets Aug 24, 2023
@aubrika aubrika added this to Penumbra Oct 30, 2023
@github-project-automation github-project-automation bot moved this to 🗄️ Backlog in Penumbra Oct 30, 2023
@hdevalence
Member

Closing as completed since we addressed the memory leaks that were causing the original problem. While there is more work to do, it can be tracked in later issues.

@github-project-automation github-project-automation bot moved this from 🗄️ Backlog to Done in Penumbra Feb 8, 2024