pd doesn't behave well under heavy sync load #2867

Closed
conorsch opened this issue Jul 24, 2023 · 10 comments

@conorsch
Contributor

Today on Testnet 56 we observed a large spike in client traffic to the pd endpoint at https://grpc.testnet.penumbra.zone.

[graph: active-client-conns-2]

As for provenance of the traffic, let's assume it's organic interest, in the form of many people downloading the web extension and synchronizing blocks for the first time. After about an hour or two, memory consumption—in the pd container specifically—balloons to the point that OOMkiller kicks in and terminates the pod. An example of resource consumption shortly before kill:

❯ kubectl top pod penumbra-testnet-fn-0-54d9f68d76-kl9gm --containers
POD                                      NAME   CPU(cores)   MEMORY(bytes)   
penumbra-testnet-fn-0-54d9f68d76-kl9gm   pd     3055m        11249Mi         
penumbra-testnet-fn-0-54d9f68d76-kl9gm   tm     47m          162Mi 

According to the logs, pd is serving a large volume of two types of requests, CompactBlockRange and ValidatorInfoStream:

❯ wc -l logs-penumbra-testnet-fn-0-54d9f68d76-kl9gm.txt
8329 logs-penumbra-testnet-fn-0-54d9f68d76-kl9gm.txt
❯ rg validator_info logs-penumbra-testnet-fn-0-54d9f68d76-kl9gm.txt -c
4639
❯ rg -i compact_block_range logs-penumbra-testnet-fn-0-54d9f68d76-kl9gm.txt -c
2107

Intriguingly, both of those types are Boxed return values in our RPCs. Also intriguing is this comment:

// TODO: eventually, we may want to register joinhandles or something
// and be able to track how many open connections we have, drop them to
// manage load, etc.
//
// for now, assume that we can do c10k or whatever and don't worry about it.

We need to understand why pd consumes large amounts of memory when handling these types of concurrent requests. For now, I'm assuming the traffic is well-formed, honest clients trying to synchronize.
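
For reference, the kind of connection tracking that TODO gestures at could be as small as an owned semaphore permit held for the lifetime of each response stream. The sketch below is hypothetical (the `StreamLimiter` type and the limit are made up, not anything pd implements today):

use std::sync::Arc;

use tokio::sync::{OwnedSemaphorePermit, Semaphore};
use tonic::Status;

/// Hypothetical cap on concurrently open streaming RPCs; pd does not do this today.
#[derive(Clone)]
struct StreamLimiter {
    slots: Arc<Semaphore>,
}

impl StreamLimiter {
    fn new(max_streams: usize) -> Self {
        Self {
            slots: Arc::new(Semaphore::new(max_streams)),
        }
    }

    /// Called by each streaming handler before it builds its response stream;
    /// the permit is moved into the stream so the slot is held until the
    /// stream is dropped (client disconnect or completion).
    fn acquire(&self) -> Result<OwnedSemaphorePermit, Status> {
        self.slots
            .clone()
            .try_acquire_owned()
            .map_err(|_| Status::resource_exhausted("too many open sync streams"))
    }
}

Holding the permit inside the returned stream means the slot is released only when the client goes away or the stream completes, and `Semaphore::available_permits` gives a live count of open connections.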

@conorsch conorsch self-assigned this Jul 24, 2023
@conorsch conorsch moved this to In Progress (Already claimed) in Testnets Jul 24, 2023
@hdevalence
Member

Maybe https://github.com/koute/bytehound would be useful here? I think a good goal would be that pd can maintain many active connections without using excess memory.

@hdevalence
Member

I don't think that the use of an explicit Box is a meaningful indicator of what might be consuming significant per-connection memory. We should investigate using a heap profiler.

On the assumption that each connection is using the same amount of memory (a reasonable prior, though it could certainly not be the case), it might be sufficient to collect a heap profile of what happens when a single client connection is held open. If there's significant per-connection memory overhead, that's a bug, and it could show up without having to load many connections.

conorsch added a commit that referenced this issue Jul 25, 2023
Used this script to generate loadtests against a local pd instance, for
memory profiling efforts. Steps I used to test:

  pd testnet unsafe-reset-all
  pd testnet join https://rpc.testnet-preview.penumbra.zone
  # start pd & tendermint, wait for genesis to complete
  ./deployments/scripts/pcli-client-test -n 1000 -p 100

Refs #2867.
@conorsch
Contributor Author

It seems we do have a memory leak in pd. I can fairly easily get pd to consume a few gigs of memory if I bombard it with sync requests from multiple clients. Using bytehound to profile as recommended above, we see some never-freed allocations:

[bytehound graph: pd-mem-leak-bytehound-1]

We can also show that bytehound believes these are leaks:

[bytehound graph: pd-mem-leak-bytehound-2]

Unfortunately I don't yet have a root cause, but will spend more time with the stack traces and try to piece together a clearer story. The bytehound guide gives a walkthrough about how to perform this kind of investigation. I pushed the testing script I used (very simple loop over pcli operations) for visibility.

@hdevalence
Member

Nice digging!

Not sure if it will be helpful, but one thought about isolating a cause could be:

  • run pd + tendermint
  • shut off tendermint so pd is not processing any consensus messages or doing anything other than serving rpc requests
  • sync pclientd against the stalled pd until chain tip, then turn it off
  • start measuring pd
  • start pclientd, causing it to open a single long-lived connection
  • stop measuring pd

This way, we might be able to get information about what memory is used by a single long-lived connection.

@conorsch
Contributor Author

This way, we might be able to get information about what memory is used by a single long-lived connection.

Ahoy, thar she blows:

[bytehound graph: pd-pclientd-single-conn-leak]

Reading through the stack trace associated with that leak, I see a lot of rocksdb references, so I'm guessing that we're not dropping a db handle in a service worker somewhere.

full stack trace for graphed leak
#00 [libc.so.6] __clone3
#01 [libc.so.6] start_thread
#02 [libstdc++.so.6.0.30] 7f57e02dbc42
#03 [pd] std::_Function_handler<void (), rocksdb::VersionBuilder::Rep::LoadTableHandlers(rocksdb::InternalStats*, int, bool, bool, std::shared_ptr<rocksdb::SliceTransform const> const&, unsigned long)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
#04 [pd] rocksdb::TableCache::FindTable(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::BasicTypedCacheInterface<rocksdb::TableReader, (rocksdb::CacheEntryRole)13, rocksdb::Cache*>::TypedHandle**, std::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, rocksdb::HistogramImpl*, bool, int, bool, unsigned long, rocksdb::Temperature)
#05 [pd] rocksdb::TableCache::GetTableReader(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, bool, bool, rocksdb::HistogramImpl*, std::unique_ptr<rocksdb::TableReader, std::default_delete<rocksdb::TableReader>>*, std::shared_ptr<rocksdb::SliceTransform const> const&, bool, int, bool, unsigned long, rocksdb::Temperature)
#06 [pd] rocksdb::BlockBasedTableFactory::NewTableReader(rocksdb::ReadOptions const&, rocksdb::TableReaderOptions const&, std::unique_ptr<rocksdb::RandomAccessFileReader, std::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::unique_ptr<rocksdb::TableReader, std::default_delete<rocksdb::TableReader>>*, bool) const [clone .localalias]
#07 [pd] rocksdb::BlockBasedTable::Open(rocksdb::ReadOptions const&, rocksdb::ImmutableOptions const&, rocksdb::EnvOptions const&, rocksdb::BlockBasedTableOptions const&, rocksdb::InternalKeyComparator const&, std::unique_ptr<rocksdb::RandomAccessFileReader, std::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::unique_ptr<rocksdb::TableReader, std::default_delete<rocksdb::TableReader>>*, std::shared_ptr<rocksdb::CacheReservationManager>, std::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, int, bool, unsigned long, bool, rocksdb::TailPrefetchStats*, rocksdb::BlockCacheTracer*, unsigned long, std::string const&, unsigned long, std::array<unsigned long, (unsigned long)2>)
#08 [pd] rocksdb::BlockBasedTable::PrefetchIndexAndFilterBlocks(rocksdb::ReadOptions const&, rocksdb::FilePrefetchBuffer*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, rocksdb::BlockBasedTable*, bool, rocksdb::BlockBasedTableOptions const&, int, unsigned long, unsigned long, rocksdb::BlockCacheLookupContext*)
#09 [pd] rocksdb::BlockBasedTable::CreateIndexReader(rocksdb::ReadOptions const&, rocksdb::FilePrefetchBuffer*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, bool, bool, bool, rocksdb::BlockCacheLookupContext*, std::unique_ptr<rocksdb::BlockBasedTable::IndexReader, std::default_delete<rocksdb::BlockBasedTable::IndexReader>>*)
#10 [pd] rocksdb::BinarySearchIndexReader::Create(rocksdb::BlockBasedTable const*, rocksdb::ReadOptions const&, rocksdb::FilePrefetchBuffer*, bool, bool, bool, rocksdb::BlockCacheLookupContext*, std::unique_ptr<rocksdb::BlockBasedTable::IndexReader, std::default_delete<rocksdb::BlockBasedTable::IndexReader>>*)
#11 [pd] rocksdb::BlockBasedTable::IndexReaderCommon::ReadIndexBlock(rocksdb::BlockBasedTable const*, rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<rocksdb::Block>*)
#12 [pd] std::enable_if<((rocksdb::Block_kIndex::kCacheEntryRole)==((rocksdb::CacheEntryRole)13))||(true), rocksdb::Status>::type rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::Block_kIndex>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::Block_kIndex>*, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool, bool, bool) const
#13 [pd] rocksdb::Status rocksdb::(anonymous namespace)::ReadAndParseBlockFromFile<rocksdb::Block_kIndex>(rocksdb::RandomAccessFileReader*, rocksdb::FilePrefetchBuffer*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, std::unique_ptr<rocksdb::Block_kIndex, std::default_delete<rocksdb::Block_kIndex>>*, rocksdb::ImmutableOptions const&, rocksdb::BlockCreateContext&, bool, rocksdb::UncompressionDict const&, rocksdb::PersistentCacheOptions const&, rocksdb::MemoryAllocator*, bool, bool)
#14 [pd] rocksdb::BlockFetcher::ReadBlockContents()
#15 [pd] rocksdb::UncompressSerializedBlock(rocksdb::UncompressionInfo const&, char const*, unsigned long, rocksdb::BlockContents*, unsigned int, rocksdb::ImmutableOptions const&, rocksdb::MemoryAllocator*)
#16 [pd] rocksdb::UncompressBlockData(rocksdb::UncompressionInfo const&, char const*, unsigned long, rocksdb::BlockContents*, unsigned int, rocksdb::ImmutableOptions const&, rocksdb::MemoryAllocator*)
#17 [libstdc++.so.6.0.30] operator new(unsigned long)
#18 [libbytehound.so] malloc [api.rs:294]
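
To make that suspicion concrete, a fix would look roughly like the sketch below: take a snapshot, read what's needed, and release it before awaiting the send, so the handle isn't pinned for however long the client takes to consume the stream. The `Storage`/`Snapshot` types here are stand-ins, not penumbra's real storage API:

use tokio::sync::mpsc;

// Stand-in types, only to make the shape of the change concrete; the real
// types live in penumbra's storage crate and wrap rocksdb resources.
struct Snapshot;
impl Snapshot {
    fn compact_block(&self, _height: u64) -> Vec<u8> {
        Vec::new()
    }
}
struct Storage;
impl Storage {
    fn latest_snapshot(&self) -> Snapshot {
        Snapshot
    }
}

// Read from the snapshot, then drop it *before* awaiting the send, so the
// db handle isn't held open across the (possibly slow) delivery to the client.
async fn send_block(
    storage: &Storage,
    height: u64,
    tx: &mpsc::Sender<Vec<u8>>,
) -> Result<(), mpsc::error::SendError<Vec<u8>>> {
    let snapshot = storage.latest_snapshot();
    let block = snapshot.compact_block(height);
    drop(snapshot); // release the handle eagerly
    tx.send(block).await
}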

I also saw a smaller leak that may be related to tracing instrumentation, or else I'm mistaken in reading the backtrace. Here's a PDF of the full report I generated this morning, mostly for posterity, so these steps can be reproduced in future debugging sessions:
pd-memory-profiling-testnet-56-report-1.pdf

Formatting's a bit wonky. Separately I'll paste in the console code I adapted from the bytehound guide, since that'll be easy to copy/paste in the future.

bytehound memleak console scripting
println("Temporary vs leaked allocations:");
graph()
    .add("Leaked", allocations().only_leaked())
    .add("Temporary", allocations())
    .save();

println("Sort leaks by size:");   
let groups = allocations()
    .only_leaked()
    .group_by_backtrace()
        .sort_by_size();
graph().add(groups).save();


// Function to group leaks based on common backtraces
fn analyze_group(list) {
    let list_all = allocations().only_matching_backtraces(list);
    graph()
        .add("Leaked", list_all.only_leaked())
        .add("Temporary", list_all)
        .save();
    println("Total: {}", list_all.len());
    println("Leaked: {}", list_all.only_leaked().len());
    println();
    println("Backtrace:");
    println(list_all[0].backtrace().strip());
}

println("Analyzing most common leak by backtrace:");
analyze_group(groups[0]);
println("Analyzing second most common leak by backtrace:");
analyze_group(groups[1]);
println("Analyzing third most common leak by backtrace:");
analyze_group(groups[2]);

@conorsch
Contributor Author

conorsch commented Jul 25, 2023

Initial drops in the linked PR seem to help, but aren't sufficient. Encountered a new leak:

[bytehound graph: pd-pclientd-leak-partially-fixed-1]

Stack trace:

#00 [libc.so.6] __clone3
#01 [libc.so.6] start_thread
#02 [pd] std::sys::unix::thread::Thread::new::thread_start [thread.rs:108]
#03 [pd] <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once [boxed.rs:1985]
#04 [pd] <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once [boxed.rs:1985]
#05 [pd] core::ops::function::FnOnce::call_once{{vtable.shim}}
#07 [pd] tokio::runtime::blocking::pool::Inner::run
#08 [pd] tokio::runtime::task::harness::Harness<T,S>::poll
#09 [pd] tokio::runtime::task::core::Core<T,S>::poll
#10 [pd] tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
#11 [pd] <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
#12 [pd] tokio::runtime::scheduler::multi_thread::worker::run
#13 [pd] tokio::runtime::context::runtime::enter_runtime
#14 [pd] tokio::runtime::context::scoped::Scoped<T>::set
#15 [pd] tokio::runtime::scheduler::multi_thread::worker::Context::run
#16 [pd] tokio::runtime::scheduler::multi_thread::worker::Context::run_task
#17 [pd] tokio::runtime::task::harness::Harness<T,S>::poll
#18 [pd] tokio::runtime::task::core::Core<T,S>::poll
#19 [pd] tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
#20 [pd] <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
#21 [pd] <hyper::proto::h2::server::H2Stream<F,B> as core::future::future::Future>::poll
#22 [pd] <hyper::proto::h2::PipeToSendStream<S> as core::future::future::Future>::poll
#23 [pd] <http_body::combinators::map_err::MapErr<B,F> as http_body::Body>::poll_data
#24 [pd] <http_body::combinators::map_err::MapErr<B,F> as http_body::Body>::poll_data
#25 [pd] <http_body::combinators::map_err::MapErr<B,F> as http_body::Body>::poll_data
#26 [pd] <tonic::codec::encode::EncodeBody<S> as http_body::Body>::poll_data
#27 [pd] <T as futures_util::fns::FnMut1<A>>::call_mut
#28 [pd] prost::message::Message::encode
#29 [pd] <penumbra_proto::penumbra::core::chain::v1alpha1::CompactBlock as prost::message::Message>::encode_raw
#30 [pd] prost::encoding::message::encode
#31 [pd] prost::encoding::<impl prost::encoding::sealed::BytesAdapter for alloc::vec::Vec<u8>>::append_to
#32 [pd] bytes::bytes_mut::BytesMut::reserve_inner
#33 [pd] alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
#34 [pd] alloc::raw_vec::finish_grow
#35 [libbytehound.so] realloc [api.rs:378]

Currently pairing with @erwanor.

@hdevalence
Member

The second leak is more mysterious to me, since the allocation is happening inside the Tonic stack, and I'm not sure why it would be growing a BytesMut and then keeping it around. I think it's worth merging a fix for the part of the problem that's clearly in our part of the stack.
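
Independently of the root cause, one hedge worth considering is to make sure streamed responses are produced behind a small bounded buffer, so a slow consumer applies backpressure instead of letting encoded responses accumulate. A minimal sketch with placeholder payloads, not pd's actual handler:

use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;
use tonic::Status;

// Serve a server-streaming RPC from a bounded channel so that a slow client
// parks the producer task instead of letting buffered responses pile up.
fn bounded_block_stream() -> ReceiverStream<Result<Vec<u8>, Status>> {
    // Small buffer: send().await waits once 16 responses are in flight.
    let (tx, rx) = mpsc::channel(16);
    tokio::spawn(async move {
        for height in 0u64..1_000 {
            // Placeholder payload standing in for an encoded CompactBlock.
            let block = height.to_le_bytes().to_vec();
            if tx.send(Ok(block)).await.is_err() {
                break; // client went away; stop producing
            }
        }
    });
    ReceiverStream::new(rx)
}

With a buffer of 16, the producer parks until the client drains the channel, which bounds how much response data can sit in memory per stream.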

conorsch added a commit that referenced this issue Jul 25, 2023
Trying to resolve a memory leak in pd. These manual drops are a first
pass, and appear to reduce memory consumption, but there's still a leak,
according to bytehound. We'll continue to investigate.

Includes a missed `to_proto` that's a nice to have, but likely doesn't-
constitute a fix for our problem.

Refs #2867.

Co-Authored-By: Henry de Valence <hdevalence@penumbralabs.xyz>
conorsch added a commit that referenced this issue Jul 25, 2023 (cherry picked from commit 4b7c386)
@conorsch
Contributor Author

conorsch commented Aug 4, 2023

Recent related changes:

We shipped point releases 0.56.1 and 0.57.1 to evaluate performance improvements. At least one more PR should land in time for 0.58.0 (#2888).

conorsch added a commit that referenced this issue Aug 7, 2023
Prepared by @erwan while working on #2867.
Exported from the preview grafana instance and stored in version control, so it's part of deployments going forward.
@erwanor
Member

erwanor commented Aug 24, 2023

Moving this issue back to Future since we are prioritizing other items. Over the medium term, we should aim for pd to completely isolate the consensus module from inbound load on its RPCs. Although the current performance pattern is far from optimal, it is probably sufficient when combined with network-level rate limiters and load balancers.
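
As a stopgap below the network layer, tonic's server builder also exposes a few in-process knobs for capping how much concurrent work a single connection can create. The values below are illustrative only, not what pd ships:

use std::time::Duration;
use tonic::transport::Server;

// Illustrative values only; pd would still want rate limiting and load
// balancing in front of the node.
fn grpc_server_with_limits() -> Server {
    Server::builder()
        .concurrency_limit_per_connection(32) // cap in-flight requests per connection
        .timeout(Duration::from_secs(60))     // fail requests that exceed a deadline
        .max_concurrent_streams(256u32)       // bound HTTP/2 streams per connection
}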

@erwanor erwanor moved this from In Progress (Already claimed) to Future in Testnets Aug 24, 2023
@aubrika aubrika added this to Penumbra Oct 30, 2023
@github-project-automation github-project-automation bot moved this to 🗄️ Backlog in Penumbra Oct 30, 2023
@hdevalence
Member

Closing as completed since we addressed the memory leaks that were causing the original problem. While there is more work to do, it can be tracked in later issues.

@github-project-automation github-project-automation bot moved this from 🗄️ Backlog to Done in Penumbra Feb 8, 2024