Skip to content
This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Page-pin packet memory for cuda #4250

Merged
merged 10 commits into from
Jun 27, 2019
Merged

Conversation

sakridge
Copy link
Contributor

@sakridge sakridge commented May 10, 2019

Problem

cuda cannot overlap memory transfers and kernels if memory is not page-pinned.

Summary of Changes

use cudaHostRegister and cudaHostUnregister to page-pin memory.

Introduce PinnedVec wrapper class which has its memory pinned by the above cuda functions. It registers on creation, unregister on drop, also on push and resize it checks the pointer and the capacity of the vec for any change and then does the unregister/register if changed.

Add back the Recycler so that these PinnedVec's are re-used since we cannot call cudaHostRegister/cudHostUnregister in the middle of computation since it will cause a GPU sync.

Fixes #4235

@sakridge sakridge force-pushed the pin-packet-memory branch 3 times, most recently from d4e39a8 to 5904b62 Compare May 10, 2019 21:15
@codecov
Copy link

codecov bot commented May 10, 2019

Codecov Report

Merging #4250 into master will decrease coverage by 7%.
The diff coverage is 62.9%.

@@           Coverage Diff            @@
##           master   #4250     +/-   ##
========================================
- Coverage    78.7%   71.7%   -7.1%     
========================================
  Files         192     193      +1     
  Lines       34262   37355   +3093     
========================================
- Hits        26985   26799    -186     
- Misses       7277   10556   +3279

@sakridge sakridge force-pushed the pin-packet-memory branch 10 times, most recently from 79991ff to ade4a8d Compare May 14, 2019 03:57
@sakridge sakridge added the work in progress This isn't quite right yet label May 14, 2019
@sakridge sakridge force-pushed the pin-packet-memory branch from ade4a8d to 7bcc327 Compare May 14, 2019 04:11
@@ -0,0 +1,42 @@
use std::sync::{Arc, Mutex};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guess whos back!

@sakridge sakridge force-pushed the pin-packet-memory branch from 7bcc327 to c839a10 Compare May 14, 2019 22:01
@sakridge
Copy link
Contributor Author

@garious @rob-solana can you guys think of any cleaner way to do this?

@garious
Copy link
Contributor

garious commented May 14, 2019

I'm not able to look at this closely at the moment, but are you aware of this? https://doc.rust-lang.org/std/pin/index.html

@sakridge
Copy link
Contributor Author

sakridge commented May 14, 2019

@garious Hmm. Haven't seen it before, but I don't see a straightforward way to apply it here. Presumably you could return a Pin<Vec<T>> but then you can't do a push or anything on that vec which is a bit annoying.

@@ -804,7 +804,9 @@ mod tests {
#[test]
fn test_banking_stage_entries_only() {
solana_logger::setup();
let (genesis_block, mint_keypair) = create_genesis_block(10);
let (mut genesis_block, mint_keypair) = create_genesis_block(10);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's this change for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The allocator takes longer, so the test needed more time to avoid the first slot running out of ticks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may be able to rearrange some stuff so that the tick start happens closer to the processing of transactions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we pre-allocate some amount in recycler?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, I can try that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out I don't need it since I added a flag to only pin when necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did that work out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flag works so I didn't need to modify the test. I didn't try pre-allocating in the recycler but I didn't see an easy way to do that either.

@rob-solana
Copy link
Contributor

have we root-caused our prior OOM from recycler yet? I didn't think we had...

@sakridge
Copy link
Contributor Author

@rob-solana No, but we have OOM issues now anyway from the regular allocator so I can't make it worse :)

@rob-solana
Copy link
Contributor

I think you might be able to make it worse. Remember OOM in 20-30min?

@sakridge
Copy link
Contributor Author

@rob-solana last time I ran it that was where we are at with a single-node setup today. It's just that we switch leaders so we don't notice it as much now.

@sakridge
Copy link
Contributor Author

Also the blob recycler was I think the biggest problem. This change only needs to use the recycler for Packets and these signature offsets and output buffers. Just anything that goes to the GPU for sigverify.

@rob-solana
Copy link
Contributor

as I recall, we removed the blob recycler and the leaks continued unabated

@sakridge
Copy link
Contributor Author

Ok, well we have to test it out then. I thought of a clever way to add it to minimize change and also allow for easy one-line removal.

@sakridge sakridge force-pushed the pin-packet-memory branch from c839a10 to db5ffc8 Compare May 16, 2019 17:16
@sakridge sakridge force-pushed the pin-packet-memory branch 3 times, most recently from 2f38251 to 84f276a Compare May 29, 2019 21:30
@sakridge sakridge force-pushed the pin-packet-memory branch from 84f276a to 6890dda Compare May 31, 2019 20:39
@sakridge sakridge force-pushed the pin-packet-memory branch 6 times, most recently from d14d147 to 69e8821 Compare June 24, 2019 18:08
@sakridge sakridge removed the work in progress This isn't quite right yet label Jun 25, 2019
sakridge and others added 10 commits June 24, 2019 20:17
Bring back recyclers and pin offset buffers
* Add test for recycler and reduce the gc lock critical section
* Add comments/tests to cuda_runtime
Otherwise buffered becomes millions of packets which causes
memory issues for nodes and problems where forwarding all
those packets can take 10s of seconds.
@sakridge sakridge force-pushed the pin-packet-memory branch from 69e8821 to 69b0ae3 Compare June 25, 2019 03:19
@@ -804,6 +804,9 @@ impl BankingStage {
packet_indexes: Vec<usize>,
) {
if !packet_indexes.is_empty() {
if unprocessed_packets.len() > 400 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pgarg66 I found this has much better memory characteristics for fullnodes with capping the buffered packets size, even with the original code. Any concerns?

@sakridge sakridge requested a review from rob-solana June 25, 2019 03:26
@sakridge sakridge merged commit fbea9d8 into solana-labs:master Jun 27, 2019
@sakridge sakridge deleted the pin-packet-memory branch June 27, 2019 07:32
brooksprumo pushed a commit to brooksprumo/solana that referenced this pull request Jan 8, 2025
Revert "send-transactions: optimize retry pool (solana-labs#31)"

This reverts commit 0abfe27.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Experiment with GPU page-pinned memory
4 participants