Make unified scheduler abort on tx execution errors #1211

Merged (11 commits) on May 31, 2024

Conversation


@ryoqun ryoqun commented May 7, 2024

Problem

As a quick background, the unified scheduler currently doesn't terminate at all. In other words, it leaks shamelessly. :p This means its memory usage is unbounded, and it indeed causes the process to be reaped by the out-of-memory killer after a while (like a week), even while the scheduler is pooled for reuse.

Memory issues aside, it needs a proper shutdown mechanism for other reasons as well. Specifically, there are several termination conditions, each of which it must handle:

  1. The unified scheduler encounters a transaction error while running, and it needs to abort immediately.
  2. One of the handler threads could panic in extreme situations due to newly discovered LoA-like bugs. In that case, the unified scheduler should propagate the panic! promptly to terminate the whole process. This is an edge-case variant of (1). (Note: I'm against resuming normal operation after a panic.)
  3. A unified scheduler can be idling in the scheduler pool for a very long time, and it needs to be disposed of.
  4. Because UsageQueueLoader never evicts unused entries, it can grow too large with many entries. Its design relies on some external mechanism to mitigate the unbounded growth. Thus, the unified scheduler itself needs to be dropped depending on the size of its UsageQueueLoader.
  5. There could be many active (= taken-out-of-the-pool) unified schedulers under very forky network conditions with an extended lack of rooting. In that case, many native OS threads will be created just to sit idling collectively, because each unified scheduler instance creates and manages its own set of threads (for perf reasons). So, idling schedulers should be returned to the pool. Then, those pooled-back schedulers will eventually be retired according to (3) after the forky situation is resolved.

As for (1), there's currently also no way for the unified scheduler to propagate errors back to the caller (the replay stage) before bank freezing. So, dead-block marking by the replay stage could be delayed by maliciously crafted blocks even if the unified scheduler immediately aborts internally.

Summary of Changes

This PR specifically addresses (1).

To that end, this PR makes the new-task code path return Results, so that an error from a previously scheduled transaction is abruptly propagated when unrelated new tasks are about to be submitted to the unified scheduler. This notifies the replay stage earlier than waiting for the block boundary.
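As a rough, simplified sketch of that shape (the type and method names here are illustrative stand-ins, not the actual types this PR touches), submitting a new task surfaces any earlier abort right away:

```rust
use std::sync::Mutex;

// Illustrative stand-ins only; not the real types changed by this PR.
#[derive(Debug, Clone)]
struct TransactionError(String);

struct Scheduler {
    // Set by a handler thread when a previously scheduled transaction fails.
    aborted: Mutex<Option<TransactionError>>,
}

impl Scheduler {
    // Returns Err as soon as any earlier transaction has aborted the scheduler,
    // instead of deferring the error to the block boundary (bank freezing).
    fn schedule_execution(&self, _new_task: &str) -> Result<(), TransactionError> {
        if let Some(err) = self.aborted.lock().unwrap().as_ref() {
            return Err(err.clone());
        }
        // ...otherwise hand the new task off to the handler threads as usual...
        Ok(())
    }
}
```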

After that, this PR introduces crossbeam-channel disconnection-based cross-thread coordination for graceful termination of the unified scheduler instance itself. In this way, there's almost no runtime overhead other than a bunch of additional ifs. Also, this means there's no new potential bottleneck or synchronization. On the other hand, it incurs some code complexity.
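A minimal, self-contained sketch of the disconnection idiom itself, assuming plain crossbeam-channel (this is not the PR's actual thread structure): a handler thread loops on recv() and exits once all senders have been dropped, so no extra shutdown flag or lock sits on the hot path.

```rust
use crossbeam_channel::unbounded;
use std::thread;

fn main() {
    let (task_sender, task_receiver) = unbounded::<String>();

    let handler = thread::spawn(move || {
        // recv() returns Err(RecvError) once every sender has been dropped
        // (i.e. the channel is disconnected); that disconnection doubles as
        // the termination signal, so no shutdown flag is checked per task.
        while let Ok(task) = task_receiver.recv() {
            println!("handling {task}");
        }
        // Channel disconnected: fall through and let the thread be joined.
    });

    task_sender.send("tx #1".to_string()).unwrap();
    drop(task_sender); // disconnect => the handler thread exits its loop
    handler.join().unwrap();
}
```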

Lastly, a pool-owned (i.e. singleton) auxiliary background thread is introduced to actually drop the aborted scheduler.
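A hedged sketch of that shape (the names here are illustrative, not the PR's): the pool spawns a single cleaner thread and merely sends the trashed scheduler to it, so the actual drop never happens on the caller's critical path.

```rust
use crossbeam_channel::{unbounded, Sender};
use std::thread;

// Stand-in for the real scheduler type whose Drop does the expensive teardown.
struct TrashedScheduler;

fn spawn_pool_cleaner() -> Sender<TrashedScheduler> {
    let (sender, receiver) = unbounded::<TrashedScheduler>();
    thread::spawn(move || {
        // Receiving and immediately discarding each value runs its Drop here,
        // off the sender's critical path; the loop ends once the pool (the
        // only sender) is dropped.
        for _trashed in receiver {}
    });
    sender
}
```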

The other termination conditions will be addressed in upcoming PRs.

context: extracted from #1122


codecov-commenter commented May 7, 2024

Codecov Report

Attention: Patch coverage is 91.11675%, with 70 lines in your changes missing coverage. Please review.

Project coverage is 81.6%. Comparing base (af6930d) to head (649bce3).
Report is 111 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #1211    +/-   ##
========================================
  Coverage    81.6%    81.6%            
========================================
  Files         867      868     +1     
  Lines      368900   369593   +693     
========================================
+ Hits       301052   301656   +604     
- Misses      67848    67937    +89     

@ryoqun ryoqun marked this pull request as ready for review May 7, 2024 13:36
@ryoqun ryoqun requested a review from apfitzge May 7, 2024 13:37
@ryoqun ryoqun force-pushed the unified-scheduler-abort-on-error branch from ff13fc3 to a8d8736 on May 14, 2024 00:49

@apfitzge apfitzge left a comment


a few nits. I need to take another pass, specifically to review the trashed scheduler concept.

@ryoqun ryoqun requested a review from apfitzge May 15, 2024 14:32

ryoqun commented May 15, 2024

Thanks for the review! I've addressed all comments so far; re-requesting code review.

@apfitzge

I'm not sure I understand the concept of trashed schedulers.

When we return a scheduler to the pool, if all execution threads are not joined it is considered "trashed". We will clean it up later.
But afaict we always call wait_for_termination before we return to the pool, so I'm uncertain how a scheduler can ever become trashed.

Can you give an example of when a scheduler would become trashed? Think I'm still missing something here.

@ryoqun ryoqun force-pushed the unified-scheduler-abort-on-error branch 7 times, most recently from 7bf597a to 8212761 on May 21, 2024 06:30

ryoqun commented May 21, 2024

> I'm not sure I understand the concept of trashed schedulers.
>
> When we return a scheduler to the pool, if all execution threads are not joined it is considered "trashed". We will clean it up later. But afaict we always call wait_for_termination before we return to the pool, so I'm uncertain how a scheduler can ever become trashed.
>
> Can you give an example of when a scheduler would become trashed? Think I'm still missing something here.

Thanks for the question. Seems more explanation is needed... I intentionally didn't give such an example, to verify that others can decipher my code by themselves. So, I instead added comments rather extensively, as much as possible: 8212761

Could you give it another try to grasp the trashed schedulers?

@ryoqun ryoqun force-pushed the unified-scheduler-abort-on-error branch 2 times, most recently from 450aa8c to 649bce3 on May 26, 2024 14:06

ryoqun commented May 26, 2024

All of the test sleeps are gone: 649bce3


@ryoqun ryoqun May 26, 2024


Dunno how you like this test library, but I'm planning to publish it as a crate in the future. So, the mod name is fancy. ;)

@ryoqun ryoqun force-pushed the unified-scheduler-abort-on-error branch from 649bce3 to 66b9f32 on May 29, 2024 05:59

@apfitzge apfitzge left a comment


Largely happy with the scheduler changes themselves.
The additional sleepless testing utilities are a nice addition; had some nits on the implementation, but since you mentioned this may be split out into a separate crate/library I'm not going to block on it. (would love to be involved with the library if/when you split it out!)

Comment on lines +181 to +183
} else if current().name().unwrap_or_default().starts_with("test_") {
    panic!("seems setup() isn't called yet?");
}


Seems a bit hacky to me to have this rely on Rust's current test-thread naming scheme.
Why not just panic if this is called when the registry is not set up? Is there a reason why at() would be called w/o the thread being registered?

ryoqun (Collaborator Author):


yeah, this could be more strict.

pub(crate) fn at<T: Debug>(check_point: T) {
    let mut registry = THREAD_REGISTRY.lock().unwrap();
    if let Some(progress) = registry.get_mut(&current().id()).cloned() {
        drop(registry);

is this intentional? we drop the lock before the checkpoint is actually changed?

ryoqun (Collaborator Author):


yeah, I think a deadlock is possible otherwise, iirc

Comment on lines +121 to +124
lazy_static! {
    static ref THREAD_REGISTRY: Mutex<HashMap<ThreadId, Arc<Progress>>> =
        Mutex::new(HashMap::new());
}


this is a global. won't it cause contention between all tests using this?

ryoqun (Collaborator Author):


yes, I'm leaning towards switching to thread_local! impl soon.
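For illustration only, a very rough sketch of what such a thread_local! direction could look like, assuming the checkpoint is simply recorded per thread (this is not what this PR ships):

```rust
use std::cell::RefCell;

// A per-thread progress slot instead of a process-wide
// Mutex<HashMap<ThreadId, _>>, so unrelated tests don't contend on one lock.
#[derive(Default)]
struct Progress {
    current_check_point: Option<String>,
}

thread_local! {
    static PROGRESS: RefCell<Option<Progress>> = RefCell::new(None);
}

fn at(check_point: &str) {
    PROGRESS.with(|slot| {
        if let Some(progress) = slot.borrow_mut().as_mut() {
            progress.current_check_point = Some(check_point.to_string());
        }
        // else: setup() hasn't been called on this thread; treat as a no-op
        // (or panic, per the discussion above).
    });
}
```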

Comment on lines +54 to +55
#[derive(Debug)]
struct JustCreated;


From a library perspective, it is not obvious that the user cannot use "JustCreated" as one of their check-points.

ryoqun (Collaborator Author):


+1

Comment on lines +64 to +65
let check_points_set = check_points.iter().collect::<HashSet<_>>();
assert_eq!(check_points.len(), check_points_set.len());


Seems a bit odd that we cannot have duplicate checkpoint names. If I had some sort of state machine, it would seem reasonable to me that I could use at(NAME_OF_STATE) to test the state transitions.

ryoqun (Collaborator Author):


So true. The current impl is too primitive; this is another improvement which I'd like to have when creating a crate.

ryoqun (Collaborator Author):


This limitation will be lifted by this PR: #1797
