Skip to content

Commit

Permalink
[cherrypick] Fix possible (but probably rare) race condition (#19951) (
Browse files Browse the repository at this point in the history
…#19967)

Fix crashes in execution_driver due to inability to execute
transactions.

--------

We must hold the lock for the object entry while inserting to the
`object_by_id_cache`. Otherwise, a surprising bug can occur:

1. A thread executing TX1 can write object (O,1) to the dirty set and
then pause.
2. TX2, which reads (O,1) can begin executing, because
TransactionManager immediately
schedules transactions if their inputs are available. It does not matter
that TX1
   hasn't finished executing yet.
3. TX2 can write (O,2) to both the dirty set and the object_by_id_cache.
4. The thread executing TX1 can resume and write (O,1) to the
object_by_id_cache.

Now, any subsequent attempt to get the latest version of O will return
(O,1) instead of
(O,2).

This seems very unlikely, but it may be more likely under the following
circumstances:
- While a thread is unlikely to pause for so long, moka cache uses
optimistic
lock-free algorithms that have retry loops. Possibly, under high
contention, this
  code might spin for a surprisingly long time.
- Additionally, many concurrent re-executions of the same tx could
happen due to
the tx finalizer, plus checkpoint executor, consensus, and RPCs from
fullnodes.

Unfortunately I have not been able to reproduce this bug, so we cannot
be sure that
this fixes the crashes we've seen. But this is certainly a possible bug.

## Description 

Describe the changes or additions included in this PR.

## Test plan 

How did you test the new or updated feature?

---

## Release notes

Check each box that your changes affect. If none of the boxes relate to
your changes, release notes aren't required.

For each box you select, include information after the relevant heading
that describes the impact of your changes that a user might notice and
any actions they must take to implement updates.

- [ ] Protocol: 
- [ ] Nodes (Validators and Full nodes): 
- [ ] Indexer: 
- [ ] JSON-RPC: 
- [ ] GraphQL: 
- [ ] CLI: 
- [ ] Rust SDK:
- [ ] REST API:
  • Loading branch information
mystenmark authored Oct 22, 2024
1 parent a85ea98 commit 675ea8c
Showing 1 changed file with 28 additions and 6 deletions.
34 changes: 28 additions & 6 deletions crates/sui-core/src/execution_cache/writeback_cache.rs
Original file line number Diff line number Diff line change
Expand Up @@ -463,15 +463,37 @@ impl WritebackCache {
trace!(?object_id, ?version, ?object, "inserting object entry");
fail_point_async!("write_object_entry");
self.metrics.record_cache_write("object");
self.dirty
.objects
.entry(*object_id)
.or_default()
.insert(version, object.clone());

// We must hold the lock for the object entry while inserting to the
// object_by_id_cache. Otherwise, a surprising bug can occur:
//
// 1. A thread executing TX1 can write object (O,1) to the dirty set and then pause.
// 2. TX2, which reads (O,1) can begin executing, because TransactionManager immediately
// schedules transactions if their inputs are available. It does not matter that TX1
// hasn't finished executing yet.
// 3. TX2 can write (O,2) to both the dirty set and the object_by_id_cache.
// 4. The thread executing TX1 can resume and write (O,1) to the object_by_id_cache.
//
// Now, any subsequent attempt to get the latest version of O will return (O,1) instead of
// (O,2).
//
// This seems very unlikely, but it may be possible under the following circumstances:
// - While a thread is unlikely to pause for so long, moka cache uses optimistic
// lock-free algorithms that have retry loops. Possibly, under high contention, this
// code might spin for a surprisingly long time.
// - Additionally, many concurrent re-executions of the same tx could happen due to
// the tx finalizer, plus checkpoint executor, consensus, and RPCs from fullnodes.
let mut entry = self.dirty.objects.entry(*object_id).or_default();

self.cached.object_by_id_cache.insert(
*object_id,
Arc::new(Mutex::new(LatestObjectCacheEntry::Object(version, object))),
Arc::new(Mutex::new(LatestObjectCacheEntry::Object(
version,
object.clone(),
))),
);

entry.insert(version, object);
}

async fn write_marker_value(
Expand Down

0 comments on commit 675ea8c

Please sign in to comment.