fix(core): Fix crashes in execution_driver due to inability to execute transactions #4928

Merged
jkrvivian merged 2 commits into develop from node/fix/crashes_in_execution_driver
Jan 22, 2025

Conversation

jkrvivian

@jkrvivian jkrvivian commented Jan 20, 2025

Description of change

Merge bug fix from upstream: MystenLabs/sui@675ea8c

From upstream PR description:

Fix crashes in execution_driver due to inability to execute transactions.

We must hold the lock for the object entry while inserting into the
object_by_id_cache. Otherwise, a surprising bug can occur:

  1. A thread executing TX1 can write object (O,1) to the dirty set and then pause.
  2. TX2, which reads (O,1), can begin executing, because TransactionManager schedules a transaction as soon as its inputs are available. It does not matter that TX1 hasn't finished executing yet.
  3. TX2 can write (O,2) to both the dirty set and the object_by_id_cache.
  4. The thread executing TX1 can resume and write (O,1) to the object_by_id_cache.

Now, any subsequent attempt to get the latest version of O will return (O,1) instead of
(O,2).
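
The interleaving above can be replayed deterministically on a single thread. The following is a minimal sketch, not the upstream code: `ToyCache`, `write_dirty`, `write_by_id`, `write_locked`, and the two replay functions are hypothetical names, and a single `Mutex` around the whole dirty map stands in for the per-object entry lock that the description refers to.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

type ObjectId = u64;
type Version = u64;

/// Hypothetical stand-in for the cache: `dirty` holds every written version,
/// `by_id` holds what readers treat as the latest version of each object.
#[derive(Default)]
struct ToyCache {
    dirty: Mutex<HashMap<ObjectId, Vec<Version>>>,
    by_id: Mutex<HashMap<ObjectId, Version>>,
}

impl ToyCache {
    /// First half of the racy write: publish to the dirty set, then drop the lock.
    fn write_dirty(&self, id: ObjectId, v: Version) {
        self.dirty.lock().unwrap().entry(id).or_default().push(v);
    }

    /// Second half of the racy write: update the by-id cache as a separate step.
    fn write_by_id(&self, id: ObjectId, v: Version) {
        self.by_id.lock().unwrap().insert(id, v);
    }

    /// Fixed write: the dirty lock stays held until `by_id` has been updated,
    /// so no other writer can slip in between the two inserts.
    fn write_locked(&self, id: ObjectId, v: Version) {
        let mut dirty = self.dirty.lock().unwrap();
        dirty.entry(id).or_default().push(v);
        self.by_id.lock().unwrap().insert(id, v);
    }

    fn latest(&self, id: ObjectId) -> Option<Version> {
        self.by_id.lock().unwrap().get(&id).copied()
    }
}

/// Replays steps 1-4 in the order described above.
fn replay_racy_interleaving() -> Option<Version> {
    let cache = ToyCache::default();
    let o: ObjectId = 7;
    cache.write_dirty(o, 1); // 1. TX1 writes (O,1) to the dirty set, then pauses
    cache.write_dirty(o, 2); // 2-3. TX2 sees (O,1), writes (O,2) to the dirty set...
    cache.write_by_id(o, 2); //      ...and to the by-id cache
    cache.write_by_id(o, 1); // 4. TX1 resumes and overwrites it with stale (O,1)
    cache.latest(o)
}

/// Same two transactions, but each write is atomic with respect to the other.
fn replay_with_lock_held() -> Option<Version> {
    let cache = ToyCache::default();
    let o: ObjectId = 7;
    cache.write_locked(o, 1); // TX1: (O,1) reaches both maps before TX2 can see it
    cache.write_locked(o, 2); // TX2 runs only after TX1 has released the lock
    cache.latest(o)
}
```

In this sketch `replay_racy_interleaving()` returns `Some(1)`, the stale version, while `replay_with_lock_held()` returns `Some(2)`. The sketch always takes the `dirty` lock before the `by_id` lock, which keeps the lock order consistent.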

This seems very unlikely, but it may be more likely under the following circumstances:

  • While a thread is unlikely to pause for so long, moka cache uses optimistic lock-free algorithms that have retry loops. Possibly, under high contention, this code might spin for a surprisingly long time.
  • Additionally, many concurrent re-executions of the same tx could happen due to the tx finalizer, plus checkpoint executor, consensus, and RPCs from fullnodes.
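
To illustrate what an optimistic retry loop looks like in general, here is a minimal sketch using only the Rust standard library. It is not moka's implementation, and `optimistic_increment` and `run_contended` are hypothetical names. Each failed compare-exchange means another thread changed the value first, so the operation starts over; under contention, the number of retries is unbounded in principle.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Increments `counter` with a compare-exchange loop and returns how many
/// attempts failed because another thread changed the value first.
fn optimistic_increment(counter: &AtomicU64) -> u64 {
    let mut retries = 0;
    let mut current = counter.load(Ordering::Relaxed);
    loop {
        match counter.compare_exchange(
            current,
            current + 1,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => return retries,
            Err(observed) => {
                // Lost the race: retry from the value that is actually there.
                current = observed;
                retries += 1;
            }
        }
    }
}

/// Runs `threads` workers that each increment a shared counter `per_thread`
/// times, returning the final counter value and the total number of retries.
fn run_contended(threads: usize, per_thread: u64) -> (u64, u64) {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                (0..per_thread)
                    .map(|_| optimistic_increment(&counter))
                    .sum::<u64>()
            })
        })
        .collect();
    let total_retries: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    (counter.load(Ordering::Relaxed), total_retries)
}
```

Calling `run_contended(4, 1000)` always yields a final counter of 4000; the retry count varies from run to run and depends on how heavily the threads contend.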

Unfortunately, I have not been able to reproduce this bug, so we cannot be sure that
this fixes the crashes we've seen. But this is certainly a possible bug.

Links to any relevant issues

Close #4675

Type of change

  • Bug fix

Change checklist

  • I have followed the contribution guidelines for this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have checked that new and existing unit tests pass locally with my changes

@jkrvivian jkrvivian requested review from a team as code owners January 20, 2025 12:21

vercel bot commented Jan 20, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
apps-backend ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jan 21, 2025 7:41am
apps-ui-kit ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jan 21, 2025 7:41am
rebased-explorer ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jan 21, 2025 7:41am
wallet-dashboard ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jan 21, 2025 7:41am

@iota-ci iota-ci added the core-protocol and node (Issues related to the Core Node team) labels Jan 20, 2025
@jkrvivian jkrvivian self-assigned this Jan 20, 2025
@jkrvivian jkrvivian merged commit 94b801f into develop Jan 22, 2025
39 checks passed
@jkrvivian jkrvivian deleted the node/fix/crashes_in_execution_driver branch January 22, 2025 12:38
Labels
core-protocol node Issues related to the Core Node team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[cherrypick] Fix possible (but probably rare) race condition (#19951)… · MystenLabs/sui@675ea8c
4 participants