Fix compute nodes stuck at bidding #921

Merged
merged 3 commits into from
Oct 21, 2022

wdbaruni
Member

A compute node can bid on a job even after another bid has been accepted and results have been published. The requester node ignores such a bid instead of rejecting it, so the compute node gets stuck in the bidding state with capacity reserved. This PR fixes a couple of bugs to mitigate the issue:

  1. Fix a bug in the compute node's check for whether a shard has reached capacity before bidding on it. Previously it only counted nodes in the bid accepted state; it now also counts nodes in any state after bid accepted, such as completed.
  2. The requester node now returns an InvalidRequest event to the compute node if it bids on a job that is no longer biddable (e.g. completed or at capacity), or if the job doesn't exist (e.g. the requester node restarted).
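
The two fixes above can be illustrated with a minimal sketch. This is not the code from this PR: the `JobState` values, `Job`, `Requester`, `shardAtCapacity`, and `HandleBid` are hypothetical, simplified names chosen for illustration, and the real state machine has more states than shown here.

```go
package main

import "fmt"

// JobState is a hypothetical, simplified stand-in for a node's state on a shard.
type JobState int

const (
	Bidding JobState = iota
	BidRejected
	BidAccepted
	Completed
)

// holdsSlot reports whether a node in this state counts against the shard's
// capacity. Fix 1: states after BidAccepted (here, Completed) count as well.
func holdsSlot(s JobState) bool {
	switch s {
	case BidAccepted, Completed:
		return true
	}
	return false
}

// shardAtCapacity returns true once enough nodes hold a slot on the shard.
func shardAtCapacity(states []JobState, concurrency int) bool {
	n := 0
	for _, s := range states {
		if holdsSlot(s) {
			n++
		}
	}
	return n >= concurrency
}

// BidResponse is a hypothetical event name sent back to the bidding node.
type BidResponse string

const (
	BidOK          BidResponse = "BidAccepted"
	InvalidRequest BidResponse = "InvalidRequest"
)

// Job and Requester are assumptions made for this sketch only.
type Job struct {
	Concurrency int
	States      []JobState
}

type Requester struct {
	jobs map[string]*Job
}

// HandleBid illustrates fix 2: every bid gets an explicit answer, so the
// bidder is never left waiting on a job that is unknown or no longer biddable.
func (r *Requester) HandleBid(jobID string) BidResponse {
	job, ok := r.jobs[jobID]
	if !ok {
		return InvalidRequest // e.g. the requester restarted and lost the job
	}
	if shardAtCapacity(job.States, job.Concurrency) {
		return InvalidRequest // e.g. completed or already at capacity
	}
	job.States = append(job.States, BidAccepted)
	return BidOK
}

func main() {
	r := &Requester{jobs: map[string]*Job{
		"job-1": {Concurrency: 1, States: []JobState{Completed}},
	}}
	fmt.Println(r.HandleBid("job-1"))   // job already completed
	fmt.Println(r.HandleBid("missing")) // job unknown to the requester
}
```

On receiving the invalid-request response, a compute node would be expected to release the capacity it reserved for the bid rather than keep waiting.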

A long-term solution is to implement timeouts in each state so that the compute node fails rather than getting stuck, for example if the requester node is no longer available. This is tracked in #663.
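
As a rough sketch of what a per-state timeout could look like for the bidding state: the compute node waits for a response to its bid and gives up after a deadline. The names `awaitBidResponse`, `ErrBidTimeout`, and `demoTimeout` are hypothetical, and this is not the design tracked in #663.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrBidTimeout is a hypothetical error for a bid that never got an answer.
var ErrBidTimeout = errors.New("no response to bid before timeout")

// demoTimeout is an arbitrary value used only for this example.
const demoTimeout = 50 * time.Millisecond

// awaitBidResponse blocks until the requester answers on responses or the
// timeout elapses, whichever comes first.
func awaitBidResponse(responses <-chan string, timeout time.Duration) (string, error) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case resp := <-responses:
		return resp, nil
	case <-timer.C:
		// The caller can now fail the bid and release reserved capacity.
		return "", ErrBidTimeout
	}
}

func main() {
	// A requester that never answers, e.g. because it is no longer available.
	silent := make(chan string)
	_, err := awaitBidResponse(silent, demoTimeout)
	fmt.Println(err)
}
```

With something along these lines, an unreachable requester would lead to a failed bid instead of capacity being held indefinitely.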

@wdbaruni wdbaruni merged commit 6518704 into main Oct 21, 2022
@philwinder philwinder deleted the job-event-invalid-request branch October 21, 2022 08:46