[ Cherry-Pick ] Fix stuck rebuilds and stuck nexus subsystems #1721
Merged
Conversation
tiagolobocastro (Contributor) commented Aug 14, 2024 (edited)
Force-pushed from 3148374 to b873525
fix(rebuild): ensure comms channel is drained on drop

When the rebuild backend is dropped, we must also drain the async channel. This covers a corner case where a message may be sent at the same time as the backend is being dropped; in that case the sender would hang. This is not a hang in production, where timeouts would eventually cancel the future and allow the drop, but it can still lead to timeouts and confusion.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>
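To illustrate the corner case, here is a minimal sketch of draining a request channel on drop. It assumes a request/reply pattern over tokio channels with hypothetical `RebuildRequest` and `RebuildBackend` types; the io-engine's actual types and channel crate may differ. The point is only that draining on drop releases the reply handle of any message queued concurrently, so the sender errors out instead of hanging.

```rust
use tokio::sync::{mpsc, oneshot};

/// Hypothetical request sent to the rebuild backend; the caller awaits the
/// oneshot reply, which is what hangs if the request is never picked up.
struct RebuildRequest {
    reply: oneshot::Sender<Result<(), String>>,
}

/// Hypothetical backend owning the receiving end of the comms channel.
struct RebuildBackend {
    requests: mpsc::UnboundedReceiver<RebuildRequest>,
}

impl Drop for RebuildBackend {
    fn drop(&mut self) {
        // Stop accepting new messages, then drain anything already queued.
        // Answering (or simply dropping) each pending request drops its
        // oneshot sender, so a waiting caller gets an error immediately
        // instead of waiting for an external timeout to fire.
        self.requests.close();
        while let Ok(req) = self.requests.try_recv() {
            let _ = req.reply.send(Err("rebuild backend dropped".to_string()));
        }
    }
}
```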
Force-pushed from b873525 to 83d28cc
fix(opts): convert adminq poll period to us

This seems to have been mistakenly added as ms. In practice this caused no harm, as the value is not currently being overridden by the helm chart.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>
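A minimal sketch of the unit fix, assuming the poll period is carried as a `Duration` and the consumer expects microseconds; the function name is illustrative, not the io-engine's actual option.

```rust
use std::time::Duration;

/// Illustrative helper: convert a configured poll period into the
/// microsecond value an options struct expects.
fn adminq_poll_period_us(period: Duration) -> u64 {
    // The bug pattern being fixed: using `as_millis()` here would hand the
    // consumer a value 1000x smaller than intended once it is read as µs.
    period.as_micros() as u64
}

fn main() {
    assert_eq!(adminq_poll_period_us(Duration::from_millis(1)), 1_000);
}
```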
dsharma-dc approved these changes Aug 21, 2024
abhilashshetty04 approved these changes Aug 27, 2024
bors merge
bors-openebs-mayastor bot pushed a commit that referenced this pull request Aug 27, 2024:
1721: [ Cherry-Pick ] Fix stuck rebuilds and stuck nexus subsystems r=tiagolobocastro a=tiagolobocastro

fix(nvmx/retire): disconnect failed controllers

When we are pausing the nexus, all IO must get flushed before the subsystem pausing completes. If we can't flush the IO then pausing is stuck forever... The issue we have seen is that when IOs are stuck there's nothing which can fail them and allow pause to complete. One way this can happen is when the controller is failed, as it seems in that case the IO queues are not getting polled. A first fix is to piggyback on the adminq polling failure and use it to drive the removal of the failed child devices from the nexus per-core channels. A better approach may be needed in the future to time out IOs even when no completions are processed in a given I/O qpair.

fix(opts): convert adminq poll period to us

This seems to have been mistakenly added as ms. In practice this caused no harm, as the value is not currently being overridden by the helm chart.

fix(rebuild): ensure comms channel is drained on drop

When the rebuild backend is dropped, we must also drain the async channel. This covers a corner case where a message may be sent at the same time as the backend is being dropped; in that case the sender would hang. This is not a hang in production, where timeouts would eventually cancel the future and allow the drop, but it can still lead to timeouts and confusion.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>
Co-authored-by: Tiago Castro <tiagolobocastro@gmail.com>
Build failed.
hmm, weird build failed error...
bors-openebs-mayastor bot pushed a commit that referenced this pull request Aug 27, 2024:
1721: [ Cherry-Pick ] Fix stuck rebuilds and stuck nexus subsystems r=tiagolobocastro a=tiagolobocastro (same combined commit message as above)
Build failed.
Disabled worker-2 for now... let's try again
bors-openebs-mayastor bot pushed a commit that referenced this pull request Aug 27, 2024:
1721: [ Cherry-Pick ] Fix stuck rebuilds and stuck nexus subsystems r=tiagolobocastro a=tiagolobocastro (same combined commit message as above)
Timed out.
fix(nvmx/retire): disconnect failed controllers

When we are pausing the nexus, all IO must get flushed before the subsystem pausing completes. If we can't flush the IO then pausing is stuck forever... The issue we have seen is that when IOs are stuck there's nothing which can fail them and allow pause to complete. One way this can happen is when the controller is failed, as it seems in that case the IO queues are not getting polled. A first fix is to piggyback on the adminq polling failure and use it to drive the removal of the failed child devices from the nexus per-core channels. A better approach may be needed in the future to time out IOs even when no completions are processed in a given I/O qpair.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>
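A rough sketch of the retire-on-adminq-failure idea, using purely hypothetical types (`CoreChannel`, `Io`, `on_adminq_poll_failure`) rather than the io-engine's actual reactor, channel, or SPDK handles. The shape to note is that the admin-queue poller, which still reports failure when a controller dies, becomes the trigger that fails outstanding IOs on every core so the subsystem pause can complete.

```rust
/// Hypothetical in-flight IO handle; the real one carries SPDK state.
struct Io {
    id: u64,
}

/// Hypothetical per-core channel holding the IOs submitted to one child device.
struct CoreChannel {
    device: String,
    inflight: Vec<Io>,
}

impl CoreChannel {
    /// Retire a failed device from this core's channel: fail every outstanding
    /// IO so a pending nexus pause sees an empty in-flight set and completes,
    /// instead of waiting on completions that will never be polled.
    fn retire_device(&mut self, failed: &str) {
        if self.device == failed {
            for io in self.inflight.drain(..) {
                // Here the real code would complete the IO with an error status.
                println!("failing IO {} on retired device {}", io.id, failed);
            }
        }
    }
}

/// Hook driven by the admin-queue poller: when polling the controller's adminq
/// fails, remove the failed child device from every per-core channel.
fn on_adminq_poll_failure(channels: &mut [CoreChannel], failed_device: &str) {
    for ch in channels {
        ch.retire_device(failed_device);
    }
}

fn main() {
    let mut channels = vec![CoreChannel {
        device: "nvme-child-0".to_string(),
        inflight: vec![Io { id: 1 }, Io { id: 2 }],
    }];
    on_adminq_poll_failure(&mut channels, "nvme-child-0");
}
```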
Force-pushed from 428121f to acb1462
bors merge
Build succeeded.