Scheduler plugins need to be able to distinguish recoverable from irrecoverable errors #2955
Comments
Just to give an example of one specific case:
This should stop at first submission and return some kind of usable exit code that is passed along.
Also, when we fix this, we most likely have to write a In fact, for Slurm, for example, there is an API, so no parsing is really needed. There is even PySlurm. This is probably not true for schedulers in general, but we should focus on at least having full support for PBS and Slurm.
Regarding your second comment, we have this open issue. Note, however, that part of the required code for that issue is indeed shared with this one, i.e. a parser for the
In fact, this issue: #2955 (comment) now keeps the state at
This is a duplicate of #2226, which was closed, but we can continue the discussion in this issue.
+1 for adding parsing to the SLURM plugin to detect this error and return a corresponding exit code with instructions to provide a
+1 Just ran into this again
The engine's exponential backoff retry mechanism for calculation jobs has greatly improved robustness against temporary, recoverable problems such as a lost internet connection or a cluster being down. However, since it is applied indiscriminately to all errors during transport tasks, tasks are sometimes restarted even though they can never succeed. Take, for example, the case where the wrong scheduler parameters are passed as inputs: no matter how often the task is restarted, it will always fail. Retrying the task is futile.
To solve this, the interface between the scheduler plugins and the engine needs to be improved so that plugins can distinguish between recoverable and irrecoverable errors. The engine's default assumption will remain that an error is recoverable with a retry, but with this new option a scheduler plugin can instruct the engine not to retry at all when it encounters certain errors. Since these errors are scheduler and transport specific, the respective plugins should be responsible for identifying them.
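To make the proposal concrete, here is a minimal sketch of how such a distinction could look. It is not the existing engine or plugin API: the names `SchedulerError`, `IrrecoverableSchedulerError`, `check_submit_output` and `retry_with_backoff` are hypothetical, and the strings in `FATAL_MARKERS` are illustrative placeholders rather than a verified list of real scheduler messages.

```python
import time


class SchedulerError(Exception):
    """Hypothetical base class; treated as recoverable by default."""


class IrrecoverableSchedulerError(SchedulerError):
    """Hypothetical error a plugin raises when a retry can never succeed."""


# Assumption: illustrative stderr fragments a plugin might consider fatal.
FATAL_MARKERS = ("Invalid account", "Invalid partition")


def check_submit_output(retval, stderr):
    """Classify the result of a submit command (plugin side)."""
    if retval == 0:
        return
    if any(marker in stderr for marker in FATAL_MARKERS):
        raise IrrecoverableSchedulerError(stderr.strip())
    raise SchedulerError(stderr.strip())


def retry_with_backoff(task, max_attempts=5, initial_interval=1.0,
                       sleep=time.sleep):
    """Call task() until it succeeds, doubling the wait after each failure.

    Every exception is retried, except IrrecoverableSchedulerError,
    which is re-raised immediately without waiting.
    """
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except IrrecoverableSchedulerError:
            raise  # the plugin told us not to bother
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted
            sleep(interval)
            interval *= 2
```

With this split, the engine side keeps its default of retrying everything, and only the plugin needs to know which outputs of its scheduler are fatal. A submission with wrong scheduler parameters would then fail on the first attempt instead of going through the full backoff cycle.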