-
Notifications
You must be signed in to change notification settings - Fork 214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SlurmScheduler
: Detect broken submission scripts for invalid account
#5850
SlurmScheduler
: Detect broken submission scripts for invalid account
#5850
Conversation
8004cdb
to
dfc4da7
Compare
This PR is a follow-up to #5849 which should be reviewed and merged first. This PR then adds a commit on top which implements the new feature for the |
If an invalid combination of the `account` and `partition` options are provided the submission will fail. Currently the scheduler plugin will raise a generic exception causing the expontential backoff mechanism to kick in. This is pointless, however, as the problem is not transient and the submission will always fail, unless the scheduler script is updated, which is not possible, since it breaks provenance. The solution is to make use of the recently introduced feature for the `_parse_submit_output` method to return an instance of `ExitCode` which will trigger the engine to terminate the process. If an invalid account or combination of account and partition are defined, the error will be: Invalid account or account/partition combination specified This error is printed to the `stderr`. When detected, the new exit code `ERROR_SCHEDULER_INVALID_ACCOUNT` is returned. The `ERROR_SCHEDULER_INVALID_ACCOUNT` exit code uses status `131`. The idea is that the range 130 - 139 is reserved for errors that occur when the job script submission fails. The status 130 is kept open for a more general exit code that may be defined in the future.
dfc4da7
to
d0903ac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great, looks good to me as well
|
||
if 'Invalid account' in stderr: | ||
return CalcJob.exit_codes.ERROR_SCHEDULER_INVALID_ACCOUNT | ||
|
||
raise SchedulerError(f'Error during submission, retval={retval}\nstdout={stdout}\nstderr={stderr}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, so an exception-based mechanism is actually in place as well...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but this gets trapped low-down in the expontential backoff mechanism. This was exactly the direct source of the linked issue. This exception was caught and the engine kept retrying until pausing the job. The point of the exit code is that we need to get out of that, and bubble all the way up.
@@ -422,8 +422,14 @@ def _parse_submit_output(self, retval, stdout, stderr): | |||
|
|||
Return a string with the JobID. | |||
""" | |||
from aiida.engine import CalcJob |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could move to top of file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @sphuber. Same comment as already given. Thanks a lot for adding this.
Fixes #2955
If an invalid combination of the
account
andpartition
options areprovided the submission will fail. Currently the scheduler plugin will
raise a generic exception causing the expontential backoff mechanism to
kick in. This is pointless, however, as the problem is not transient and
the submission will always fail, unless the scheduler script is updated,
which is not possible, since it breaks provenance.
The solution is to make use of the recently introduced feature for the
_parse_submit_output
method to return an instance ofExitCode
whichwill trigger the engine to terminate the process. If an invalid account
or combination of account and partition are defined, the error will be:
This error is printed to the
stderr
. When detected, the new exit codeERROR_SCHEDULER_INVALID_ACCOUNT
is returned.The
ERROR_SCHEDULER_INVALID_ACCOUNT
exit code uses status131
. Theidea is that the range 130 - 139 is reserved for errors that occur when
the job script submission fails. The status 130 is kept open for a more
general exit code that may be defined in the future.