SlurmScheduler: Detect broken submission scripts for invalid account
If an invalid combination of the `account` and `partition` options is
provided, the submission will fail. Currently the scheduler plugin
raises a generic exception, causing the exponential backoff mechanism to
kick in. This is pointless, however, as the problem is not transient and
the submission will always fail unless the scheduler script is updated,
which is not possible since it would break provenance.

The solution is to make use of the recently introduced feature that
allows the `_parse_submit_output` method to return an instance of
`ExitCode`, which triggers the engine to terminate the process. If an
invalid account or combination of account and partition is defined, the
error will be:

    Invalid account or account/partition combination specified

This error is printed to `stderr`. When it is detected, the new exit
code `ERROR_SCHEDULER_INVALID_ACCOUNT` is returned.
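
The control flow described above can be sketched without AiiDA installed. This is a minimal illustration, not the plugin's actual code: `ExitCode` and `SchedulerError` here are simplified stand-ins for the real AiiDA classes, the free function `parse_submit_output` is hypothetical, and the job ID parsing is deliberately reduced to taking the last token of `stdout`.

```python
from typing import NamedTuple, Union


class ExitCode(NamedTuple):
    """Simplified stand-in for AiiDA's exit code object (assumption)."""
    status: int
    message: str


class SchedulerError(Exception):
    """Stand-in for the generic scheduler exception that triggers a retry."""


ERROR_SCHEDULER_INVALID_ACCOUNT = ExitCode(131, 'The specified account is invalid.')


def parse_submit_output(retval: int, stdout: str, stderr: str) -> Union[str, ExitCode]:
    """Return the job ID on success, or an exit code for a known permanent failure."""
    if retval != 0:
        # Permanent failure: retrying cannot help, so report an exit code
        # instead of raising and entering the backoff loop.
        if 'Invalid account' in stderr:
            return ERROR_SCHEDULER_INVALID_ACCOUNT

        # Anything else may be transient, so keep raising the generic error.
        raise SchedulerError(f'Error during submission, retval={retval}\nstdout={stdout}\nstderr={stderr}')

    # Simplification: treat the last whitespace-separated token as the job ID.
    return stdout.split()[-1]
```

The point of the split is that only the caller of this function has to distinguish the two outcomes: a returned exit code ends the process, while a raised exception is retried.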

The `ERROR_SCHEDULER_INVALID_ACCOUNT` exit code uses status `131`. The
idea is that the range 130 - 139 is reserved for errors that occur when
the job script submission fails. The status 130 is kept open for a more
general exit code that may be defined in the future.
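
As a rough illustration of the numbering convention, the reserved range can be expressed as a simple membership check. The helper name `is_submission_failure` and the constant are hypothetical and are not part of this commit or of AiiDA.

```python
# Statuses 130 through 139 (inclusive) are reserved for failed job script submissions.
SUBMISSION_FAILURE_STATUSES = range(130, 140)


def is_submission_failure(status: int) -> bool:
    """Return True if the exit status falls in the range reserved for submission failures."""
    return status in SUBMISSION_FAILURE_STATUSES
```

Under this convention, `131` (`ERROR_SCHEDULER_INVALID_ACCOUNT`) falls inside the range, while `120` (`ERROR_SCHEDULER_OUT_OF_WALLTIME`) and `150` (`STOPPED_BY_MONITOR`) do not.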
sphuber committed Dec 20, 2022
1 parent 9309678 commit d0903ac
Showing 3 changed files with 18 additions and 2 deletions.
3 changes: 3 additions & 0 deletions aiida/engine/processes/calcjobs/calcjob.py
@@ -452,6 +452,9 @@ def define(cls, spec: CalcJobProcessSpec) -> None:  # type: ignore[override]
         spec.exit_code(
             120, 'ERROR_SCHEDULER_OUT_OF_WALLTIME', invalidates_cache=True, message='The job ran out of walltime.'
         )
+        spec.exit_code(
+            131, 'ERROR_SCHEDULER_INVALID_ACCOUNT', invalidates_cache=True, message='The specified account is invalid.'
+        )
         spec.exit_code(150, 'STOPPED_BY_MONITOR', invalidates_cache=True, message='{message}')

     @classproperty
6 changes: 6 additions & 0 deletions aiida/schedulers/plugins/slurm.py
@@ -422,8 +422,14 @@ def _parse_submit_output(self, retval, stdout, stderr):
         Return a string with the JobID.
         """
+        from aiida.engine import CalcJob
+
         if retval != 0:
             self.logger.error(f'Error in _parse_submit_output: retval={retval}; stdout={stdout}; stderr={stderr}')
+
+            if 'Invalid account' in stderr:
+                return CalcJob.exit_codes.ERROR_SCHEDULER_INVALID_ACCOUNT
+
             raise SchedulerError(f'Error during submission, retval={retval}\nstdout={stdout}\nstderr={stderr}')

         try:
11 changes: 9 additions & 2 deletions tests/schedulers/test_slurm.py
@@ -16,6 +16,7 @@

 import pytest

+from aiida.engine import CalcJob
 from aiida.schedulers import JobState, SchedulerError
 from aiida.schedulers.plugins.slurm import SlurmJobResource, SlurmScheduler

@@ -416,8 +417,6 @@ def test_joblist_multi(self):

 def test_parse_out_of_memory():
     """Test that for job that failed due to OOM `parse_output` return the `ERROR_SCHEDULER_OUT_OF_MEMORY` code."""
-    from aiida.engine import CalcJob
-
     scheduler = SlurmScheduler()
     stdout = ''
     stderr = ''
@@ -454,3 +453,11 @@ def test_parse_output_valid():
     scheduler = SlurmScheduler()

     assert scheduler.parse_output(detailed_job_info, '', '') is None
+
+
+def test_parse_submit_output_invalid_account():
+    """Test ``SlurmScheduler._parse_submit_output`` returns exit code if stderr contains error about invalid account."""
+    scheduler = SlurmScheduler()
+    stderr = 'Batch job submission failed: Invalid account or account/partition combination specified'
+    result = scheduler._parse_submit_output(1, '', stderr)  # pylint: disable=protected-access
+    assert result == CalcJob.exit_codes.ERROR_SCHEDULER_INVALID_ACCOUNT
