Scheduler plugins need to be able to distinguish recoverable from irrecoverable errors #2955

sphuber · 2019-06-03T09:18:11Z

The engine's exponential backoff retry mechanism for calculation jobs has greatly improved robustness against temporary, recoverable problems such as a lost internet connection or a cluster being down. However, since it is applied indiscriminately to all errors during transport tasks, tasks are sometimes restarted even though they can never succeed. Take, for example, the case where the wrong scheduler parameters are passed as inputs: no matter how often one restarts, the task will always fail, so retrying is futile.

To solve this, the interface between the scheduler plugins and the engine needs to be improved so that plugins can distinguish between recoverable and irrecoverable errors. The engine's default assumption will remain that an error is recoverable with a retry, but with this new option a scheduler plugin can instruct the engine not to bother retrying when it encounters certain errors. Since these errors are going to be scheduler and transport specific, the respective plugins should be responsible for identifying them.
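As a rough illustration of the proposal, here is a minimal standalone sketch. It does not use the actual aiida-core classes: `IrrecoverableSchedulerError`, `retry_with_backoff` and its parameters are hypothetical names invented for this example, and the `SchedulerError` defined here is a stand-in, not the one from `aiida.schedulers`. The idea is only that any error is retried by default, while a plugin can opt out by raising a dedicated exception type.

```python
import time


class SchedulerError(Exception):
    """Stand-in base error: treated as recoverable by default."""


class IrrecoverableSchedulerError(SchedulerError):
    """Hypothetical error a plugin raises when a retry cannot possibly help."""


def retry_with_backoff(task, max_attempts=5, initial_interval=1.0, sleep=time.sleep):
    """Call ``task`` until it succeeds, doubling the wait after each recoverable failure.

    All names and defaults here are assumptions made for the sketch.
    """
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except IrrecoverableSchedulerError:
            # The plugin signalled that retrying is futile: give up immediately.
            raise
        except Exception:
            # Default assumption: the error is recoverable, so retry
            # unless the maximum number of attempts has been reached.
            if attempt == max_attempts:
                raise
            sleep(interval)
            interval *= 2
```

With something like this, a submission that fails because of wrong scheduler parameters would surface after the first attempt, while a dropped connection would still go through the usual backoff.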

espenfl · 2019-06-03T09:25:12Z

Just to give an example of one specific case:

File "/home/efl/work/devel/aiida_core/aiida/schedulers/plugins/slurm.py", line 431, in _parse_submit_output
    "stdout={}\nstderr={}".format(retval, stdout, stderr))
aiida.schedulers.scheduler.SchedulerError: Error during submission, retval=1
stdout=
stderr=sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

06/03/2019 11:01:38 AM <23614> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [WARNING] maximum attempts 5 of calling do_submit, exceeded
06/03/2019 11:01:39 AM <23614> aiida.engine.processes.calcjobs.tasks: [WARNING] submitting CalcJob<832> failed

This should stop at first submission and return some kind of usable exit code that is passed along.
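To make the expected behaviour concrete, here is a small hedged sketch of how the stderr above could be mapped to an exit code. It is self-contained and not the real plugin code: `ExitCode`, `IRRECOVERABLE_PATTERNS` and `classify_submit_failure` are hypothetical names, and the status value is purely illustrative. The only thing taken from the log above is the Slurm error message itself.

```python
import re
from typing import NamedTuple, Optional


class ExitCode(NamedTuple):
    """Hypothetical container for an exit status and a human-readable message."""

    status: int
    message: str


# Hypothetical table of stderr patterns for which a retry can never succeed.
# The status number is an illustrative value, not a documented one.
IRRECOVERABLE_PATTERNS = [
    (
        re.compile(r"Invalid account or account/partition combination specified"),
        ExitCode(131, "the specified account or partition is invalid"),
    ),
]


def classify_submit_failure(retval: int, stderr: str) -> Optional[ExitCode]:
    """Return an exit code for known irrecoverable failures, else ``None``.

    ``None`` means "no known irrecoverable error", so the caller keeps its
    default assumption that the failure is recoverable and may retry.
    """
    if retval == 0:
        return None
    for pattern, exit_code in IRRECOVERABLE_PATTERNS:
        if pattern.search(stderr):
            return exit_code
    return None
```

Anything that does not match a known pattern falls through to `None`, which keeps retrying as the default.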

espenfl · 2019-06-03T09:28:17Z

Also, when we fix this, we will most likely have to write a stdout and stderr scheduler parser of some sort. Since we need this for other purposes, it would be nice to make it modular. For error handlers and cluster monitoring, it would be nice to be able to request the status of the scheduler's stdout and stderr at any given time (meaning we should allow for the possibility of this request going through the transport and being parsed on the cluster side).

In fact, for Slurm, for example, there is an API, so no parsing is really needed. There is even PySlurm. This is probably not true for schedulers in general, but we should focus on at least having full support for PBS and Slurm.

sphuber · 2019-06-03T09:36:32Z

Regarding your second comment, we have this open issue. Part of the required code for that issue is indeed shared with this one, i.e. a parser for the stderr and stdout content returned by scheduler commands. Note, however, that acting on that content while the calculation job is running is quite a different problem, and currently I am not sure how to implement it in the engine. Essentially, when the calculation job is in the stage where it is querying the scheduler for a status update (the update transport task), it should also request the content of the stderr and stdout written on the cluster for that job, perform some parsing and optionally kill the job. This means there needs to be some hook on the CalcJob class that can implement this logic. If this logic is implemented (and not disabled through some input setting on the CalcJob), then the engine can also call this transport task in addition to the UPDATE one. I am not sure whether this should become a "second" transport task that runs at the same time as the UPDATE one, since that might be complicated from the engine's perspective, or whether the UPDATE task needs to be dynamically augmented with an additional step. To be discussed.
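Purely to have something concrete to discuss, this is a toy sketch of the second option, where the update step is augmented with an optional hook. Nothing here is existing engine code: `update_step`, the `Monitor` type and the four callables are hypothetical, and the status strings are placeholders.

```python
from typing import Callable, Optional, Tuple

# Hypothetical hook signature: receives the job's stdout and stderr and returns
# a reason string if the job should be killed, or None to let it continue.
Monitor = Callable[[str, str], Optional[str]]


def update_step(
    query_status: Callable[[], str],
    fetch_output: Callable[[], Tuple[str, str]],
    kill_job: Callable[[], None],
    monitor: Optional[Monitor] = None,
) -> str:
    """One update cycle, optionally augmented with a monitoring step."""
    status = query_status()
    # Without a hook, or when the job is not running, behave like a plain update.
    if monitor is None or status != "running":
        return status
    stdout, stderr = fetch_output()
    reason = monitor(stdout, stderr)
    if reason is not None:
        kill_job()
        return "killed: " + reason
    return status
```

In this form the extra retrieval of stdout and stderr only happens when a hook is defined, so calculation jobs that do not implement it would see no change in behaviour.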

espenfl · 2019-10-02T09:37:58Z

In fact, this issue: #2955 (comment) now keeps the state at Waiting for the calculation. Finally it pauses. Before, it failed after 5 tries, but now with the pause it just hangs, even though no recovery is possible (given the info at hand). With the pause mechanism, it becomes even more important that we address #1925, as this error happens all the time if you move between different systems and accounts. Maybe we should consider putting this in before the release is published, as I expect this issue to appear quite frequently.

sphuber · 2020-10-07T08:30:21Z

This is a duplicate of #2226, which was closed, but we can continue the discussion in this issue.

ltalirz · 2022-03-16T10:46:31Z

+1 for adding parsing to the SLURM plugin to detect this error and return a corresponding exit code, with instructions to provide a metadata.options.queue_name to fix it.

ltalirz · 2022-07-29T16:22:13Z

+1 Just ran into this again

sphuber added type/accepted feature approved feature request priority/nice-to-have topic/engine topic/calc-jobs aiida-core 1.x labels Jun 3, 2019

espenfl mentioned this issue Feb 25, 2020

Perform basic tests of the scheduler interfaces #3805

Open

sphuber removed the aiida-core 1.x label Jun 8, 2020

sphuber self-assigned this Jul 29, 2022

sphuber mentioned this issue Dec 20, 2022

SlurmScheduler: Detect broken submission scripts for invalid account #5850

Merged

sphuber added this to the v2.3.0 milestone Dec 20, 2022

sphuber closed this as completed in #5850 Dec 21, 2022
