Apache Airflow version
Other Airflow 2 version
What happened
We have Airflow (2.2.0) deployed to Kubernetes (1.21.14-gke.700) with the official Airflow Helm Chart (1.6.0). Database is PostgreSQL.
We use KubernetesExecutor.
Recently we upgraded Airflow to the latest (at that point in time) version, 2.3.4 (I was waiting for this version because of the fix in #24356).
Initially the upgrade went fine, but then we noticed that DAGs just stop running new tasks. The tasks get stuck in the queued state. Actions like clearing a run, deleting a run, or triggering a manual run don't help. Only a restart of the Airflow scheduler helps, but then after one day (we have only daily jobs) tasks are stuck again.
After a scheduler restart, the next run of the tasks finishes successfully, but on the 3rd day they are stuck again.
I have logs and some ideas about why this could happen; I will describe them below.
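As a side note for anyone debugging the same symptom, below is a minimal diagnostic sketch (not part of the original report) that lists task instances sitting in the queued state. It assumes the stock Airflow 2.3.x ORM models and is meant to be run where the metadata database connection is already configured.

# Minimal diagnostic sketch (assumption: Airflow 2.3.x ORM models are available and
# the metadata DB connection is configured in this environment).
from datetime import timedelta

from airflow.models import TaskInstance
from airflow.utils import timezone
from airflow.utils.session import create_session
from airflow.utils.state import State

# Treat anything queued for more than an hour as "stuck"; adjust to taste.
cutoff = timezone.utcnow() - timedelta(hours=1)

with create_session() as session:
    stuck = (
        session.query(TaskInstance)
        .filter(TaskInstance.state == State.QUEUED)
        .filter(TaskInstance.queued_dttm < cutoff)
        .all()
    )
    for ti in stuck:
        print(ti.dag_id, ti.task_id, ti.run_id, ti.queued_dttm)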
What you think should happen instead
Tasks run successfully at the scheduled time.
How to reproduce
Upgrade an Airflow 2.2.0 deployment with existing DAGs to version 2.3.4.
Operating System
Windows 11
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
Airflow 2.2.0
Official Airflow Helm Chart 1.6.0
Kubernetes 1.21.14-gke.700
PostgreSQL 13
executor: KubernetesExecutor
Anything else
This problem occurs every 3rd day. We downgraded Airflow back to 2.2.0 until we find out the cause.
In the logs the scheduler writes something like this (only one DAG task is shown below; there are 27 other DAG tasks):
[2022-09-14T08:01:30.656+0000] {dagbag.py:472} DEBUG - Loaded DAG <DAG: MY_DAG_1>
[2022-09-14T08:01:30.746+0000] {processor.py:650} INFO - DAG(s) dict_keys(['MY_DAG_1']) retrieved from /opt/airflow/dags/create_dags_from_configs.py
[2022-09-14T08:01:30.747+0000] {dagbag.py:619} DEBUG - Running dagbag.sync_to_db with retries. Try 1 of 3
[2022-09-14T08:01:30.747+0000] {dagbag.py:624} DEBUG - Calling the DAG.bulk_sync_to_db method
[2022-09-14T08:01:30.966+0000] {scheduler_job.py:354} INFO - 28 tasks up for execution:
<TaskInstance: MY_DAG_1.main scheduled__2022-09-13T04:00:00+00:00 [scheduled]>
...
[2022-09-14T08:01:31.037+0000] {scheduler_job.py:547} INFO - Sending TaskInstanceKey(dag_id='MY_DAG_1', task_id='main', run_id='scheduled__2022-09-13T04:00:00+00:00', try_number=1, map_index=-1) to executor with priority 1 and queue default
[2022-09-14T08:01:31.038+0000] {base_executor.py:91} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'MY_DAG_1', 'main', 'scheduled__2022-09-13T04:00:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/create_dags_from_configs.py']
...
[2022-09-14T08:01:31.054+0000] {base_executor.py:159} DEBUG - 28 running task instances
[2022-09-14T08:01:31.055+0000] {base_executor.py:160} DEBUG - 28 in queue
[2022-09-14T08:01:31.055+0000] {base_executor.py:161} DEBUG - 36 open slots
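A side observation on the three DEBUG lines above (my interpretation of the numbers, not something stated by Airflow itself): the executor reports 28 task instances as already running at the same moment it reports 28 in its queue, and 28 running plus 36 open slots points to a parallelism of 64. A rough sketch of that bookkeeping, with hypothetical task keys, is:

# Rough sketch of the executor bookkeeping behind the three DEBUG lines above
# (an assumption based on the numbers, not a quote of Airflow's code). The
# suspicious part is that keys can be counted as "running" and "in queue" at once,
# which is consistent with the stale-key hypothesis further below.
task_keys = [f"MY_DAG_{i}.main" for i in range(28)]  # hypothetical keys, one per DAG

running = set(task_keys)                         # executor still believes these run
queued_tasks = {key: None for key in task_keys}  # ...while they also wait in its queue

parallelism = 64                      # inferred: 28 running + 36 open slots
open_slots = parallelism - len(running)

print(f"{len(running)} running task instances")  # -> 28
print(f"{len(queued_tasks)} in queue")           # -> 28
print(f"{open_slots} open slots")                # -> 36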
Then, for every DAG task, these error messages appear in the logs:
[2022-09-12T16:32:45.654+0000] {kubernetes_executor.py:469} INFO - Found 2 queued task instances
[2022-09-12T16:32:45.673+0000] {kubernetes_executor.py:510} INFO - TaskInstance: <TaskInstance: MY_DAG_1.main scheduled__2022-08-03T07:00:00+00:00 [queued]> found in queued state but was not launched, rescheduling
[2022-09-12T16:32:46.336+0000] {scheduler_job.py:419} INFO - DAG MY_DAG_1 has 1/16 running and queued tasks
[2022-09-12T16:32:46.337+0000] {scheduler_job.py:505} INFO - Setting the following tasks to queued state:
<TaskInstance: MY_DAG_1.main scheduled__2022-08-03T07:00:00+00:00 [scheduled]>
[2022-09-12T16:32:49.649+0000] {base_executor.py:215} ERROR - could not queue task TaskInstanceKey(dag_id='MY_DAG_1', task_id='main', run_id='scheduled__2022-08-03T07:00:00+00:00', try_number=1, map_index=-1) (still running after 4 attempts)
This repeats until I restart the scheduler and delete all tasks to reschedule them.
At first I thought that this situation was related to #21225, but that issue uses Celery instead of the KubernetesExecutor.
I also suspect this error may be caused by the changes in #23016, since the trigger logic was changed there for Celery but not for the KubernetesExecutor.
All the symptoms are similar to issue #25728. It seems that the task key is never removed from self.running after the task initially reschedules itself. This also happens in our case with the KubernetesExecutor.
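To illustrate the suspected mechanism, here is a simplified sketch (an assumption based on the logs and on #25728, not the actual Airflow executor code): if a key lingers in the executor's running set after the task instance has been reset to scheduled, every later attempt to queue the same key is refused, which matches the repeating "could not queue task" errors above.

# Simplified sketch of the suspected failure mode (assumption, not Airflow's code).
class ToyExecutor:
    def __init__(self):
        self.running = set()      # keys the executor believes are still running
        self.queued_tasks = {}    # keys waiting to be launched

    def queue_command(self, key, command):
        # Guard similar in spirit to the base executor's queue_command:
        # a key that is already known is not queued again.
        if key not in self.queued_tasks and key not in self.running:
            self.queued_tasks[key] = command
        else:
            print(f"could not queue task {key}")

executor = ToyExecutor()
key = ("MY_DAG_1", "main", "scheduled__2022-08-03T07:00:00+00:00", 1)

executor.queue_command(key, ["airflow", "tasks", "run", "MY_DAG_1", "main"])
executor.running.add(executor.queued_tasks.popitem()[0])  # handed off to Kubernetes

# The pod never starts, the scheduler resets the task instance to "scheduled" and
# later tries to queue it again -- but the key was never removed from `running`:
executor.queue_command(key, ["airflow", "tasks", "run", "MY_DAG_1", "main"])
# -> "could not queue task ..." on every scheduler loop, as in the logs above.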
Closing as fixed in 2.4.0. We can always reopen if it is not. @vilozio, please double-check whether this is fixed after migrating to 2.4.0 and comment here.