Apache Airflow version
Other Airflow 2 version
What happened
We have Airflow (2.2.0) deployed to Kubernetes (1.21.14-gke.700) with the official Airflow Helm Chart (1.6.0). Database is PostgreSQL.
We use KubernetesExecutor.
Recently we upgraded Airflow to the latest (at that point in time) version, 2.3.4 (I was waiting for this version because of the fix in #24356).
Initially the upgrade went fine, but then we noticed that DAGs just stop running new tasks. The tasks get stuck in the queued state. Actions like clearing a run, deleting a run, or triggering a manual run don't help. Only a restart of the Airflow scheduler helps, but then after one day (we have only daily jobs) tasks are stuck again.
After a scheduler restart, the next run of the tasks finishes successfully, but on the 3rd day they are stuck again.
I have logs and some ideas about why this could happen; I will describe them below.
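As a side note for anyone debugging the same symptom, below is a minimal diagnostic sketch (not part of the original report) that lists task instances sitting in the queued state. It assumes the stock Airflow 2.3.x ORM models and is meant to be run where the metadata database connection is already configured.

# Minimal diagnostic sketch (assumption: Airflow 2.3.x ORM models are available and
# the metadata DB connection is configured in this environment).
from datetime import timedelta

from airflow.models import TaskInstance
from airflow.utils import timezone
from airflow.utils.session import create_session
from airflow.utils.state import State

# Treat anything queued for more than an hour as "stuck"; adjust to taste.
cutoff = timezone.utcnow() - timedelta(hours=1)

with create_session() as session:
    stuck = (
        session.query(TaskInstance)
        .filter(TaskInstance.state == State.QUEUED)
        .filter(TaskInstance.queued_dttm < cutoff)
        .all()
    )
    for ti in stuck:
        print(ti.dag_id, ti.task_id, ti.run_id, ti.queued_dttm)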
What you think should happen instead
Tasks run successfully at the scheduled time.
How to reproduce
Upgrade an Airflow 2.2.0 deployment with existing DAGs to version 2.3.4.
Operating System
Windows 11
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
Airflow 2.2.0
Official Airflow Helm Chart 1.6.0
Kubernetes 1.21.14-gke.700
PostgreSQL 13
executor: KubernetesExecutor
Anything else
This problem occurs every 3rd day. We downgraded Airflow back to 2.2.0 until we find out the cause.
In the logs the scheduler writes something like this (only one DAG task is shown below; there are 27 other DAG tasks):
[2022-09-14T08:01:30.656+0000] {dagbag.py:472} DEBUG - Loaded DAG <DAG: MY_DAG_1>
[2022-09-14T08:01:30.746+0000] {processor.py:650} INFO - DAG(s) dict_keys(['MY_DAG_1']) retrieved from /opt/airflow/dags/create_dags_from_configs.py
[2022-09-14T08:01:30.747+0000] {dagbag.py:619} DEBUG - Running dagbag.sync_to_db with retries. Try 1 of 3
[2022-09-14T08:01:30.747+0000] {dagbag.py:624} DEBUG - Calling the DAG.bulk_sync_to_db method
[2022-09-14T08:01:30.966+0000] {scheduler_job.py:354} INFO - 28 tasks up for execution:
<TaskInstance: MY_DAG_1.main scheduled__2022-09-13T04:00:00+00:00 [scheduled]>
...
[2022-09-14T08:01:31.037+0000] {scheduler_job.py:547} INFO - Sending TaskInstanceKey(dag_id='MY_DAG_1', task_id='main', run_id='scheduled__2022-09-13T04:00:00+00:00', try_number=1, map_index=-1) to executor with priority 1 and queue default
[2022-09-14T08:01:31.038+0000] {base_executor.py:91} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'MY_DAG_1', 'main', 'scheduled__2022-09-13T04:00:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/create_dags_from_configs.py']
...
[2022-09-14T08:01:31.054+0000] {base_executor.py:159} DEBUG - 28 running task instances
[2022-09-14T08:01:31.055+0000] {base_executor.py:160} DEBUG - 28 in queue
[2022-09-14T08:01:31.055+0000] {base_executor.py:161} DEBUG - 36 open slots
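A side observation on the three DEBUG lines above (my interpretation of the numbers, not something stated by Airflow itself): the executor reports 28 task instances as already running at the same moment it reports 28 in its queue, and 28 running plus 36 open slots points to a parallelism of 64. A rough sketch of that bookkeeping, with hypothetical task keys, is:

# Rough sketch of the executor bookkeeping behind the three DEBUG lines above
# (an assumption based on the numbers, not a quote of Airflow's code). The
# suspicious part is that keys can be counted as "running" and "in queue" at once,
# which is consistent with the stale-key hypothesis further below.
task_keys = [f"MY_DAG_{i}.main" for i in range(28)]  # hypothetical keys, one per DAG

running = set(task_keys)                         # executor still believes these run
queued_tasks = {key: None for key in task_keys}  # ...while they also wait in its queue

parallelism = 64                      # inferred: 28 running + 36 open slots
open_slots = parallelism - len(running)

print(f"{len(running)} running task instances")  # -> 28
print(f"{len(queued_tasks)} in queue")           # -> 28
print(f"{open_slots} open slots")                # -> 36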
Then, for every DAG task, these error messages appear in the logs:
[2022-09-12T16:32:45.654+0000] {kubernetes_executor.py:469} INFO - Found 2 queued task instances
[2022-09-12T16:32:45.673+0000] {kubernetes_executor.py:510} INFO - TaskInstance: <TaskInstance: MY_DAG_1.main scheduled__2022-08-03T07:00:00+00:00 [queued]> found in queued state but was not launched, rescheduling
[2022-09-12T16:32:46.336+0000] {scheduler_job.py:419} INFO - DAG MY_DAG_1 has 1/16 running and queued tasks
[2022-09-12T16:32:46.337+0000] {scheduler_job.py:505} INFO - Setting the following tasks to queued state:
<TaskInstance: MY_DAG_1.main scheduled__2022-08-03T07:00:00+00:00 [scheduled]>
[2022-09-12T16:32:49.649+0000] {base_executor.py:215} ERROR - could not queue task TaskInstanceKey(dag_id='MY_DAG_1', task_id='main', run_id='scheduled__2022-08-03T07:00:00+00:00', try_number=1, map_index=-1) (still running after 4 attempts)
This repeats until I restart the scheduler and delete all tasks to reschedule them.
At first I thought that this situation was related to #21225, but that issue uses Celery instead of the KubernetesExecutor.
I also suspect this error may be caused by the changes in #23016, since the trigger logic was changed there for Celery but not for the KubernetesExecutor.
All the symptoms are similar to issue #25728. It seems that the task key is never removed from self.running after the task initially reschedules itself. This also happens in our case with the KubernetesExecutor.
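To illustrate the suspected mechanism, here is a simplified sketch (an assumption based on the logs and on #25728, not the actual Airflow executor code): if a key lingers in the executor's running set after the task instance has been reset to scheduled, every later attempt to queue the same key is refused, which matches the repeating "could not queue task" errors above.

# Simplified sketch of the suspected failure mode (assumption, not Airflow's code).
class ToyExecutor:
    def __init__(self):
        self.running = set()      # keys the executor believes are still running
        self.queued_tasks = {}    # keys waiting to be launched

    def queue_command(self, key, command):
        # Guard similar in spirit to the base executor's queue_command:
        # a key that is already known is not queued again.
        if key not in self.queued_tasks and key not in self.running:
            self.queued_tasks[key] = command
        else:
            print(f"could not queue task {key}")

executor = ToyExecutor()
key = ("MY_DAG_1", "main", "scheduled__2022-08-03T07:00:00+00:00", 1)

executor.queue_command(key, ["airflow", "tasks", "run", "MY_DAG_1", "main"])
executor.running.add(executor.queued_tasks.popitem()[0])  # handed off to Kubernetes

# The pod never starts, the scheduler resets the task instance to "scheduled" and
# later tries to queue it again -- but the key was never removed from `running`:
executor.queue_command(key, ["airflow", "tasks", "run", "MY_DAG_1", "main"])
# -> "could not queue task ..." on every scheduler loop, as in the logs above.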
Closing as fixed in 2.4.0. We can always reopen if it is not. @vilozio, please double-check whether this is fixed after migrating to 2.4.0 and comment here.