Heartbeat timeout docs #46257

Draft. Wants to merge 24 commits into base: main.

Commits
bb8d47f
Emphasize task heartbeat timeout terminology in docs to match logs
karenbraganz Jan 29, 2025
7ab777a
Grammatical correction
karenbraganz Jan 29, 2025
8aa1f72
Grammatical correction
karenbraganz Jan 29, 2025
bdab0a7
Merge branch 'main' into heartbeat_timeout_docs
karenbraganz Jan 29, 2025
7f92de7
Merge branch 'apache:main' into heartbeat_timeout_docs
karenbraganz Feb 5, 2025
6b8e595
edit docs
karenbraganz Feb 5, 2025
5e0aa2d
redirect URL
karenbraganz Feb 5, 2025
b479fdc
Update docs/apache-airflow/core-concepts/tasks.rst
karenbraganz Feb 5, 2025
1d16f19
Update docs/apache-airflow/core-concepts/tasks.rst
karenbraganz Feb 5, 2025
5eb20ba
Update docs/apache-airflow/core-concepts/tasks.rst
karenbraganz Feb 5, 2025
270ccbf
Merge branch 'apache:main' into heartbeat_timeout_docs
karenbraganz Feb 15, 2025
1e554f8
Merge branch 'apache:main' into heartbeat_timeout_docs
karenbraganz Feb 22, 2025
bdcbeda
Edit docs
karenbraganz Feb 16, 2025
bc09224
Update config.yml with new config names
karenbraganz Feb 16, 2025
09d08fc
Update code to use heartbeat timeout terminology
karenbraganz Feb 24, 2025
6752b77
Update code to include heartbeat timeout terminology
karenbraganz Feb 24, 2025
4424e65
Fix incorrect config name in test_sync_orphaned_tasks
karenbraganz Feb 28, 2025
8887ba2
Merge branch 'main' into heartbeat_timeout_docs
karenbraganz Mar 4, 2025
58a2086
Change task_instance_heartbeat_timeout_threshold to task_instance_hea…
karenbraganz Mar 4, 2025
1409a84
Update supervisor.py
karenbraganz Mar 6, 2025
7339e4b
Merge branch 'main' into heartbeat_timeout_docs
karenbraganz Mar 6, 2025
954ab0e
Merge branch 'apache:main' into heartbeat_timeout_docs
karenbraganz Mar 6, 2025
7a8e846
Merge branch 'apache:main' into heartbeat_timeout_docs
karenbraganz Mar 6, 2025
e586bcb
Merge branch 'main' into heartbeat_timeout_docs
karenbraganz Mar 8, 2025
18 changes: 9 additions & 9 deletions docs/apache-airflow/core-concepts/tasks.rst
@@ -167,25 +167,25 @@ These can be useful if your code has extra knowledge about its environment and w

.. _concepts:zombies:

-Zombie Tasks
-------------
+Task Heartbeat Timeout (Zombie Tasks)
+---------------------------------------

No system runs perfectly, and task instances are expected to die once in a while.

-*Zombie tasks* are ``TaskInstances`` stuck in a ``running`` state despite their associated jobs being inactive
-(e.g. their process did not send a recent heartbeat as it got killed, or the machine died). Airflow will find these
-periodically, clean them up, and either fail or retry the task depending on its settings. Tasks can become zombies for
+``TaskInstances`` may get stuck in a ``running`` state despite their associated jobs being inactive
+(e.g. their local task job did not send a recent heartbeat as it got killed, or the machine died). Such tasks are also known as zombie tasks. Airflow will find these
+periodically, clean them up, and either fail or retry the task depending on its settings. The heartbeat of a local task job can time out for
many reasons, including:

* The Airflow worker ran out of memory and was OOMKilled.
* The Airflow worker failed its liveness probe, so the system (for example, Kubernetes) restarted the worker.
* The system (for example, Kubernetes) scaled down and moved an Airflow worker from one node to another.


-Reproducing zombie tasks locally
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Reproducing task heartbeat timeouts locally
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-If you'd like to reproduce zombie tasks for development/testing processes, follow the steps below:
+If you'd like to reproduce local task job heartbeat timeouts for development/testing processes, follow the steps below:

1. Set the below environment variables for your local Airflow setup (alternatively you could tweak the corresponding config values in airflow.cfg)

@@ -216,7 +216,7 @@ If you'd like to reproduce zombie tasks for development/testing processes, follo
sleep_dag()


-Run the above DAG and wait for a while. You should see the task instance becoming a zombie task and then being killed by the scheduler.
+Run the above DAG and wait for a while. You should see the task experience a heartbeat timeout and then get killed by the scheduler.
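
Note for readers of this diff: the detection that the changed docs describe (a task instance still marked ``running`` whose last heartbeat is older than some threshold) can be illustrated with a small standalone sketch. This is not Airflow's scheduler code. The names `TaskInstanceRecord` and `find_heartbeat_timeouts`, the 300-second threshold, and the sample timestamps are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TaskInstanceRecord:
    """Hypothetical stand-in for a task instance row; not an Airflow class."""

    task_id: str
    state: str
    last_heartbeat: datetime


def find_heartbeat_timeouts(records, now, threshold):
    """Return records still marked 'running' whose last heartbeat is older than `threshold`."""
    cutoff = now - threshold
    return [
        record
        for record in records
        if record.state == "running" and record.last_heartbeat < cutoff
    ]


# Fixed "current time" so the example is deterministic.
now = datetime(2025, 3, 8, 12, 0, 0, tzinfo=timezone.utc)

records = [
    # Heartbeat 10 seconds ago: healthy.
    TaskInstanceRecord("healthy", "running", now - timedelta(seconds=10)),
    # Heartbeat 15 minutes ago while still 'running': heartbeat timeout.
    TaskInstanceRecord("stale", "running", now - timedelta(seconds=900)),
    # Old heartbeat, but the task already finished: not a timeout.
    TaskInstanceRecord("done", "success", now - timedelta(seconds=900)),
]

# Assumed threshold of 300 seconds, for illustration only.
timed_out = find_heartbeat_timeouts(records, now, threshold=timedelta(seconds=300))
print([record.task_id for record in timed_out])
```

Only the record that is both ``running`` and past the cutoff is reported. What happens next (failing or retrying the task) depends on the task's settings, as the docs text above states.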


