Scheduler crashes with psycopg2.errors.DeadlockDetected exception #23361
Comments
Thanks for opening your first issue here! Be sure to follow the issue template! |
I faced the same issue with airflow 2.3.0rc2:

```python
from datetime import datetime

from airflow import DAG
from airflow.decorators import task

with DAG(dag_id="map-reduce", start_date=datetime(2022, 4, 22)) as dag:

    @task
    def add_one(x: int):
        return x + 1

    @task
    def sum_it(values):
        total = sum(values)
        print(f"Total was {total}")

    added_values = add_one.expand(x=[1, 2, 'a'])
    sum_it(added_values)

    added_values_correct = add_one.expand(x=[1, 2, 3])
    sum_it(added_values_correct)
```

Added scheduler logs in attachments. |
The deadlock issue is not the cause of the tasks being set to skipped -- the deadlock occurs when trying to. @abhishekbhakat what error do you get with that? The log you included there doesn't show any error. |
I experience exactly the same issue. Reducing the number of schedulers to one seems to have resolved the issue for my deployment, but now tasks are scheduling a lot slower. I initially had 3 schedulers. Here are my new config variables:

```ini
[scheduler]
job_heartbeat_sec = 30
scheduler_heartbeat_sec = 5
num_runs = -1
scheduler_idle_sleep_time = 1
min_file_process_interval = 30
dag_dir_list_interval = 120
print_stats_interval = 240
pool_metrics_interval = 5
scheduler_health_check_threshold = 60
orphaned_tasks_check_interval = 300.0
scheduler_zombie_task_threshold = 300
catchup_by_default = True
max_tis_per_query = 512
use_row_level_locking = False
max_dagruns_to_create_per_loop = 100
max_dagruns_per_loop_to_schedule = 200
schedule_after_task_execution = True
parsing_processes = 2
file_parsing_sort_mode = modified_time
use_job_schedule = True
allow_trigger_in_future = False
dependency_detector = airflow.serialization.serialized_objects.DependencyDetector
trigger_timeout_check_interval = 15
run_duration = 41460
```
|
It seems that after having passed the
|
Again, some more information about my last comments, because my scheduler crashed again when I cleared a lot of tasks. By the way, I am running behind pgbouncer. Here is the failed log:
|
@V0lantis it would be great to see a few more lines of the stack trace to know where the exception is happening for you. Also, any chance you can take a look at the database log to see what queries are associated with the two processes/transactions? From this we can only see one half of the problem. What Airflow version are you running? And are there any plugins/custom code interacting with Airflow? |
Thank you @tanelk for your quick reaction time!
You can find here (just_scheduler_logs.log) the full stack trace of one of my schedulers with the issue discussed above.
I fortunately found it. (removing some logs because I found a lot of Here are the PostgreSQL logs:
I am running Airflow 2.2.4. There are some custom plugins (not many though), simply plugins which create new dynamic DAGs from a given template. Hope that will help in understanding what the issue is 👍 |
Are these logs from where you have set The The interesting situation is when |
Indeed, it is! My bad then for having set this param, thinking that Postgres would allow it. Thanks for the help, and sorry if I wasted your time 🙏 |
I would advise against doing this while running multiple schedulers -- if you do then it is entirely possible that Airflow will not correctly respect configured concurrency limits for DAGs/Tasks/Pools. Edit: oh, or it will crash |
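For anyone following along, a quick way to confirm which `use_row_level_locking` value a scheduler is actually picking up is to ask Airflow's config object directly. This is a minimal sketch, assuming a working Airflow installation; the environment-variable override shown follows the standard `AIRFLOW__<SECTION>__<KEY>` pattern:

```python
# Print the effective row-level-locking setting the scheduler will use.
from airflow.configuration import conf

use_row_level_locking = conf.getboolean("scheduler", "use_row_level_locking")
print(f"use_row_level_locking = {use_row_level_locking}")

# To re-enable it without editing airflow.cfg, you could export
#   AIRFLOW__SCHEDULER__USE_ROW_LEVEL_LOCKING=True
# before starting the schedulers.
```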
Yep, that's effectively what my Airflow deployment did. I misinterpreted the documentation, sorry. |
@ashb The log file provided has that error at around line 5810. Attaching the error message below:
|
I have been noticing this error as well. PostgreSQL 14 with Airflow 2.3.1. My scheduler containers keep crashing (I run two of them on two different nodes).
|
I'm also receiving this error with a single Has anyone found an effective workaround for this?
|
I'm also seeing this issue, albeit with a slightly different query:
It happens pretty much every time I delete a dag. The only way I've found around it is to browse the dag runs, delete all but one of them, then delete the dag. I'm running multiple schedulers, and Our version path was 2.2.4 -> 2.3.2 (we skipped 2.3.0 and 2.3.1 because of reasons). 2.2.4 is fine, 2.3.2 is not. Anecdotally, DB load appears much higher in 2.3.2 - the box running Postgres is pretty much flat out all the time, versus ~25% previously. I don't have hard numbers, because several things changed at once in our testing env (mea culpa), but I will at some point be upgrading another env, and I'll be watching closely. |
N.B. I've tried shutting down all but one scheduler, and that also fixes it. 2 schedulers running: delete fails, as per above. 1 scheduler running: delete works. Is row-locking broken somehow in 2.3.2? |
Having this issue as well, and we're only on 2.0.2. Looking through some of the Airflow configs, I'm wondering if it could be related to the file_parsing_sort_mode? It seems like the default behavior could cause the schedulers to always start at the same place when parsing dag files, and I could see that causing a deadlock at scale. I'm wondering if flipping to This option doesn't seem to be available in our version, does someone want to try it out for us 😃 |
Our DB load has increased a lot as well. I actually swapped from a Docker-based PostgreSQL (mounted volume) to installing it directly on a server instead, to see if it would help. I also implemented pgbouncer and a bunch of other changes, so I do not have any useful data or information due to all the troubleshooting and changes. |
Does anyone know if it comes from the Webserver, Schedulers, or Workers? I know it's a difficult question, but maybe someone happens to have a log of queries from before/after and could make a simple stat of what has changed? |
We've made some progress in at least reducing the number of deadlocks. We're running 2.0.2 on K8s and we've discovered the following:
We've been able to reduce deadlocks almost entirely simply by adding a startupProbe to the scheduler deployment in K8s and telling K8s to only roll out schedulers one at a time, to avoid them starting at the same time. When they started at the same time, all of them would deadlock, and running tasks would get killed and rescheduled, etc. Rolling out one at a time has almost entirely removed deadlocking, and the few times it does happen it's isolated to one scheduler while the other schedulers can keep things moving. The fact it happens more frequently when starting schedulers at the same time makes me think it might be related to the |
@ldacey @whitleykeith @argibbs @eitanme -> I spoke with some enlightened people :) (yeah, talking about you @ashb and @bbovenzi) -> and after the talk I have a hypothesis that this is the new Grid view doing auto-refresh for a long-running DAG. There was a fix by @ashb, #24284, that is going to be released in 2.3.3, which significantly decreases the number of queries generated by the Grid view refresh. It's a huge improvement and might impact both the load on the DB and possibly the memory usage of the webserver - especially if there are almost continuously running dags and a number of people leave the browser open with "auto-refresh" on in the Grid view. Is there a way some of you could test the hypothesis and see if there might be a correlation (requires a bit of coordination on what your users do)? (BTW, if that's it, then Ash's fix is coming in 2.3.3.) |
Hah, spooky. Without boring you with the details, I was running some-many-lots of DAGs from scratch today. I noticed that DB load increased with the number of active dags, and then experimented to see if it was the dags themselves or just because I was watching them all. Turning off auto-refresh seemed to massively improve DB load, and I just came here to update the thread, only to find you've beaten me to it. As an aside, what made me look at auto-refresh (@bbovenzi) was that the spinner is always spinning when auto-refresh is turned on. In the old view, the refresh dots would only animate every few seconds when it polled for an updated state. I don't know if the always-spinning thing means it's always refreshing, or if it's just meaningless. But long story short, yes, auto-refresh smells fishy to me. |
It's just always spinning but good point. I can fix that. Is the DB load still bad with the latest on main? If needed I can look into other ways to improve the autorefresh performance. |
I haven't tried with latest main, just 2.3.2; I tend to only run with the release builds, but I can try main if it'll help. |
I am also encountering this issue. I collected some details about both queries involved in the deadlock, hopefully this is helpful. Deployment details:
In the deadlocks there is an UPDATE statement deadlocking with a SELECT ... FOR UPDATE. Based on stack traces visible in the scheduler logs, the UPDATE originates from the main scheduler loop here: Based on the database logs, the SELECT statement has the form:
Searching the Airflow source code, the query that looks most similar to the SELECT from the database error is in Example scheduler logs showing the origins of the UPDATE statement:
Example Postgres logs showing a complete SELECT ... FOR UPDATE statement:
Unfortunately we are not able to repro this on a test instance, so I have not been able to try on newer Airflow versions, but based on the discussion in this thread it sounds like the issue is still present in at least 2.3.2. |
Very useful, thanks. I will take a look at it shortly. |
We have figured out the origins of the SELECT ... FOR UPDATE and a mechanism for the deadlocks. The short story is it originates from the The SELECT does indeed originate from
Line numbers both in the synopsis above and in the stack trace below are for Airflow 2.2.5. Stack traces including the SELECT statements can be found in failed pod logs; I have included one below:
|
Regarding how the deadlock happens, as we know, we have two statements: UPDATE and SELECT ... FOR UPDATE that cause a deadlock. As described previously, the UPDATE statement is here https://github.com/apache/airflow/blob/2.2.5/airflow/models/dagrun.py#L903-L909. It needs to get a shared lock on matching rows of In the meantime, the SELECT ... FOR UPDATE statement Regarding how this might be fixed, the suspicion is |
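To make the lock-ordering point above concrete, here is a small standalone sketch (not Airflow code; the DSN, table names, and timings are hypothetical) that reproduces a `psycopg2.errors.DeadlockDetected` by having two transactions lock the same two rows in opposite order, and that serializes cleanly when the order is made consistent:

```python
# Standalone illustration of the deadlock mechanism described above: two transactions
# lock the same two rows in opposite order. Requires a reachable Postgres instance;
# the DSN and table names are made up for this example.
import threading
import time

import psycopg2

DSN = "dbname=test user=postgres password=postgres host=localhost"  # adjust as needed


def setup():
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute("DROP TABLE IF EXISTS ti_demo, dr_demo")
        cur.execute("CREATE TABLE dr_demo (id int PRIMARY KEY, state text)")
        cur.execute("CREATE TABLE ti_demo (id int PRIMARY KEY, state text)")
        cur.execute("INSERT INTO dr_demo VALUES (1, 'running')")
        cur.execute("INSERT INTO ti_demo VALUES (1, 'queued')")


def scheduler_like():
    # Scheduler order: lock the "DagRun" row first, then touch the "TaskInstance" row.
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT * FROM dr_demo WHERE id = 1 FOR UPDATE")
        time.sleep(1)  # let the other transaction grab its first lock
        cur.execute("UPDATE ti_demo SET state = 'scheduled' WHERE id = 1")


def task_runner_like(reverse_order=True):
    # Task-runner order (reverse_order=True): lock the "TaskInstance" row first, then
    # the "DagRun" row -> Postgres aborts one transaction with DeadlockDetected.
    # With reverse_order=False both transactions take locks in the scheduler's order
    # and simply serialize instead of deadlocking.
    first, second = ("ti_demo", "dr_demo") if reverse_order else ("dr_demo", "ti_demo")
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute(f"SELECT * FROM {first} WHERE id = 1 FOR UPDATE")
        time.sleep(1)
        cur.execute(f"SELECT * FROM {second} WHERE id = 1 FOR UPDATE")


if __name__ == "__main__":
    setup()
    t1 = threading.Thread(target=scheduler_like)
    t2 = threading.Thread(target=task_runner_like)
    t1.start(); t2.start()
    t1.join(); t2.join()  # one of the two threads raises DeadlockDetected
```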
cc: @ashb - I have not had a look at this yet (will shortly), but maybe you can take a look and see if any of this rings a bell. |
I'm able to reliably force a DeadlockDetected exception by creating a DAG with many (i.e. over a hundred) concurrent tasks, setting |
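For reference, a DAG along the lines @RNHTTR describes (a single DAG fanning out into well over a hundred trivially fast parallel tasks) can be sketched as below. This is illustrative only: the dag_id, task count, and schedule are arbitrary choices, and the concurrency settings referenced above (elided in the thread) would still need to be tuned for a given deployment:

```python
# Illustrative fan-out DAG with ~150 trivial parallel tasks; names and counts are
# hypothetical, not taken from the report above.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def do_nothing(i):
    return i


with DAG(
    dag_id="deadlock_repro_fanout",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    for i in range(150):
        PythonOperator(
            task_id=f"noop_{i}",
            python_callable=do_nothing,
            op_args=[i],
        )
```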
Thanks for the REALLY detailed investigation @dstaple. I finally had some time to take a look at this and I think your assessment was very correct. However the solution you proposed is not good, because I think we DO want to run "SELECT FOR UPDATE" on the DagRun table. The whole scheduling is based on the fact that the DagRun row gets locked and no changes happen to the DagRun or any TaskInstances of that DagRun while the Scheduler processes those task instances. And since And in this case the "local_task_run" actually locks the DagRun table too (though I am not entirely sure why; this is one thing that I do not understand completely - see below). So it does what it should, but with one very little caveat - it locks the TaskInstance and DagRun in REVERSE ORDER compared to what the Scheduler does. This is actually the root cause of ALL deadlocks (at least in Postgres; MySQL has its own fair share of other kinds of deadlocks) - inconsistent ordering. The deadlock appears when two threads want two (or more) resources and take locks on them in reverse order. This is actually the only reason for any kind of deadlock, and your investigation showed really nicely what's going on. The solution to that is simple - since we are going to get the DagRun lock in a moment anyway in "refresh_from_db", we should simply get the lock on the DagRun table FIRST. This should fix the problem, as we will then grab locks in the same sequence in the scheduler and task_run -> first DagRun, then TaskInstance. This is what my proposed #25266 does. The only thing I do not know is WHY the
The original query in the code looks like this:
And there is no obvious reason why the last line joins the dag_run table? I hope someone else in this thread might shed some light on it. I have a suspicion that SQLAlchemy will add the join in case there is a ForeignKey with ON DELETE CASCADE involving the dag_id (which we have) - but I could not find any reference or documentation that would point to such behaviour. @RNHTTR - since you mentioned you can reproduce the issue - maybe you could apply my fix and see if it solves the problem (there is a bit of a leap of faith with this change). |
Not related to deadlocks, however: is it necessary to use |
Can you do it cross-db (MySQL/Postgres/MsSQL?) and in SQLAlchemy? @Taragolis? |
Also see the discussion I started today - https://lists.apache.org/thread/pgo1qxqsnzwg2do955961tkvxk76m8gw - we have enough trouble with "some" DB features not being available in some databases, so we should strive for the lowest common denominator, I am afraid. |
@potiuk Thanks a ton for the feedback! Regarding why the dag_run table is locked when This behavior is described in the following sqlalchemy issue: The behavior was not changed after the above issue was filed, but the following warning was added to the SQLAlchemy documentation:
|
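To illustrate the mechanism @dstaple describes without touching Airflow's own models, here is a small SQLAlchemy sketch (assuming SQLAlchemy 1.4+; the class and table names are stand-ins, not the real Airflow models) in which a relationship configured with lazy="joined" causes a plain with_for_update() query against the child table to also emit the join against, and the lock request on, the parent table:

```python
# Stand-in models demonstrating how a joined-eager-loaded relationship combined with
# with_for_update() produces "SELECT ... FROM child JOIN parent ... FOR UPDATE",
# i.e. a lock request that also covers the parent table's rows.
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class DagRunDemo(Base):
    __tablename__ = "dag_run_demo"
    id = Column(Integer, primary_key=True)
    state = Column(String)


class TaskInstanceDemo(Base):
    __tablename__ = "task_instance_demo"
    id = Column(Integer, primary_key=True)
    dag_run_id = Column(Integer, ForeignKey("dag_run_demo.id"), nullable=False)
    # Joined eager loading, analogous to what is described above.
    dag_run = relationship("DagRunDemo", lazy="joined", innerjoin=True)


# No database connection is needed just to inspect the generated SQL.
session = Session()
query = (
    session.query(TaskInstanceDemo)
    .filter(TaskInstanceDemo.id == 1)
    .with_for_update()
)
print(query)
# The rendered SELECT joins dag_run_demo and ends with FOR UPDATE, so the lock is
# requested on rows of both tables, not only task_instance_demo.
```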
I do not actually know whether it is supported by other DB engines or whether it has exactly the same behaviour. It definitely works with all modern PostgreSQL (9.3+) and SQLAlchemy; basically we need to set One of the differences is that with Might be |
Ahhhhh. That would indeed explain it. I think then that my solution is actually the right approach :) |
@Taragolis would be worth checking. The DagRun lock `SELECT FOR UPDATE SKIP LOCKED` is very much the "key" (pun intended) to making multiple schedulers work, and it also (as you can see) spilled a bit into the mini-scheduler and task run, in the form of a plain `SELECT FOR UPDATE`. The `SELECT FOR UPDATE SKIP LOCKED` is precisely the mechanism that allows multiple schedulers to run in parallel with basically no serialization and no "accidental state overrides". And we need to make sure that it works - for MySQL 8 and Postgres, because this is our "baseline". We cannot rely on Postgres-only features, though we would love to - I started some threads in the past, mostly along the lines of "we are fed up with MySQL, let's dump it". See for example this "Elephant in the Room" thread on the devlist: https://lists.apache.org/thread/dp78j0ssyhx62008lbtblrc856nbmlfb . The answer so far, and the wisdom of the crowd, is "No, as much as we would like to, we cannot get rid of MySQL". And if you look at the results of our survey https://airflow.apache.org/blog/airflow-survey-2022/ - while Postgres is by far the strongest (also because it is now the only supported DB for ALL managed Airflow services), there are still ~20% of people who use MySQL (or MariaDB, but we finally decided that we explicitly exclude MariaDB from supported databases and actively encourage people to migrate off it if they use it). So while I would love to start the discussion with "Can we use this Postgres feature?", when we think about product development the question is "Is this feature supported in both Postgres AND MySQL 8+?". If not, we won't even discuss it, because if we start actively using Postgres-only features to optimize stuff, we are going to impair our MySQL users, and eventually we will implement things that only work for Postgres and behaviours that differ between Postgres and MySQL - and we certainly do not want that. Something that looks good for you as a user (using Postgres only) might not be acceptable for the product (with both Postgres and MySQL being supported). I looked (very briefly) to see whether a similar feature exists in MySQL, and it seems not, but I did not look too closely. But if you think it is worth considering, then a deeper investigation justifying both the benefits and the cross-db portability is something I would advise you to start with :). I think your question is phrased a bit wrongly:
It should rather be: "I see that we can use that new NO KEY feature in Postgres and an equivalent in MySQL. It has such-and-such benefits and we can make it cross-db - doc here, doc here". Is this a good enough reason to switch to it? |
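As a side note for readers, the SKIP LOCKED behaviour described above is easy to demonstrate outside Airflow. This is a minimal sketch (hypothetical DSN and table, assuming a reachable Postgres) of two "schedulers" each picking a different row instead of blocking on, or deadlocking with, the other:

```python
# Minimal demonstration of SELECT ... FOR UPDATE SKIP LOCKED: two concurrent
# transactions each grab a different unlocked row instead of contending for the
# same one. DSN and table name are hypothetical.
import psycopg2

DSN = "dbname=test user=postgres password=postgres host=localhost"  # adjust as needed

conn_a = psycopg2.connect(DSN)
conn_b = psycopg2.connect(DSN)

with conn_a.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS dr_demo")
    cur.execute("CREATE TABLE dr_demo (id int PRIMARY KEY, state text)")
    cur.execute("INSERT INTO dr_demo VALUES (1, 'running'), (2, 'running')")
conn_a.commit()

# "Scheduler A" locks the first available row and keeps its transaction open.
cur_a = conn_a.cursor()
cur_a.execute("SELECT id FROM dr_demo ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED")
print("A locked:", cur_a.fetchall())  # [(1,)]

# "Scheduler B" runs the same query concurrently and simply skips A's locked row.
cur_b = conn_b.cursor()
cur_b.execute("SELECT id FROM dr_demo ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED")
print("B locked:", cur_b.fetchall())  # [(2,)]

conn_a.rollback()
conn_b.rollback()
conn_a.close()
conn_b.close()
```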
This PR attempts to fix the deadlock that occurs when a task instance is being run in parallel to a running _do_scheduling operation executing get_next_dagruns_to_examine. The whole scheduling is based on actually locking the DagRuns the scheduler operates on - which basically means that the state of ANY task instance for that DagRun should not change during scheduling. However, there are some cases where a task instance is locked FOR UPDATE without prior locking of the DagRun table - this happens for example when a local task job executes the task and runs the "check_and_change_state_before_execution" method on the task instance it runs. There is no earlier DagRun locking happening, and "refresh_from_db" run with lock_for_update will get the lock on both the TaskInstance row and the DagRun row. The problem is that this locking happens in reverse sequence in this case:

1) get_next_dagruns_to_examine locks the DagRun first and THEN tries to lock some task instances for that DagRun
2) "check_and_change_state_before_execution" effectively runs the query: select ... from task_instance join dag_run ... for update, which FIRST locks the TaskInstance and then the DagRun table.

This reverse sequence of locking is what causes the deadlock. The fix is to force locking the DagRun before running the task instance query that joins dag_run to task_instance.

Fixes: apache#23361
@potiuk I re-read your explanation and agree completely; the essence of the specific deadlock I did the deep-dive on above is Regarding @RNHTTR repro'ing a deadlock by deleting a running DAG run: it might be related, but it isn't exactly the same situation. At a minimum, the deadlock forced by @RNHTTR probably involves a DELETE statement rather than an UPDATE. Indeed, @argibbs reported deadlocks when deleting DAGs, and they had DELETE statements for one of the queries (the other was not reported). It's possible that the "other query" in that case was the same SELECT ... FOR UPDATE from Regarding I can check your alternative proposed fix #25266 in a staging environment, but it may be a week or two before I can deploy it more widely and conclusively report on whether or not it fixes the scheduler UPDATE deadlocking with the task instance SELECT ... FOR UPDATE above. |
@dstaple - it would be great if you could check. I think we can merge it regardless (it's super easy to revert), so there is no problem with some later checking. I also was not sure if the DELETE issue is the same. It could be (and I have the scenario in my head): DELETE DagRun with CASCADE on TI first creates a lock on the DagRun and only THEN on the TaskInstance - very similarly to what the Scheduler does. And in this case the fix above should also help, so @RNHTTR I'd appreciate checking it :) |
I'm on it! |
I have an update on the above. We are running a fork of 2.2.5 with the fix from #25312 cherry-picked and are no longer seeing the specific deadlock outlined above. We were seeing deadlocks every ~2h when running with vanilla Airflow 2.2.5, so this is a big improvement. Other types of deadlocks are still possible, and we have observed a handful of these even with the fix to
I believe it is also possible to get deadlocks when manually (via the UI) updating the state of large numbers of tasks, deleting DAG runs while they have many running tasks, and other invasive methods, but I don't have logs to show for that at the moment. I think it will be difficult to completely eliminate deadlocks as long as we have situations where multiple tables need to be simultaneously locked, and we don't standardize the order in which such mutual locks are applied across the Airflow codebase. Brainstorming along these lines, one approach that could reduce the potential for deadlocks would be to favor database access via methods like Regardless I would like to stress that we were seeing deadlocks every ~2h and are now seeing them rarely and under different circumstances. So the fix to |
Hi @eitanme, I am also using LocalExecutor and seeing deadlocks here and there. I am considering switching off Thanks |
Upgrading would be best, as this appears to have been fixed about a year ago in 2.3.4. Note that it's only safe to disable |
Fwiw, I believe we have been able to run without issue using the |
The specific deadlock I raised in this issue was fixed, yes, which was a big win. I was happy to help with that. Unfortunately there are still other types of deadlocks that can occur. See for example #27473 , which was closed for some reason. |
Apache Airflow version
2.2.5 (latest released)
What happened
A customer has a DAG that generates around 2500 tasks dynamically using a task group. While running the DAG, a subset of the tasks (~1000) run successfully with no issue and ~1500 of the tasks get "skipped", and the DAG fails. The same DAG runs successfully in Airflow v2.1.3 with the same Airflow configuration.
While investigating the Airflow processes, we found that both schedulers got restarted with the below error during the DAG execution.
This issue seems to be related to #19957
What you think should happen instead
This issue was observed while running a huge number of concurrent tasks created dynamically by a DAG. Some of the tasks get skipped due to the scheduler restarting with a deadlock exception.
How to reproduce
DAG file:
Operating System
kubernetes cluster running on GCP linux (amd64)
Versions of Apache Airflow Providers
pip freeze | grep apache-airflow-providers
apache-airflow-providers-amazon==1!3.2.0
apache-airflow-providers-cncf-kubernetes==1!3.0.0
apache-airflow-providers-elasticsearch==1!2.2.0
apache-airflow-providers-ftp==1!2.1.2
apache-airflow-providers-google==1!6.7.0
apache-airflow-providers-http==1!2.1.2
apache-airflow-providers-imap==1!2.2.3
apache-airflow-providers-microsoft-azure==1!3.7.2
apache-airflow-providers-mysql==1!2.2.3
apache-airflow-providers-postgres==1!4.1.0
apache-airflow-providers-redis==1!2.0.4
apache-airflow-providers-slack==1!4.2.3
apache-airflow-providers-snowflake==2.6.0
apache-airflow-providers-sqlite==1!2.1.3
apache-airflow-providers-ssh==1!2.4.3
Deployment
Astronomer
Deployment details
Airflow v2.2.5-2
Scheduler count: 2
Scheduler resources: 20AU (2CPU and 7.5GB)
Executor used: Celery
Worker count : 2
Worker resources: 24AU (2.4 CPU and 9GB)
Termination grace period : 2mins
Anything else
This issue happens in all the DAG runs. Some of the tasks get skipped, some succeed, and the scheduler fails with the deadlock exception error.
Are you willing to submit PR?
Code of Conduct