Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor bulk_save_to_db #42245

Merged
merged 9 commits into from
Sep 20, 2024
Merged

Refactor bulk_save_to_db #42245

merged 9 commits into from
Sep 20, 2024

Conversation

uranusjr
Copy link
Member

This function collects information from DAG objects, and creates/updates database rows against them. However, it handles A LOT of information, reading a lot of objects, touching a lot of models. The function is not very readable.

A new module has been introduced in airflow.dag_processing.collection to encapsulate the logic previously in bulk_save_to_db.

Some loops are broken down into multiple loops, so each loop does not do too much (which leads to a lot of long-living variables that reduce readability). Not much is changed aside from that, just mostly splitting things into separate steps to make things clearer.

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Sep 16, 2024
@uranusjr uranusjr force-pushed the refactor-dag-load branch 4 times, most recently from 7ddfec5 to 5d83fe0 Compare September 18, 2024 09:57
This function collects information from DAG objects, and creates/updates
database rows against them. However, it handles A LOT of information,
reading a lot of objects, touching a lot of models. The function is not
very readable.

A new module has been introduced in airflow.dag_processing.collection to
encapsulate the logic previously in bulk_save_to_db.

Some loops are broken down into multiple loops, so each loop does not do
too much (which leads to a lot of long-living variables that reduce
readability). Not much is changed aside from that, just mostly splitting
things into separate steps to make things clearer.
The dataset manager already does this.
This saves a few calls when no dataset/alias references were found in
any DAGs.
@uranusjr
Copy link
Member Author

Note of self: Submit another PR to refactor _register_dataset_changes. There are at least a few db optimisations we can do.

@uranusjr
Copy link
Member Author

Also maybe we need a new method in DatasetManager for creating aliases. (Or even more for other things?)

@uranusjr
Copy link
Member Author

Here we gooooooo

@uranusjr uranusjr marked this pull request as ready for review September 19, 2024 01:41
Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com>
@uranusjr uranusjr merged commit 8d816fb into apache:main Sep 20, 2024
51 checks passed
@uranusjr uranusjr deleted the refactor-dag-load branch September 20, 2024 06:45
joaopamaral pushed a commit to joaopamaral/airflow that referenced this pull request Oct 21, 2024
Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants