Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve memory footprint of Cosmos DbtDag, DbtTaskGroup and operators #1471

Open
3 tasks
tatiana opened this issue Jan 16, 2025 · 5 comments
Open
3 tasks

Improve memory footprint of Cosmos DbtDag, DbtTaskGroup and operators #1471

tatiana opened this issue Jan 16, 2025 · 5 comments
Assignees
Labels
area:performance Related to performance, like memory usage, CPU usage, speed, etc
Milestone

Comments

@tatiana
Copy link
Collaborator

tatiana commented Jan 16, 2025

Context

Some users have reported high memory consumption per Airflow worker node with (and without) Cosmos, when using Airflow Celery Executors. Using Airflow Kubernetes Executor didn't help with the task-level memory consumption.

A customer shared numbers to run a trivial Airflow task is around 250 MB, without Cosmos. Using Cosmos increases the consumption to 300 MB. While running small DAGs with few parallel tasks may not be an issue, it can become a resource management issue for larger workflows.

An example of how the memory was tracked was to instantiate a sequence of PythonOperators that would execute the following function:

def write_temp_files():
    process = psutil.Process()
    temp_dir = tempfile.mkdtemp()
    print(f"Temporary directory created: {temp_dir}")

    try:
        for i in range(10):  # Create 10 temporary files
            file_path = os.path.join(temp_dir, f"temp_file_{i}.txt")
            with open(file_path, "w") as temp_file:
                temp_file.write("X" * 10**6)  # Write 1 MB of data

            mem_info = process.memory_info()
            print(f"Step {i + 1}: Memory Usage: {mem_info.rss / (1024 * 1024):.2f} MB")

            time.sleep(1)  # Simulate delay between file writes

    finally:
        for file_name in os.listdir(temp_dir):
            os.remove(os.path.join(temp_dir, file_name))
        os.rmdir(temp_dir)
        print(f"Temporary directory deleted: {temp_dir}")

What we can do

There is a limit to what we can do at a Cosmos level on improving performance since the plain Airflow baseline is already high.

The goal of this ticket is to:

  • confirm the memory consumption per task in a distributed environment, such as Astro
  • spike making imports local inside def and not at top-level modules
  • evaluate other possible improvements
@tatiana tatiana added this to the Cosmos 1.10.0 milestone Jan 16, 2025
@tatiana tatiana added the area:performance Related to performance, like memory usage, CPU usage, speed, etc label Jan 16, 2025
@josix
Copy link

josix commented Jan 16, 2025

Hello @tatiana, I'm interested in the performance enhancement topic. If the task isn't urgent, I'd be happy to start by investigating the current memory footprint and familiarizing myself with the codebase. Please feel free to assign it to me if possible, thanks!

@tatiana
Copy link
Collaborator Author

tatiana commented Jan 16, 2025

@josix, we'd love help on this; thank you very much! I'm assigning the ticket to you - please let us know if you need any support.

@sbaldassin
Copy link

@josix hey here's the customer that shared the numbers above :). please do let me know if there's anything you want me to test/try. cheers

Copy link

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 16, 2025
@josix
Copy link

josix commented Feb 16, 2025

Apologies for the delayed response because I got caught up with other tasks. I'll share some progress updates this week.

@github-actions github-actions bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:performance Related to performance, like memory usage, CPU usage, speed, etc
Projects
None yet
Development

No branches or pull requests

3 participants