Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improves deletion of old artifacts. #11079

Merged
merged 1 commit into from
Sep 22, 2020

Conversation

potiuk
Copy link
Member

@potiuk potiuk commented Sep 22, 2020

We introduced deletion of the old artifacts as this was a
suspected culprit of Kubernetes Job failures. It turned out
eventually that those Kubernetes Job failures were caused by
the #11017 change, but it's good to do housekeeping of the
artifacts anyway.

The delete workflow action introduced in a hurry had two problems:

  • it runs for every fork if they sync master. This is a bit
    too invasive

  • it fails continuously after 10 - 30 minutes every time
    as we have too many old artifacts to delete (GitHub has
    90 days retention policy so we have likely tens of
    thousands of artifacts to delete)

  • it runs every hour and it causes occasional API rate limit
    exhaution (because we have too many artifacts to loop trough)

This PR introduces filtering with the repo, changes frequency
of deletion to be 4 times a day and adds script that we are
running manualy to delete those excessive artifacts now. Eventually
when the number of artifacts goes down the regular job shoul delete
maybe few hundreds of artifacts appearing within the 6 hours window
and it should stop failing.


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

We introduced deletion of the old artifacts as this was a
suspected culprit of Kubernetes Job failures. It turned out
eventually that those Kubernetes Job failures were caused by
the apache#11017 change, but it's good to do housekeeping of the
artifacts anyway.

The delete workflow action introduced in a hurry had two problems:

* it runs for every fork if they sync master. This is a bit
  too invasive

* it fails continuously after 10 - 30 minutes every time
  as we have too many old artifacts to delete (GitHub has
  90 days retention policy so we have likely tens of
  thousands of artifacts to delete)

* it runs every hour and it causes occasional API rate limit
  exhaution (because we have too many artifacts to loop trough)

This PR introduces filtering with the repo, changes frequency
of deletion to be 4 times a day and adds script that we are
running manualy to delete those excessive artifacts now. Eventually
when the number of artifacts goes down the regular job shoul delete
maybe few hundreds of artifacts appearing within the 6 hours window
and it should stop failing.
@@ -1,11 +1,12 @@
name: 'Delete old artifacts'
on:
schedule:
- cron: '0 * * * *' # every hour
- cron: '27 */6 * * *' # run every 6 hours
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say "Every day" would be good enough, WDYT

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a bit of a problem - we cannot rate-limit the action and we occasionally exhausted the limit now.
I did some back-of-the-envelope calculations:

The limit is 5000 API calls / hr. So if we have > 3000 artifacts to delete, we are getting dangerously close to be able to exhaust our API calls within single run.

We have (tops) ~ 200 builds a day with (tops) ~ 50 artifacts each (assume intensive period and increasing number of artifacts) > 10.000 artifacts to delete a day. Running it 4 times/day is ~ 2.500 artifacts to delete for each run.

Copy link
Member Author

@potiuk potiuk Sep 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW. I am running the deletion now using the script with ~ 1000 deletions/hour ( I also hit rate limit with my personal token). With those assumptions and 90 day retention period, it will take ~ 10 days to delete all old artifacts (assuming we have 30% of the 10.000/day = 3000/day * 90 /24 ~ 10 days :).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW. I think my calculations were a bit too pessimistics. Seems that 24 hours ewere enough to delete the artifacts. The job started to succeed and it takes ~ 5 minutes to run when it runs. Also my script is now often finishing after some 30-50 artifacts so seems we got to a "clean state" much faster than I thought.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. Now when I run "delete" and no jobs finish in the meantime there are no artifacts to delete :) @dimberman -> artifacts are cleaned up now..

@@ -0,0 +1,84 @@
#!/usr/bin/env bash
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we could have written it in python/octokit, but I've found the bash script that worked out of the box. Improved it slightly and refactored according to Google Shell Guide and it seems to do it''s job (running it now in the background).

@potiuk
Copy link
Member Author

potiuk commented Sep 22, 2020

Raw logs show "2020-09-22T12:02:38.4767818Z ##[error]Process completed with exit code 137." (seems a worker killed or OOM)

Merging.

@potiuk potiuk merged commit cea9e82 into apache:master Sep 22, 2020
@potiuk potiuk deleted the improve-artifact-deletion branch September 22, 2020 12:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants