KubernetesJobWatcher no longer inherits from Process #11017

Merged
merged 1 commit into apache:master from kubernetes-job-watcher-seperate on Sep 18, 2020

Conversation

dimberman
Contributor

multiprocessing.Process is set up in a very unfortunate manner
that pretty much makes it impossible to test a class that inherits from
Process or use any of its internal functions. For this reason we decided
to separate the actual process-based functionality into a class member.
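For context, the change moves from inheriting `multiprocessing.Process` to holding the process as a class member. Below is a minimal sketch of that composition pattern, using hypothetical class and attribute names rather than the actual Airflow `KubernetesJobWatcher` code:

```python
import multiprocessing
from typing import Optional


class Watcher:
    """Hypothetical watcher; names here are illustrative, not the Airflow classes."""

    def __init__(self, namespace: str) -> None:
        self.namespace = namespace
        self.process: Optional[multiprocessing.Process] = None

    def run(self) -> None:
        # The watch loop is now a plain method, so unit tests can call or
        # mock it directly without spawning a real OS process.
        print(f"watching namespace {self.namespace}")

    def start(self) -> None:
        # Process-based behaviour lives in a class member rather than in a
        # base class (composition instead of inheriting from Process).
        process = multiprocessing.Process(target=self.run, daemon=True)
        process.start()
        self.process = process


if __name__ == "__main__":
    watcher = Watcher("default")
    watcher.start()
    watcher.process.join()
```

In tests, `Watcher.run` can be invoked or patched directly, while `start()` still launches the real subprocess in production code.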


@boring-cyborg boring-cyborg bot added the area:Scheduler and k8s labels Sep 18, 2020
@dimberman dimberman requested review from potiuk, ashb and kaxil September 18, 2020 17:50
@dimberman dimberman merged commit 1539bd0 into apache:master Sep 18, 2020
@dimberman dimberman deleted the kubernetes-job-watcher-seperate branch September 18, 2020 18:33
dimberman added a commit that referenced this pull request Sep 21, 2020
dimberman added a commit that referenced this pull request Sep 21, 2020
potiuk added a commit to PolideaInternal/airflow that referenced this pull request Sep 22, 2020
We introduced deletion of the old artifacts as this was a
suspected culprit of Kubernetes Job failures. It turned out
eventually that those Kubernetes Job failures were caused by
the apache#11017 change, but it's good to do housekeeping of the
artifacts anyway.

The delete workflow action introduced in a hurry had three problems:

* it runs for every fork if they sync master. This is a bit
  too invasive

* it fails continuously after 10 - 30 minutes every time
  as we have too many old artifacts to delete (GitHub has
  a 90-day retention policy, so we likely have tens of
  thousands of artifacts to delete)

* it runs every hour and it causes occasional API rate limit
  exhaustion (because we have too many artifacts to loop through)

This PR introduces filtering by repo, changes the frequency
of deletion to 4 times a day, and adds a script that we are
running manually to delete the excessive artifacts now. Eventually,
when the number of artifacts goes down, the regular job should only
have to delete a few hundred artifacts appearing within each
6-hour window, and it should stop failing.
potiuk added a commit that referenced this pull request Sep 22, 2020
We introduced deletion of the old artifacts as this was
the suspected culprit of Kubernetes Job failures. It turned out
eventually that those Kubernetes Job failures were caused by
the #11017 change, but it's good to do housekeeping of the
artifacts anyway.

The delete workflow action introduced in a hurry had three problems:

* it runs for every fork if they sync master. This is a bit
  too invasive

* it fails continuously after 10 - 30 minutes every time
  as we have too many old artifacts to delete (GitHub has
  a 90-day retention policy, so we likely have tens of
  thousands of artifacts to delete)

* it runs every hour and it causes occasional API rate limit
  exhaustion (because we have too many artifacts to loop through)

This PR introduces filtering by repo and changes the frequency
of deletion to 4 times a day. A back-of-the-envelope calculation
caps each of the 4 daily runs at roughly 2500 artifacts to delete,
so we have a low risk of reaching the 5000 API calls/hr rate limit.
It also adds a script that we are running manually to delete the
excessive artifacts now. Eventually, when the number of artifacts
goes down, the regular job should only have to delete a few hundred
artifacts appearing within each 6-hour window under normal
circumstances, and it should stop failing then.
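The fix described in the commit messages above boils down to three things: restrict the workflow to the main repo, run it 4 times a day, and clear the existing backlog with a manual script. As a rough illustration only (this is not the script added by the commit; the repository name, token handling, and the 2500-per-run cap are placeholder assumptions), a cleanup loop against the public GitHub Actions artifacts API might look like:

```python
import os

import requests

# Illustrative sketch of an artifact-cleanup loop; not the script from the
# commit above. Repository and token are placeholders taken from env vars.
REPO = os.environ.get("GITHUB_REPOSITORY", "apache/airflow")
TOKEN = os.environ["GITHUB_TOKEN"]
HEADERS = {
    "Authorization": f"token {TOKEN}",
    "Accept": "application/vnd.github.v3+json",
}
API = f"https://api.github.com/repos/{REPO}/actions/artifacts"


def delete_artifacts(max_deletions: int = 2500) -> int:
    """Delete up to max_deletions listed artifacts and return the count."""
    deleted = 0
    while deleted < max_deletions:
        # Always re-fetch the first page; deletions shift remaining artifacts forward.
        page = requests.get(API, headers=HEADERS, params={"per_page": 100}).json()
        artifacts = page.get("artifacts", [])
        if not artifacts:
            break
        for artifact in artifacts:
            if deleted >= max_deletions:
                break
            # One DELETE call per artifact; count the attempt either way.
            requests.delete(f"{API}/{artifact['id']}", headers=HEADERS)
            deleted += 1
    return deleted


if __name__ == "__main__":
    print(f"deleted {delete_artifacts()} artifacts")
```

With the 2500-per-run cap, a single run spends roughly 2500 delete calls plus a few dozen list calls, which stays comfortably below the 5000 requests/hour limit mentioned in the commit message.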
@kaxil kaxil added the provider:cncf-kubernetes label and removed the area:k8s label Nov 18, 2020