Frequent failures of helm chart tests #24037
Comments
cc: @kaxil @jedcunningham @ephraimbuddy - I think it would be great to find a better way. We do not really need that postgres chart to be frequently updated, so hosting it ourselves (say, in a GitHub repo) is, I think, the best option. WDYT?
We can also vendor it in chart/charts. Have we tried to reach out to the bitnami folks to see if they are aware of the issues and working to resolve them, though? That's the easiest option for us, at least.
Surely - we can ask, but they are part of VMware, so I have no high expectations that their free offering will be looked at carefully. But I definitely will. However, I think we should seriously consider moving it to GitHub (or vendoring it in) regardless of their answer. What do we do if they answer "we fixed it"? It happens intermittently, so we are not even able to check whether they did. I really do not like situations where WE have to carry the burden of CI errors (and make our contributors unhappy) when someone else screwed up.

I am not sure it is a serious effort either - it is really a one-time effort which needs zero maintenance (maybe an upgrade from time to time). Compare that with multiple unforeseen errors that are yet another reason for our users to learn that "red" is normal and to reach out for help. I think the key to keeping CI "nice" for the users is to relentlessly eliminate any reasons for potential problems - leaving only those that are "real" problems, because false negatives undermine your trust in it, and the more it happens, the more you reach out to maintainers with "the tests failed, please help" - we should keep that possibility down to a minimum. It can never be eliminated entirely, but if we do not have control over something, we cannot improve it.

I heard very similar concerns when I moved the images used during our integration tests. Do you remember the last time we had a problem with this image: https://github.com/orgs/apache/packages?tab=packages&q=airflow-openldap ? Me neither. And we had plenty of similar problems with those images from time to time when they were pulled from Docker Hub. It's literally zero overhead - I pushed it once 11 months ago and have not touched it since. I think the one-time effort of moving stuff to GitHub (either as a separate repo or vendored in) is certainly worth it over the long run.

The nice thing about keeping everything in GitHub is that when it fails, it generally fails in a way where many things do not work (including Actions or CI in general). So when something breaks in GitHub, people are generally aware of it (and more often than not they simply cannot even submit or run their CI jobs). And if we see intermittent errors like that, we have enterprise-level support with them via the Apache agreement - they have either commented with workarounds or fixed many issues I raised with them already.
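To illustrate what that one-time push looks like, mirroring an image into GitHub Packages is roughly the following (a sketch only - the upstream image name and tag below are placeholders, not the actual Airflow CI images):

```bash
# Sketch of mirroring a test image to GitHub Packages (ghcr.io).
# The upstream image name/tag are placeholders, not the real Airflow CI images.
docker pull some-upstream/openldap:1.4.0
docker tag  some-upstream/openldap:1.4.0 ghcr.io/apache/airflow-openldap:1.4.0

# Authenticate with a personal access token that has the write:packages scope.
echo "$GITHUB_TOKEN" | docker login ghcr.io -u "$GITHUB_USER" --password-stdin

docker push ghcr.io/apache/airflow-openldap:1.4.0
```

After that, CI just pulls from ghcr.io and the external registry is out of the critical path.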
Someone opened an issue in bitnami/charts about it literally 15 minutes ago. I commented there: bitnami/charts#10535 (comment). Any upvotes or cheering comments are most welcome.
Speaking of "supporting free use" - they just closed the issue and redirected it to issue they are "working on" bitnami/charts#8433 which has been open 16 Dec 2021. @jedcunningham - do you still think it's the "easiest route" :P ? |
BTW, roughly looking at the issue, about half of the ~50 comments there are "we have the same issue - here are the details", and the other half (exaggerating a bit for effect, of course) is "Thanks, we are working on it" from bitnami support - some of those "we are working on it" replies are from February.
I've opened #24089 to point at GitHub for the index instead. This will stop the bleeding at least and give us a little time to see if they resolve it (sounds like it is now causing them pain, so I'm hopeful) or if we need to do something else.
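For illustration, pointing helm at a GitHub-hosted copy of the index can look roughly like this (a sketch; the raw.githubusercontent.com URL below is an assumption, not necessarily the one used in #24089):

```bash
# Sketch: fetch the repo index from GitHub instead of charts.bitnami.com.
# The exact URL is an assumption for illustration purposes only.
helm repo remove bitnami 2>/dev/null || true
helm repo add bitnami https://raw.githubusercontent.com/bitnami/charts/archive-full-index/bitnami
helm repo update

# Resolve the postgresql dependency of our chart against the new index.
helm dependency build chart
```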
So, taking into account what happened here tonight (bitnami/charts#10539), I believe vendoring in the chart is the best idea. I REALLY don't like it when someone's arbitrary decision can break all our released versions of the chart.
It doesn't break our released versions thankfully (helm vendors it in the tarball already), but it does hose our CI and main for sure. +1 to vendoring it. Should we update to the newest version while we are at it?
Yeah, I think so. Happy to do it and learn a bit more. It's great that it's already vendored in the tarball.
After the Bitnami fiasco (bitnami/charts#10539) we lost trust in the bitnami index being a good and reliable source of charts. That's why we vendored in the postgres chart needed for our Helm chart. Fixes: #24037
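For reference, vendoring a subchart generally looks something like the sketch below; the commands, paths, and fetched version are illustrative and may differ from what the actual PR does:

```bash
# Sketch of vendoring the postgresql subchart into chart/charts/ and pinning it in git.
cd chart
helm repo add bitnami https://charts.bitnami.com/bitnami   # one-time fetch of the tarball
helm dependency update .                                    # writes charts/postgresql-<version>.tgz

git add charts/*.tgz Chart.lock
git commit -m "Vendor the postgresql chart"
# Once the tarball is committed, installs no longer need to reach the bitnami index.
```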
Apache Airflow version
main (development)
What happened
We keep getting very frequent failures of Helm Chart tests, and it seems that a large number of those errors happen when pulling the postgres chart from bitnami:
Example here (but I saw it happening very often recently):
https://github.com/apache/airflow/runs/6666449965?check_suite_focus=true#step:9:314
It is not only a problem for our CI; it might be a similar problem for our users who want to install the chart - they could hit the same kind of errors.
I guess we should either make it more resilient to intermittent problems with bitnami charts, use another chart, or maybe even host the chart ourselves somewhere within Apache infrastructure. While the postgres chart is not really needed for most "production" users, it is still a dependency of our chart, and it makes our chart depend on an external and apparently flaky service.
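One possible stop-gap for the "more resilient" option (just a sketch, not necessarily what we should end up doing) would be to retry the dependency fetch a few times so that a transient index outage does not fail the whole build:

```bash
# Sketch: retry fetching the chart dependencies a few times before giving up,
# so a transient bitnami index outage does not immediately fail CI.
for attempt in 1 2 3; do
  if helm repo update && helm dependency build chart; then
    break
  fi
  echo "Fetching the postgresql chart failed (attempt ${attempt}), retrying..." >&2
  sleep 10
done
```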
What you think should happen instead
We should find (or host ourselves) a more stable dependency, or get rid of it.
How to reproduce
Look at some recent CI builds: they often fail in the K8S tests, and more often than not the reason is a missing postgresql chart.
Operating System
any
Versions of Apache Airflow Providers
not relevant
Deployment
Other
Deployment details
CI
Anything else
Happy to make the change once we agree what's the best way :).
Are you willing to submit PR?
Code of Conduct