Frequent failures of helm chart tests #24037
Comments
cc: @kaxil @jedcunningham @ephraimbuddy - I think it would be great to find a better way. We do not really need that postgres chart to be frequently updated, so hosting it ourselves (say, in a GitHub repo) is, I think, the best option. WDYT?
We can also vendor it in chart/charts. Have we tried to reach out to the bitnami folks to see if they are aware of the issues and working to resolve them, though? That's the easiest option for us, at least.
Surely - we can ask, but they are part of VMware, so I have no high expectations that their free offering will be looked at carefully. But I definitely will. However, I think we should seriously consider moving it to GitHub (or vendoring it in) regardless of their answer. What do we do if they answer "we fixed it"? It happens intermittently, so we are not even able to check whether they did. I really do not like situations where WE have to carry the burden of CI errors (and make our contributors unhappy) when someone else screwed up.

I am not sure it is a serious effort either - it is really a one-time effort which needs zero maintenance (maybe an upgrade from time to time). Compare that with multiple unforeseen errors that are yet another reason for our users to learn that "red" is normal and to reach out for help. I think the key to keeping CI "nice" for the users is to relentlessly eliminate any reasons for potential problems - leaving only those that are "real" problems, because false negatives undermine your trust in it, and the more it happens, the more you reach out to maintainers with "the tests failed, please help" - we should keep that possibility down to a minimum. It can never be eliminated entirely, but if we do not have control over something, we cannot improve it.

I heard very similar concerns when I moved the images used during our integration tests. Do you remember the last time we had a problem with this image: https://github.com/orgs/apache/packages?tab=packages&q=airflow-openldap ? Me neither. And we had plenty of similar problems with those images from time to time when they were pulled from Docker Hub. It's literally zero overhead - I pushed it once 11 months ago and have not touched it since. I think the one-time effort of moving stuff to GitHub (either as a separate repo or vendored in) is certainly worth it over the long run.

The nice thing about keeping everything in GitHub is that when it fails, it generally fails in a way where many things do not work (including Actions or CI in general). So when something breaks in GitHub, people are generally aware of it (and more often than not they simply cannot even submit or run their CI jobs). And if we see intermittent errors like that, we have enterprise-level support with them via the Apache agreement - they have either commented with workarounds or fixed many issues I raised with them already.
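To illustrate what that one-time push looks like, mirroring an image into GitHub Packages is roughly the following (a sketch only - the upstream image name and tag below are placeholders, not the actual Airflow CI images):

```bash
# Sketch of mirroring a test image to GitHub Packages (ghcr.io).
# The upstream image name/tag are placeholders, not the real Airflow CI images.
docker pull some-upstream/openldap:1.4.0
docker tag  some-upstream/openldap:1.4.0 ghcr.io/apache/airflow-openldap:1.4.0

# Authenticate with a personal access token that has the write:packages scope.
echo "$GITHUB_TOKEN" | docker login ghcr.io -u "$GITHUB_USER" --password-stdin

docker push ghcr.io/apache/airflow-openldap:1.4.0
```

After that, CI just pulls from ghcr.io and the external registry is out of the critical path.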
Someone opened an issue in bitnami/charts about it literally 15 minutes ago. I commented there: bitnami/charts#10535 (comment). Any upvotes or cheering comments are most welcome.
Speaking of "supporting free use" - they just closed the issue and redirected it to issue they are "working on" bitnami/charts#8433 which has been open 16 Dec 2021. @jedcunningham - do you still think it's the "easiest route" :P ? |
BTW, roughly looking at the issue, about half of the ~50 comments there are "we have the same issue - here are the details", and the other half (exaggerating a bit for effect, of course) is "Thanks, we are working on it" from bitnami support - some of those "we are working on it" replies are from February.
I've opened #24089 to point at GitHub for the index instead. This will stop the bleeding at least and give us a little time to see if they resolve it (sounds like it is now causing them pain, so I'm hopeful) or if we need to do something else.
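For illustration, pointing helm at a GitHub-hosted copy of the index can look roughly like this (a sketch; the raw.githubusercontent.com URL below is an assumption, not necessarily the one used in #24089):

```bash
# Sketch: fetch the repo index from GitHub instead of charts.bitnami.com.
# The exact URL is an assumption for illustration purposes only.
helm repo remove bitnami 2>/dev/null || true
helm repo add bitnami https://raw.githubusercontent.com/bitnami/charts/archive-full-index/bitnami
helm repo update

# Resolve the postgresql dependency of our chart against the new index.
helm dependency build chart
```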
So, taking into account what happened here tonight (bitnami/charts#10539), I believe vendoring in the chart is the best idea. I REALLY don't like it when someone's arbitrary decision can break all our released versions of the chart.
It doesn't break our released versions thankfully (helm vendors it in the tarball already), but it does hose our CI and main for sure. +1 to vendoring it. Should we update to the newest version while we are at it?
Yeah, I think so. Happy to do it and learn a bit more. It's great that it's already vendored in the tarball.
After the Bitnami fiasco (bitnami/charts#10539) we lost trust in the bitnami index being a good and reliable source of charts. That's why we vendored in the postgres chart needed for our Helm chart. Fixes: #24037
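For reference, vendoring a subchart generally looks something like the sketch below; the commands, paths, and fetched version are illustrative and may differ from what the actual PR does:

```bash
# Sketch of vendoring the postgresql subchart into chart/charts/ and pinning it in git.
cd chart
helm repo add bitnami https://charts.bitnami.com/bitnami   # one-time fetch of the tarball
helm dependency update .                                    # writes charts/postgresql-<version>.tgz

git add charts/*.tgz Chart.lock
git commit -m "Vendor the postgresql chart"
# Once the tarball is committed, installs no longer need to reach the bitnami index.
```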
Apache Airflow version
main (development)
What happened
We keep getting very frequent failures of Helm Chart tests, and it seems that a large number of those errors happen when pulling the postgres chart from bitnami:
Example here (but I saw it happening very often recently):
https://github.com/apache/airflow/runs/6666449965?check_suite_focus=true#step:9:314
It is not only a problem for our CI; it might be a similar problem for our users who want to install the chart - they could hit the same kind of errors.
I guess we should either make it more resilient to intermittent problems with bitnami charts, use another chart, or maybe even host the chart ourselves somewhere within Apache infrastructure. While the postgres chart is not really needed for most "production" users, it is still a dependency of our chart, and it makes our chart depend on an external and apparently flaky service.
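One possible stop-gap for the "more resilient" option (just a sketch, not necessarily what we should end up doing) would be to retry the dependency fetch a few times so that a transient index outage does not fail the whole build:

```bash
# Sketch: retry fetching the chart dependencies a few times before giving up,
# so a transient bitnami index outage does not immediately fail CI.
for attempt in 1 2 3; do
  if helm repo update && helm dependency build chart; then
    break
  fi
  echo "Fetching the postgresql chart failed (attempt ${attempt}), retrying..." >&2
  sleep 10
done
```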
What you think should happen instead
We should find (or host ourselves) a more stable dependency, or get rid of it.
How to reproduce
Look at some recent CI builds: they often fail in the K8S tests, and more often than not the reason is a missing postgresql chart.
Operating System
any
Versions of Apache Airflow Providers
not relevant
Deployment
Other
Deployment details
CI
Anything else
Happy to make the change once we agree what's the best way :).
Are you willing to submit PR?
Code of Conduct