What happened:
Our cluster has connectivity issues on K8s versions 1.29, 1.30, and 1.31. The initial migration to nodes running those versions is smooth and stays that way for 45-60 minutes, then aws-node pods start to enter CrashLoopBackOff and coredns starts to have errors resolving internal and local domains. No container logs show anything useful.
There are 3 nodes in the cluster, each with a maximum of 15 pods (well within the eni-max-pods.txt value of 58), provisioned with eksctl. I have tried running Amazon Linux 2 and AL2023; both have the same issue.
Migrating to m6i.2xlarge mitigates the issue and workloads continue to run smoothly. Resource usage is low, so ideally we don't want to be stuck on larger machines.
We see a lot of this line suggesting that something is restarting/crashing, but couldn't see an indicator of what or why.
Do you see the same behavior without workloads on your cluster? I did an initial analysis of the logs and found a lot of network connectivity stress and API server connectivity issues.
If you bring up a cluster with the same eksctl config that you have, but with a minimal workload, does it run into a similar issue after 45-60 minutes?
Nodes are fine when there is no load and just the system pods running. There is a decent amount of network connectivity, but it is small in comparison to some of our other clusters; could the instance size combined with higher network activity be causing the connectivity issues?
Worth noting that the traffic wouldn't have increased over the course of the upgrade; we never had connectivity issues on older k8s versions.
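One way to sanity-check the instance-size theory (a sketch, not something from the original thread): the ENA driver exposes per-instance network allowance counters through ethtool, so checking them on an affected m6i.xlarge node during the 45-60 minute window would show whether packet-rate, connection-tracking, or link-local (VPC DNS / instance metadata) limits are being hit. The interface name and node access method below are assumptions; on AL2023/Nitro instances the primary interface is often ens5 rather than eth0.
# Run on an affected node (via SSH or SSM); eth0 is an assumption.
ethtool -S eth0 | grep allowance_exceeded
# Growing counters such as pps_allowance_exceeded, conntrack_allowance_exceeded
# or linklocal_allowance_exceeded would point at instance-level network limits
# (which scale with instance size) rather than at the CNI itself.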
Attach logs
Sent to k8s-awscni-triage@amazon.com
What you expected to happen:
Workloads to work normally on the m6i.xlarge nodes as they have done on all previous versions.
How to reproduce it (as minimally and precisely as possible):
We have a fairly standard eksctl setup:
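The eksctl config itself wasn't captured here, so the following is only a rough placeholder showing the shape of such a setup; the cluster name, region, and nodegroup details are illustrative assumptions, not the reporter's actual values.
# Hypothetical minimal config approximating the setup described above.
cat > cluster.yaml <<'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster   # assumption: real name not given
  region: eu-west-1       # assumption: real region not given
  version: "1.31"
managedNodeGroups:
  - name: workers
    instanceType: m6i.xlarge
    desiredCapacity: 3
    amiFamily: AmazonLinux2023   # AL2 was also tried per the report
EOF
eksctl create cluster -f cluster.yaml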
Anything else we need to know?:
Environment:
Kubernetes version (use kubectl version): v1.31.4-eks-2d5f260
OS (e.g. cat /etc/os-release): AL2023 and Amazon Linux 2 both have the issue
Kernel (e.g. uname -a): 6.1.119-129.201.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Dec 3 21:07:35 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux