aws-node crashloopbackoff after delay on 1.29 onwards on specific instance type (m6i.xlarge) #3177

Open
guyfedwards opened this issue Jan 29, 2025 · 2 comments

What happened:
Our cluster has connectivity issues on Kubernetes versions 1.29, 1.30, and 1.31. The initial migration to nodes running those versions is smooth and stays that way for 45–60 minutes, after which the aws-node pods start entering CrashLoopBackOff and coredns starts failing to resolve internal and cluster-local domains. No container logs show anything useful.

The cluster has 3 nodes provisioned with eksctl, each running a maximum of 15 pods (well within the eni-max-pods.txt value of 58 for m6i.xlarge). I have tried both Amazon Linux 2 and AL2023; both have the same issue.
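
For context, the 58 in eni-max-pods.txt is not arbitrary: it follows the usual VPC CNI formula of ENIs × (IPv4 addresses per ENI − 1) + 2. A minimal sketch, assuming the documented m6i.xlarge limits of 4 ENIs with 15 IPv4 addresses each:

# max pods = ENIs * (IPs per ENI - 1) + 2; one IP on each ENI is the ENI's own primary address
echo $(( 4 * (15 - 1) + 2 ))   # prints 58

So 15 pods per node is far below both the IP and ENI limits of this instance type.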

Migrating to m6i.2xlarge mitigates the issue and workloads run smoothly again. Resource usage is low, so ideally we don't want to be stuck on larger machines.

We see a lot of this line, suggesting that something is restarting or crashing, but we couldn't find an indicator of what or why:

{"level":"info","ts":"2025-01-14T16:44:59.617Z","caller":"aws-k8s-agent/main.go:42","msg":"Starting L-IPAMD   ..."}

Attach logs
Sent to k8s-awscni-triage@amazon.com

What you expected to happen:
Workloads to work normally on the m6i.xlarge nodes as they have done on all previous versions.

How to reproduce it (as minimally and precisely as possible):
We have a fairly standard eksctl setup:

nodeGroups:
  - name: nodes-123
    instanceType: m6i.xlarge
    minSize: 3
    maxSize: 3
    desiredCapacity: 3
    privateNetworking: true
    ssh:
      allow: true
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/ElasticLoadBalancingFullAccess
        - arn:aws:iam::aws:policy/AmazonS3FullAccess
        - arn:aws:iam::aws:policy/AmazonSESFullAccess
      withAddonPolicies:
        autoScaler: true
        certManager: true
        cloudWatch: true
        ebs: true
        externalDNS: true
    availabilityZones: ["eu-west-2a"]
addons:
  - name: aws-ebs-csi-driver
    version: v1.38.1
    resolveConflicts: overwrite
  - name: coredns
    version: v1.11.4
    resolveConflicts: overwrite
  - name: kube-proxy
    version: v1.31.3
    resolveConflicts: overwrite
  - name: vpc-cni
    version: v1.19.0
    resolveConflicts: overwrite
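
For completeness, the snippet above is only the nodeGroups/addons portion of the ClusterConfig. Assuming it is saved as cluster.yaml together with the usual apiVersion/kind/metadata header, it is applied with the standard eksctl commands:

eksctl create cluster -f cluster.yaml
# or, when only (re)creating the node group on an existing cluster:
eksctl create nodegroup -f cluster.yaml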

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.31.4-eks-2d5f260
  • CNI Version 1.19.0
  • OS (e.g: cat /etc/os-release): AL2023 and Amazon Linux 2 both have the issue
  • Kernel (e.g. uname -a): 6.1.119-129.201.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Dec 3 21:07:35 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
@orsenthil (Member)

Do you see the same behavior without workloads on your cluster? I did an initial analysis of the logs and found a lot of network connectivity stress and API server connectivity issues.

If you bring up a cluster with the same eksctl config but with a minimal workload, does it run into a similar issue after 45–60 minutes?
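
Since the logs point at API server connectivity, one quick check during that window is whether the affected node can still reach the cluster endpoint at all (a sketch; <api-server-endpoint> is a placeholder for the endpoint URL from aws eks describe-cluster):

# run from the affected node; the unauthenticated /healthz path normally just returns "ok"
curl -sk https://<api-server-endpoint>/healthz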

@guyfedwards (Author)

Nodes are fine when there is no load and only the system pods are running. There is a decent amount of network traffic, but it is small compared to some of our other clusters. Could the instance size combined with higher network activity be causing the connectivity issues?

Worth noting that the traffic wouldn't have increased at the time of upgrading; we never had connectivity issues on older Kubernetes versions.
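
One concrete way to test the instance-size/network hypothesis is to look at the ENA driver's allowance counters on an affected node; they increment when the instance exceeds its bandwidth, PPS, or conntrack allowances (a sketch; the primary interface may be eth0 or ens5 depending on the AMI):

# run on the node via SSH/SSM; non-zero, growing counters mean the instance is hitting its network limits
ethtool -S eth0 | grep allowance_exceeded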
