aws-node crashloopbackoff after delay on 1.29 onwards on specific instance type (m6i.xlarge) #3177

Open
guyfedwards opened this issue Jan 29, 2025 · 2 comments

What happened:
Our cluster has connectivity issues on Kubernetes versions 1.29, 1.30, and 1.31. The initial migration to nodes running those versions is smooth and stays that way for 45–60 minutes, after which the aws-node pods start entering CrashLoopBackOff and coredns starts failing to resolve internal and cluster-local domains. No container logs show anything useful.

The cluster has 3 nodes provisioned with eksctl, each running a maximum of 15 pods (well within the eni-max-pods.txt value of 58 for m6i.xlarge). I have tried both Amazon Linux 2 and AL2023; both have the same issue.
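
For context, the 58 in eni-max-pods.txt is not arbitrary: it follows the usual VPC CNI formula of ENIs × (IPv4 addresses per ENI − 1) + 2. A minimal sketch, assuming the documented m6i.xlarge limits of 4 ENIs with 15 IPv4 addresses each:

# max pods = ENIs * (IPs per ENI - 1) + 2; one IP on each ENI is the ENI's own primary address
echo $(( 4 * (15 - 1) + 2 ))   # prints 58

So 15 pods per node is far below both the IP and ENI limits of this instance type.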

Migrating to m6i.2xlarge mitigates the issue and workloads run smoothly again. Resource usage is low, so ideally we don't want to be stuck on larger machines.

We see a lot of this line, suggesting that something is restarting or crashing, but we couldn't find an indicator of what or why:

{"level":"info","ts":"2025-01-14T16:44:59.617Z","caller":"aws-k8s-agent/main.go:42","msg":"Starting L-IPAMD   ..."}

Attach logs
Sent to k8s-awscni-triage@amazon.com

What you expected to happen:
Workloads to work normally on the m6i.xlarge nodes as they have done on all previous versions.

How to reproduce it (as minimally and precisely as possible):
We have a fairly standard eksctl setup:

nodeGroups:
  - name: nodes-123
    instanceType: m6i.xlarge
    minSize: 3
    maxSize: 3
    desiredCapacity: 3
    privateNetworking: true
    ssh:
      allow: true
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/ElasticLoadBalancingFullAccess
        - arn:aws:iam::aws:policy/AmazonS3FullAccess
        - arn:aws:iam::aws:policy/AmazonSESFullAccess
      withAddonPolicies:
        autoScaler: true
        certManager: true
        cloudWatch: true
        ebs: true
        externalDNS: true
    availabilityZones: ["eu-west-2a"]
addons:
  - name: aws-ebs-csi-driver
    version: v1.38.1
    resolveConflicts: overwrite
  - name: coredns
    version: v1.11.4
    resolveConflicts: overwrite
  - name: kube-proxy
    version: v1.31.3
    resolveConflicts: overwrite
  - name: vpc-cni
    version: v1.19.0
    resolveConflicts: overwrite
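
For completeness, the snippet above is only the nodeGroups/addons portion of the ClusterConfig. Assuming it is saved as cluster.yaml together with the usual apiVersion/kind/metadata header, it is applied with the standard eksctl commands:

eksctl create cluster -f cluster.yaml
# or, when only (re)creating the node group on an existing cluster:
eksctl create nodegroup -f cluster.yaml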

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.31.4-eks-2d5f260
  • CNI Version 1.19.0
  • OS (e.g: cat /etc/os-release): AL2023 and Amazon Linux 2 both have the issue
  • Kernel (e.g. uname -a): 6.1.119-129.201.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Dec 3 21:07:35 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
@orsenthil (Member)

Do you see the same behavior without workloads on your cluster? I did an initial analysis of the logs and found a lot of network connectivity stress and API server connectivity issues.

If you bring up a cluster with the same eksctl config but with a minimal workload, does it run into a similar issue after 45–60 minutes?
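
Since the logs point at API server connectivity, one quick check during that window is whether the affected node can still reach the cluster endpoint at all (a sketch; <api-server-endpoint> is a placeholder for the endpoint URL from aws eks describe-cluster):

# run from the affected node; the unauthenticated /healthz path normally just returns "ok"
curl -sk https://<api-server-endpoint>/healthz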

@guyfedwards (Author)

Nodes are fine when there is no load and only the system pods are running. There is a decent amount of network traffic, but it is small compared to some of our other clusters. Could the instance size combined with higher network activity be causing the connectivity issues?

Worth noting that the traffic wouldn't have increased at the time of upgrading; we never had connectivity issues on older Kubernetes versions.
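
One concrete way to test the instance-size/network hypothesis is to look at the ENA driver's allowance counters on an affected node; they increment when the instance exceeds its bandwidth, PPS, or conntrack allowances (a sketch; the primary interface may be eth0 or ens5 depending on the AMI):

# run on the node via SSH/SSM; non-zero, growing counters mean the instance is hitting its network limits
ethtool -S eth0 | grep allowance_exceeded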
