[Bug] maxPodsPerNode doesn't work with EKS 1.22 #5134

Closed
mathieu-lemay opened this issue Apr 18, 2022 · 21 comments · Fixed by #5808
Assignees: cPu1
Labels: area/nodegroup, kind/bug, priority/important-longterm (Important over the long term, but may not be currently staffed and/or may require multiple releases)

Comments

@mathieu-lemay (Author):

What were you trying to accomplish?

I'm trying to create a managed node group with a limit on the number of pods per node.

What happened?

The node group is created, but maxPodsPerNode is ignored and the nodes use their default value instead (29 in my case for an m5.large node).

How to reproduce it?

$ cat > nodegroup.yaml << EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
managedNodeGroups:
  -
    name: test-max-pods
    desiredCapacity: 1
    minSize: 1
    maxSize: 5
    maxPodsPerNode: 12
    iam:
      withAddonPolicies:
        appMesh: true
        appMeshPreview: true
        autoScaler: true
        efs: true
metadata:
  name: my-eks-1-22-cluster
  region: ca-central-1
  version: auto
EOF

$ eksctl create nodegroup -f nodegroup.yaml
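
Once the node is up, a quick way to check the effective limit (node name taken from the creation logs below):

$ kubectl get node ip-10-75-1-120.ca-central-1.compute.internal -o jsonpath='{.status.capacity.pods}{"\n"}'
29  # expected 12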

Logs
Creation log

2022-04-18 14:46:56 [ℹ]  using region ca-central-1
2022-04-18 14:46:57 [ℹ]  will use version 1.22 for new nodegroup(s) based on control plane version
2022-04-18 14:46:58 [ℹ]  nodegroup "test-max-pods" will use "" [AmazonLinux2/1.22]
2022-04-18 14:46:59 [!]  retryable error (Throttling: Rate exceeded
        status code: 400, request id: e73d37bd-b940-484c-ad78-6312b8b5e6d3) from cloudformation/DescribeStacks - will retry after delay of 6.20133802s
2022-04-18 14:47:06 [ℹ]  4 existing nodegroup(s) (my-eks-1-22-cluster-a,my-eks-1-22-cluster-b,my-eks-1-22-cluster-c,my-eks-1-22-cluster-d) will be excluded
2022-04-18 14:47:06 [ℹ]  1 nodegroup (test-max-pods) was included (based on the include/exclude rules)
2022-04-18 14:47:06 [ℹ]  will create a CloudFormation stack for each of 1 managed nodegroups in cluster "my-eks-1-22-cluster"
2022-04-18 14:47:06 [ℹ]
2 sequential tasks: { fix cluster compatibility, 1 task: { 1 task: { create managed nodegroup "test-max-pods" } }
}
2022-04-18 14:47:06 [ℹ]  checking cluster stack for missing resources
2022-04-18 14:47:07 [ℹ]  cluster stack has all required resources
2022-04-18 14:47:07 [!]  retryable error (Throttling: Rate exceeded
        status code: 400, request id: 1aaf9b5d-6bb7-4370-a4f5-c982f58dcc34) from cloudformation/DescribeStacks - will retry after delay of 5.132635276s
2022-04-18 14:47:13 [ℹ]  building managed nodegroup stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:47:13 [ℹ]  deploying stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:47:13 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:47:32 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:47:51 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:48:07 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:48:25 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:48:41 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:49:01 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:49:17 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:49:33 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:49:52 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:50:10 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:50:28 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:50:45 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:51:05 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:51:05 [ℹ]  no tasks
2022-04-18 14:51:05 [✔]  created 0 nodegroup(s) in cluster "my-eks-1-22-cluster"
2022-04-18 14:51:05 [ℹ]  nodegroup "test-max-pods" has 1 node(s)
2022-04-18 14:51:05 [ℹ]  node "ip-10-75-1-120.ca-central-1.compute.internal" is ready
2022-04-18 14:51:05 [ℹ]  waiting for at least 1 node(s) to become ready in "test-max-pods"
2022-04-18 14:51:05 [ℹ]  nodegroup "test-max-pods" has 1 node(s)
2022-04-18 14:51:05 [ℹ]  node "ip-10-75-1-120.ca-central-1.compute.internal" is ready
2022-04-18 14:51:05 [✔]  created 1 managed nodegroup(s) in cluster "my-eks-1-22-cluster"
2022-04-18 14:51:06 [ℹ]  checking security group configuration for all nodegroups
2022-04-18 14:51:06 [ℹ]  all godegroups have up-to-date cloudformation templates

kubectl describe node/ip-10-75-1-120.ca-central-1.compute.internal

<== removed ==>
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         2
  ephemeral-storage:           83873772Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      7934440Ki
  pods:                        29  # <-- Should be 12
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         1930m
  ephemeral-storage:           76224326324
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      7244264Ki
  pods:                        29  # <-- Should be 12
System Info:
  Machine ID:                 ec2c7770b7e8fd8b2edd9808f7b986a6
  System UUID:                ec2c7770-b7e8-fd8b-2edd-9808f7b986a6
  Boot ID:                    9bbc3b1f-38e7-424b-ac45-b2d093438d75
  Kernel Version:             5.4.181-99.354.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.13
  Kubelet Version:            v1.22.6-eks-7d68063
  Kube-Proxy Version:         v1.22.6-eks-7d68063
<== removed ==>

Anything else we need to know?
Debian 11, using the downloaded 0.93.0 binary.

Versions

$ eksctl info
eksctl version: 0.93.0
kubectl version: v1.23.5
OS: linux
$ eksctl get clusters --name my-eks-1-22-cluster
2022-04-18 14:55:24 [ℹ]  eksctl version 0.93.0
2022-04-18 14:55:24 [ℹ]  using region ca-central-1
NAME                VERSION STATUS CREATED              VPC     SUBNETS    SECURITYGROUPS PROVIDER
my-eks-1-22-cluster 1.22    ACTIVE 2022-04-14T14:19:15Z vpc-xxx subnet-xxx sg-xxx         EKS
@Skarlso (Contributor) commented Apr 18, 2022

Hello, can you please also verify that the created launch template's user data sets max pods to 12?

@mathieu-lemay (Author):

Looks like it. This is the user data:

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=63096ae1a5df4c7b8a9e6a77290c89ef3f47a3a436b02df68a95bf6a8458

--63096ae1a5df4c7b8a9e6a77290c89ef3f47a3a436b02df68a95bf6a8458
Content-Type: text/x-shellscript
Content-Type: charset="us-ascii"

#!/bin/sh
set -ex
sed -i -E "s/^USE_MAX_PODS=\"\\$\{USE_MAX_PODS:-true}\"/USE_MAX_PODS=false/" /etc/eks/bootstrap.sh
KUBELET_CONFIG=/etc/kubernetes/kubelet/kubelet-config.json
echo "$(jq ".maxPods=12" $KUBELET_CONFIG)" > $KUBELET_CONFIG
--63096ae1a5df4c7b8a9e6a77290c89ef3f47a3a436b02df68a95bf6a8458--
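
(For reference, the user data above can be pulled out of the generated launch template like this; the launch template ID is a placeholder:)

$ aws ec2 describe-launch-template-versions \
    --launch-template-id lt-0123456789abcdef0 --versions '$Latest' \
    --query 'LaunchTemplateVersions[0].LaunchTemplateData.UserData' \
    --output text | base64 -d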

@Skarlso (Contributor) commented Apr 18, 2022

Okay cool. That's something at least. :)

We'll take a look, but if we provide the right flags, I'm afraid there is little we can do.

Have you tried testing it with more than 12 pods? It might write 29, but it might not allow more than 12 using the controller, or something something AWS magic? :)

@mathieu-lemay (Author):

Fair enough, it could be something that changed within EKS.

I did test it already; unfortunately, there was no AWS magic, and I ended up with about 27 pods. That's how I noticed the issue.

@Skarlso (Contributor) commented Apr 19, 2022

Thanks!

@cPu1 (Contributor) commented Apr 19, 2022

> Fair enough, it could be something that changed within EKS.
>
> I did test it already; unfortunately, there was no AWS magic, and I ended up with about 27 pods. That's how I noticed the issue.

I initially suspected that the script eksctl uses to set max pods for managed nodegroups no longer works in EKS 1.22, potentially because the bootstrap script in 1.22 AMIs has changed. But after testing, I can confirm that eksctl is still able to set maxPods in the kubelet config but it's not being honoured.

@cPu1 (Contributor) commented Apr 19, 2022

> Fair enough, it could be something that changed within EKS.
> I did test it already; unfortunately, there was no AWS magic, and I ended up with about 27 pods. That's how I noticed the issue.

> I initially suspected that the script eksctl uses to set max pods for managed nodegroups no longer works in EKS 1.22, potentially because the bootstrap script in 1.22 AMIs has changed. But after testing, I can confirm that eksctl is still able to set maxPods in the kubelet config but it's not being honoured.

I have tracked it down to EKS supplying --max-pods as an argument to the kubelet. The implementation for maxPodsPerNode in eksctl writes the maxPods field to the kubelet config, but the --max-pods command-line argument now passed by EKS overrides that field.

We can work around this, but we'll discuss it with the EKS team first, as there were some talks about deprecating max pods earlier.
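
For anyone who wants to confirm this on an affected node, both sources can be compared directly (paths are the Amazon Linux 2 defaults):

# Value eksctl wrote into the kubelet config file:
$ jq '.maxPods' /etc/kubernetes/kubelet/kubelet-config.json
# Flag(s) actually passed on the kubelet command line, which take precedence over the config file:
$ tr '\0' '\n' < /proc/$(pgrep -o -x kubelet)/cmdline | grep -- '--max-pods'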

@mathieu-lemay (Author):

> I have tracked it down to EKS supplying --max-pods as an argument to the kubelet. The implementation for maxPodsPerNode in eksctl writes the maxPods field to the kubelet config, but the --max-pods command-line argument now passed by EKS overrides that field.
>
> We can work around this, but we'll discuss it with the EKS team first, as there were some talks about deprecating max pods earlier.

Thanks for the update! In the meantime, we could work around the issue by setting resource requests on our pods instead of setting a hardcoded number of pods. We have been thinking about it for a while anyway; this was just the push we needed to take the time and do it.
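
For example, requests can be added to an existing workload from the CLI; the deployment name and values here are only placeholders:

# Let the scheduler pack pods by requested resources instead of a hard pod count.
$ kubectl set resources deployment/my-app --requests=cpu=250m,memory=256Mi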

@Himangini (Contributor):

@matthewdepietro tagging you here as per your request 👍🏻

@suket22 commented May 3, 2022

Adding some context on Managed Nodegroups' behavior: if the VPC CNI is running at version >= 1.9, Managed Nodegroups attempts to auto-calculate the value of maxPods and sets it on the kubelet, as @cPu1 has found. Managed Nodegroups looks at the different environment variables on the VPC CNI to determine what value to set (it essentially emulates the logic in this calculator script), taking into account PrefixDelegation, max ENIs, etc.
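
For reference, the non-prefix-delegation calculation boils down to ENIs * (IPv4 addresses per ENI - 1) + 2; the m5.large numbers below are the published ENI limits and reproduce the 29 seen above:

# max_pods = ENIs * (IPs per ENI - 1) + 2
$ ENIS=3; IPS_PER_ENI=10; echo $(( ENIS * (IPS_PER_ENI - 1) + 2 ))
29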

This logic should only be triggered when the managed nodegroup is being created without a custom AMI. When looking to override the kubelet config, it's recommended to specify an AMI in the launch template passed to CreateNodegroup, since you then get full control over all bootstrap parameters, including max pods.
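
To illustrate that route (a sketch only, not eksctl's implementation): with a custom AMI the user data owns the bootstrap call, so max pods can be pinned explicitly. The cluster name and value are taken from this issue; a real custom-AMI call also needs the API server endpoint and cluster CA:

#!/bin/bash
# Custom-AMI user data sketch: invoke the EKS bootstrap script directly and pass
# --max-pods straight through to kubelet.
/etc/eks/bootstrap.sh my-eks-1-22-cluster \
  --use-max-pods false \
  --kubelet-extra-args '--max-pods=12'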

@Himangini (Contributor):

We need to come up with a plan to support this as cleanly as possible without hacks.

Timebox: 1-2 days
Document the outcomes here.

@cPu1 self-assigned this Jun 13, 2022
@cPu1 (Contributor) commented Jun 15, 2022

Looking into this more, a clean solution to support max pods in eksctl is to resolve the AMI using SSM, pass it as a custom AMI to the MNG API, and use a custom bootstrap script that sets --max-pods to the supplied value when maxPodsPerNode is set. This approach, however, breaks eksctl upgrade nodegroup and requires eksctl to handle upgrades for nodegroups that have maxPodsPerNode set.

@cPu1 (Contributor) commented Jun 22, 2022

> Looking into this more, a clean solution to support max pods in eksctl is to resolve the AMI using SSM, pass it as a custom AMI to the MNG API, and use a custom bootstrap script that sets --max-pods to the supplied value when maxPodsPerNode is set. This approach, however, breaks eksctl upgrade nodegroup and requires eksctl to handle upgrades for nodegroups that have maxPodsPerNode set.

Alternatively, we can use a workaround/hack that modifies the bootstrap.sh script and removes the --max-pods argument passed in the launch template's user data generated by EKS. This is similar to how max-pods was implemented previously and requires less effort than the custom AMI approach.
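
Very roughly, the idea looks something like the following. This is an illustration only, not the change that eventually landed in #5808, and the file path and pattern are assumptions:

#!/bin/sh
# Illustrative sketch: drop any --max-pods value injected via the EKS-generated
# user data so the maxPods field already written to the kubelet config wins.
sed -i -E 's/--max-pods=[0-9]+//g' /etc/eks/bootstrap.sh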

@suket22 commented Jul 5, 2022

> and use a custom bootstrap script that sets --max-pods to the supplied value when maxPodsPerNode is set

This is the approach I'd be in favor of. maxPodsPerNode is essentially a property of the kubelet, and the only supported way to modify your kubeletConfiguration is by using custom AMIs with your managed nodegroup, so this approach makes sense to me.

I'm not sure I understood the mechanics of the workaround you'd mentioned. I think you meant you could edit the bootstrap script on the AMI itself and remove the max-pods argument that the MNG API tries to set, but I'm not sure I understand how eksctl would set the value of maxPodsPerNode on the kubelet itself. Lmk what I'm missing here.

In the long term, we've been thinking of rewriting the EKS bootstrap script so that kubelet parameter overrides can be specified within your UserData section, and it'll be honored however MNG bootstraps, but it's pending resourcing.

@Himangini (Contributor):

> Looking into this more, a clean solution to support max pods in eksctl is to resolve the AMI using SSM, pass it as a custom AMI to the MNG API, and use a custom bootstrap script that sets --max-pods to the supplied value when maxPodsPerNode is set. This approach, however, breaks eksctl upgrade nodegroup and requires eksctl to handle upgrades for nodegroups that have maxPodsPerNode set.
>
> Alternatively, we can use a workaround/hack that modifies the bootstrap.sh script and removes the --max-pods argument passed in the launch template's user data generated by EKS. This is similar to how max-pods was implemented previously and requires less effort than the custom AMI approach.

I am inclined towards this approach as well, rather than breaking eksctl upgrade nodegroup.

@cPu1 (Contributor) commented Jul 6, 2022

> I'm not sure I understood the mechanics of the workaround you'd mentioned. I think you meant you could edit the bootstrap script on the AMI itself and remove the max-pods argument that the MNG API tries to set

Correct.

> but I'm not sure I understand how eksctl would set the value of maxPodsPerNode on the kubelet itself. Lmk what I'm missing here.

eksctl will set it in the kubelet config, which will then be read by kubelet.

> In the long term, we've been thinking of rewriting the EKS bootstrap script so that kubelet parameter overrides can be specified within your UserData section, and it'll be honored however MNG bootstraps (awslabs/amazon-eks-ami#875), but it's pending resourcing.

Thanks for sharing this. I think we'll go with the workaround for now, given that we already have a similar workaround in place and it requires less effort than using a custom AMI with a custom bootstrap script. We'll revisit this approach after the EKS bootstrap script starts accepting kubelet parameter overrides.

@github-actions bot commented Aug 6, 2022

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions bot added the stale label Aug 6, 2022
@cPu1 removed the stale label Aug 8, 2022
@Himangini added the priority/important-longterm label Sep 2, 2022
@bryanasdev000 commented Sep 27, 2022

Just dumping this for reference:

https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/user_data.md#%EF%B8%8F-caveat
awslabs/amazon-eks-ami#873
awslabs/amazon-eks-ami#844

Also, maxPodsPerNode does not seem to work with the latest 1.21 AMIs anymore (awslabs/amazon-eks-ami@v20220824...v20220914).

I am using this in newly created clusters and it's still working: awslabs/amazon-eks-ami#844 (comment) (tested on 1.21, 1.22, and 1.23).

EDIT: It seems that it is working again in 1.21 for me with ami-051aa0d5889741142 (EKS 1.21/us-east-2) as of 2022-10-07.

@cPu1 (Contributor) commented Dec 1, 2022

> Just dumping this for reference:
>
> https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/user_data.md#%EF%B8%8F-caveat awslabs/amazon-eks-ami#873 awslabs/amazon-eks-ami#844
>
> Also, maxPodsPerNode does not seem to work with the latest 1.21 AMIs anymore (awslabs/amazon-eks-ami@v20220824...v20220914).
>
> I am using this in newly created clusters and it's still working: awslabs/amazon-eks-ami#844 (comment) (tested on 1.21, 1.22, and 1.23).
>
> EDIT: It seems that it is working again in 1.21 for me with ami-051aa0d5889741142 (EKS 1.21/us-east-2) as of 2022-10-07.

This was fixed by #5808. You should not run into this issue with a recent version of eksctl.

@matti (Contributor) commented Dec 10, 2022

@cPu1 are you sure that the fix in https://github.com/weaveworks/eksctl/pull/5808/files#diff-3a316f46904258df0dec1e9c9c1d6a89efb06e0637a5c0a6a930c162b5352498R99 is actually invoked? The sed appends it to KUBELET_EXTRA_ARGS, which is only used if --kubelet-extra-args is passed in.

@matti (Contributor) commented Dec 10, 2022

Okay, it does set it, but kubelet is running with --max-pods=110 --max-pods=123, where the latter is the maxPodsPerNode value.
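
One way to see which value wins on a node (the node name is a placeholder):

# Every --max-pods value on the running kubelet's command line:
$ ps -o args= -C kubelet | tr ' ' '\n' | grep -- '--max-pods'
# What the node actually advertises to the scheduler:
$ kubectl get node <node-name> -o jsonpath='{.status.capacity.pods}{"\n"}'

For a single-valued flag like --max-pods, the last occurrence on the command line normally takes effect, so the maxPodsPerNode value should win here, but it is worth confirming against the node's reported capacity.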
