
Add Vsphere Failure domains reconciling changes #9239

Merged

Conversation

@rajeshvenkata (Member) commented on Feb 11, 2025:

Issue #, if available:
https://github.com/aws/eks-anywhere-internal/issues/2995

Description of changes:
Add implementation changes for vSphere failure domains:

  • Add a feature gate for enabling failure domains in vSphere
  • Add vSphere failure domain reconciling changes
  • Add a new template for vSphere failure domains (a rough sketch of what it might render follows this list)
  • Add vSphere failure domains to the cluster RBAC
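
The template and RBAC changes themselves are not shown in this description. As a rough, hypothetical sketch only (it assumes the template renders CAPV VSphereFailureDomain and VSphereDeploymentZone objects; all names and the exact field mapping below are assumptions, not taken from this PR), the rendered objects and a matching RBAC rule could look roughly like this:

```yaml
# Hypothetical sketch: objects a vSphere failure domain template might render.
# All names and the field mapping are assumptions, not taken from this PR.
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereFailureDomain
metadata:
  name: fd-1
spec:
  region:
    name: dc-1
    type: Datacenter
    tagCategory: k8s-region
  zone:
    name: az-1
    type: ComputeCluster
    tagCategory: k8s-zone
  topology:
    datacenter: dc-1
    computeCluster: az-1
    datastore: az1-datastore
    networks:
      - vm-network-1
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereDeploymentZone
metadata:
  name: fd-1
spec:
  server: vcenter.example.com   # hypothetical vCenter endpoint
  failureDomain: fd-1
  controlPlane: false           # this PR covers worker node groups only
  placementConstraint:
    resourcePool: az1-resource-pool
    folder: az1-folder
---
# The cluster RBAC presumably needs access to the new failure domain resources.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: vsphere-failure-domains-example   # hypothetical name
rules:
  - apiGroups: ["infrastructure.cluster.x-k8s.io"]
    resources: ["vspherefailuredomains", "vspheredeploymentzones"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```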

Testing (if applicable):
Created a vSphere cluster with worker node groups using failure domains and verified that the VMs are built from the failure domain configs.

Failure domain definitions in the cluster spec: worker node group 1 uses fd-1 and worker node group 2 uses fd-2 (screenshot). A rough sketch of such a spec is included below.
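
The screenshots are not reproduced here. As a rough sketch only (the field names and layout are assumptions based on the failure domain parameters discussed later in this thread: computeCluster, resourcePool, datastore, folder, network; nothing is copied from the screenshots), the datacenter config and worker node group references might look like this:

```yaml
# Hypothetical sketch of failure domain definitions; exact field names may differ.
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: my-cluster
spec:
  datacenter: dc-1
  server: vcenter.example.com
  network: /dc-1/network/vm-network-1
  failureDomains:                 # assumed field name
    - name: fd-1
      computeCluster: az-1
      resourcePool: az1-resource-pool
      datastore: az1-datastore
      folder: az1-folder
      network: /dc-1/network/vm-network-1
    - name: fd-2
      computeCluster: az-2
      resourcePool: az2-resource-pool
      datastore: az2-datastore
      folder: az2-folder
      network: /dc-1/network/vm-network-1
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: my-cluster
spec:
  workerNodeGroupConfigurations:
    - name: md-0
      count: 2
      machineGroupRef:
        kind: VSphereMachineConfig
        name: my-cluster-worker-1
      failureDomains:             # assumed field name
        - fd-1
    - name: md-1
      count: 2
      machineGroupRef:
        kind: VSphereMachineConfig
        name: my-cluster-worker-2
      failureDomains:             # assumed field name
        - fd-2
```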

vSphere VM from the worker node group using fd-1 settings (screenshot)

vSphere VM from the worker node group using fd-2 settings (screenshot)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Feb 11, 2025
codecov bot commented on Feb 11, 2025:

Codecov Report

Attention: Patch coverage is 83.02752% with 37 lines in your changes missing coverage. Please review.

Project coverage is 72.41%. Comparing base (f0fc3fa) to head (b58253e).
Report is 22 commits behind head on main.

Files with missing lines | Patch % | Lines
pkg/providers/vsphere/reconciler/reconciler.go | 61.70% | 13 missing, 5 partials ⚠️
pkg/providers/vsphere/template.go | 82.92% | 5 missing, 2 partials ⚠️
pkg/providers/vsphere/failure_domains.go | 80.64% | 4 missing, 2 partials ⚠️
pkg/providers/vsphere/vsphere.go | 0.00% | 4 missing, 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9239      +/-   ##
==========================================
+ Coverage   72.35%   72.41%   +0.05%     
==========================================
  Files         587      589       +2     
  Lines       46132    46354     +222     
==========================================
+ Hits        33380    33565     +185     
- Misses      11003    11030      +27     
- Partials     1749     1759      +10     


@rajeshvenkata force-pushed the vsphere-failuredomains-initial-pr branch from e6c5e94 to 6c51e9e on February 12, 2025 19:29
@rajeshvenkata (Member, Author) commented:

/retest-required

@rajeshvenkata force-pushed the vsphere-failuredomains-initial-pr branch from 6c51e9e to 34df77e on February 12, 2025 23:41
@eks-distro-bot added the size/XL label (500-999 lines, ignoring generated files) and removed the size/L label on Feb 13, 2025
@rajeshvenkata force-pushed the vsphere-failuredomains-initial-pr branch 4 times, most recently from 230b9bf to 2d97d1b on February 13, 2025 22:07
@rajeshvenkata force-pushed the vsphere-failuredomains-initial-pr branch from 2d97d1b to 7ac7afd on February 13, 2025 22:24
@eks-distro-bot added the size/XXL label (1000+ lines, ignoring generated files) and removed the size/XL label on Feb 14, 2025
@rajeshvenkata force-pushed the vsphere-failuredomains-initial-pr branch from 3e94fb2 to 118e5de on February 14, 2025 05:56
@@ -1083,6 +1083,12 @@ func TestGetAndValidateClusterConfig(t *testing.T) {
wantCluster: nil,
wantErr: true,
},
{
testName: "valid cluster config with vsphere failure domains disabled",
fileName: "testdata/vsphere_cluster_valid_failuredomains.yaml",
A reviewer (Member) commented:
Can we use wantCluster instead of fileName? We want to stop managing file names as they are harder to maintain.

rajeshvenkata (Member, Author) replied:
The file is used to generate the input cluster; it is not used for checking the resulting cluster. This is how the current test works. Removing the file dependency in favor of a cluster object would require refactoring all the other tests in this file, so we can keep this for now and address it later as part of a refactoring.

@@ -88,6 +90,44 @@ func (v *Validator) ValidateVCenterConfig(ctx context.Context, datacenterConfig
return nil
}

// ValidateFailureDomains validates the provided list of failure domains.
func (v *Validator) ValidateFailureDomains(vsphereClusterSpec *Spec) error {
A reviewer (Member) commented:
Do we have any validations on the fields themselves?

rajeshvenkata (Member, Author) replied:
Not in this change. I have an action item to validate that each field exists in vSphere: https://github.com/aws/eks-anywhere-internal/issues/3004

@rajeshvenkata force-pushed the vsphere-failuredomains-initial-pr branch from 118e5de to b58253e on February 15, 2025 02:26
@rajeshvenkata (Member, Author) commented:

/approve

@eks-distro-bot (Collaborator) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rajeshvenkata

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@eks-distro-bot merged commit cdaf14e into aws:main on Feb 19, 2025 (13 checks passed)
@pawcykca commented on Mar 3, 2025:

@rajeshvenkata Is a similar solution planned for the control plane (distributing nodes across different failure domains/AZs)?

@rajeshvenkata (Member, Author) commented on Mar 4, 2025:

@pawcykca Thanks for the feedback and your interest in the failure domains feature. We have just launched failure domain support for worker nodes in the new release, v0.22.

A few questions regarding the control plane support ask:

  1. What specific requirements are you interested in for the control plane? Do you want to distribute control plane nodes across availability zones with different networks, or is keeping them on the same network acceptable?

  2. Are you also looking for external etcd clusters to be supported by the failure domain feature?

@pawcykca commented on Mar 4, 2025:

@rajeshvenkata thanks for the answer ;-)

  1. I'm interested in a feature that allows distributing control plane nodes across availability zones (different clusters or cluster resource pools defined in vCenter) with different storage (datastore) and the same network.

The definition can use the same parameters as the worker node failure domains:

  • computeCluster
  • resourcePool
  • datastore
  • folder
  • network (optional for me, but maybe there is some use case where it could be helpful)

=== Example for 2 AZs (failure domains) ===
Datacenter
  AZ1 (computeCluster/resourcePool)
    Hosts (ESXi)
  AZ2 (computeCluster/resourcePool)
    Hosts (ESXi)

Datastores:
  AZ1_Datastore attached to ESXi in AZ1 (datastore)
  AZ2_Datastore attached to ESXi in AZ2 (datastore)

EKSA Control Plane:
  2 nodes in AZ1
  1 node in AZ2

=== Example for 3 AZs (failure domains) ===
Datacenter
  AZ1 (cluster)
    Hosts (ESXi)
  AZ2 (cluster)
    Hosts (ESXi)
  AZ3 (cluster)
    Hosts (ESXi)

Datastores:
  AZ1_Datastore (attached to ESXi in AZ1)
  AZ2_Datastore (attached to ESXi in AZ2)
  AZ3_Datastore (attached to ESXi in AZ3)

EKSA Control Plane:
  1 node in AZ1
  1 node in AZ2
  1 node in AZ3

  2. I currently use mainly the stacked etcd solution, but it would be nice if external etcd clusters also supported the failure domain feature, as described above.
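
For illustration only, a hypothetical shape of the requested control plane support (this API does not exist today; it simply mirrors the worker node group reference above, and every failure domain field here is an assumption) could be:

```yaml
# Hypothetical only: control plane failure domains are a feature request in this
# thread, not an existing EKS Anywhere API.
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: my-cluster
spec:
  controlPlaneConfiguration:
    count: 3
    machineGroupRef:
      kind: VSphereMachineConfig
      name: my-cluster-cp
    failureDomains:   # hypothetical field
      - fd-1          # 1 node in AZ1
      - fd-2          # 1 node in AZ2
      - fd-3          # 1 node in AZ3
```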

@rajeshvenkata (Member, Author) commented:

@pawcykca Thank you for sharing the info and the use cases. We will let you know if we have any further questions.

Labels: approved, lgtm, size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files)