Enable hardware Provisioning through ISO booting for baremetal Provider #9213

Merged
merged 1 commit into from
Feb 5, 2025

Conversation

rahulbabu95
Member

@rahulbabu95 rahulbabu95 commented Jan 31, 2025

Issue #, if available:
https://github.com/aws/eks-anywhere-internal/issues/2212

Description of changes:
Upstream CAPT/Tink recently added support for booting hardware from an ISO. This change extends our cluster spec, notably TinkerbellDatacenterConfig, so that hardware can be provisioned using an ISO. ISO booting removes the requirement that the admin machine provisioning the cluster be in the same L2 network as the hardware, and it also offers static IPAM. The PR adds the changes required in our TinkerbellMachineTemplate to ISO boot; the default behavior is still to netboot. It also changes Tinkerbell stack installation so that Smee has all the flags it needs to serve the ISO, and it handles toggling between the bootstrap IP and the actual Tinkerbell IP.

Additionally, the previous implementation of the toggle from the bootstrap IP to the actual Tinkerbell IP, applied when moving the cluster from kind to the actual CAPI workload cluster, had a bug: the underlying eks-a cluster object always kept the bootstrap-ip annotation, so the controller always picked up the host IP for the Hegel URLs. This change fixes that bug by actually updating the underlying eks-a cluster object.

Testing (if applicable):
Manually created a cluster using a Hook ISO image.

Documentation added/planned (if applicable):

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from rahulbabu95. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@eks-distro-bot eks-distro-bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jan 31, 2025
@rahulbabu95
Member Author

/hold

codecov bot commented Jan 31, 2025

Codecov Report

Attention: Patch coverage is 54.87805% with 74 lines in your changes missing coverage. Please review.

Project coverage is 72.32%. Comparing base (0ac372d) to head (d360b25).
Report is 9 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/clustermanager/cluster_manager.go | 2.94% | 64 Missing and 2 partials ⚠️ |
| pkg/providers/tinkerbell/stack/stack.go | 87.09% | 4 Missing ⚠️ |
| pkg/providers/tinkerbell/validate.go | 33.33% | 3 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9213      +/-   ##
==========================================
- Coverage   72.37%   72.32%   -0.05%     
==========================================
  Files         585      587       +2     
  Lines       45708    46056     +348     
==========================================
+ Hits        33079    33309     +230     
- Misses      10890    10998     +108     
- Partials     1739     1749      +10     


@eks-distro-bot eks-distro-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 31, 2025
@rahulbabu95
Member Author

/retest

@panktishah26
Member

/test eks-anywhere-release-tooling-test-presubmit

@eks-distro-bot
Collaborator

@rahulbabu95: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| eks-anywhere-release-tooling-test-presubmit | 70428c3 | link | true | /test eks-anywhere-release-tooling-test-presubmit |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@rahulbabu95 rahulbabu95 force-pushed the l3/add-boot-options branch 2 times, most recently from 5f4532c to bfa2b34 Compare February 5, 2025 00:40
if config.Spec.IsoBoot {
	if config.Spec.HookIsoURL != "" {
		if _, err := url.ParseRequestURI(config.Spec.HookIsoURL); err != nil {
			return fmt.Errorf("parsing isoURL: %v", err)
Member

Nothing to hold up the PR, but something to think about. Because our code base doesn't have a good presentation layer, error messages like this usually get surfaced to the user. So while an error message that references some internal or local field or variable name is normally OK, you should think about what name makes sense from the user's perspective.

Member Author

I think providing the exact field name does help from a user's perspective, since they can go back to the spec and find the exact field that causes the error. That being said, I will add more context around the error.
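
One way to reconcile the two views above can be sketched as follows: keep the exact field in the message, but name it the way it appears in the user's cluster spec and include the offending value. This is only an illustration, not the code that was merged. `DatacenterSpec` and `validateISOBoot` are hypothetical names, the YAML path `spec.hookIsoURL` is an assumed spelling of the field, and returning nil for an empty URL is an assumption about behavior the hunk does not show.

```go
package main

import (
	"fmt"
	"net/url"
)

// DatacenterSpec is a hypothetical, trimmed-down stand-in for the
// TinkerbellDatacenterConfig spec; it carries only the two fields discussed.
type DatacenterSpec struct {
	IsoBoot    bool
	HookIsoURL string
}

// validateISOBoot parses the ISO URL only when ISO boot is enabled and a URL
// is set. The error names the field by its (assumed) path in the cluster spec
// and echoes the bad value so the user can locate it quickly.
func validateISOBoot(spec DatacenterSpec) error {
	if !spec.IsoBoot || spec.HookIsoURL == "" {
		// Assumption: nothing to validate in these cases.
		return nil
	}
	if _, err := url.ParseRequestURI(spec.HookIsoURL); err != nil {
		return fmt.Errorf("invalid TinkerbellDatacenterConfig spec.hookIsoURL %q: %v", spec.HookIsoURL, err)
	}
	return nil
}

func main() {
	// A value that is neither an absolute URI nor an absolute path is rejected.
	fmt.Println(validateISOBoot(DatacenterSpec{IsoBoot: true, HookIsoURL: "not a url"}))
	// A well-formed URL passes; example.com is a placeholder host.
	fmt.Println(validateISOBoot(DatacenterSpec{IsoBoot: true, HookIsoURL: "https://example.com/hook.iso"}))
}
```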

@@ -26,6 +26,13 @@ type TinkerbellDatacenterConfigSpec struct {
SkipLoadBalancerDeployment bool `json:"skipLoadBalancerDeployment,omitempty"`
// LoadBalancerInterface can be used to configure a load balancer interface for the Tinkerbell stack.
LoadBalancerInterface string `json:"loadBalancerInterface,omitempty"`
// IsoBoot can be used to indicate that the hardware must boot using an ISO.
Member

must -> should

// IsoBoot can be used to indicate that the hardware must boot using an ISO.
//+optional
IsoBoot bool `json:"isoBoot,omitempty"`
// HookIsoURL is the URL of ISO image that will one time boot.
Member

"One time boot" is indeed accurate, but is this how you would describe it to a user? Are there other ways to describe the ISO, such as it being used during the provisioning process?

Comment on lines 69 to 72
Install(ctx context.Context, bundle releasev1alpha1.TinkerbellBundle, tinkerbellIP, kubeconfig, hookOverride, isoOverride string, opts ...InstallOption) error
UninstallLocal(ctx context.Context) error
Uninstall(ctx context.Context, bundle releasev1alpha1.TinkerbellBundle, kubeconfig string) error
Upgrade(_ context.Context, _ releasev1alpha1.TinkerbellBundle, tinkerbellIP, kubeconfig, hookOverride string, opts ...InstallOption) error
Upgrade(_ context.Context, _ releasev1alpha1.TinkerbellBundle, tinkerbellIP, kubeconfig, hookOverride, isoOverride string, opts ...InstallOption) error
Member

These two interface methods have ...InstallOption. This should allow extending these methods without modifying their function signatures.

Member Author

Added the field to the Installer struct and updated using the functional options pattern!
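
The functional options approach mentioned in this thread can be sketched roughly as below. This is not the merged implementation: `WithISOOverride`, the `isoOverride` field, and the simplified `Install` signature are hypothetical, and the real method takes more parameters. The point is only that a new setting becomes an option value rather than a new positional argument.

```go
package main

import "fmt"

// Installer is a hypothetical stand-in for the Tinkerbell stack installer.
type Installer struct {
	isoOverride string
}

// InstallOption mutates the Installer before an install runs.
type InstallOption func(*Installer)

// WithISOOverride is a hypothetical option that sets the ISO URL override.
func WithISOOverride(isoURL string) InstallOption {
	return func(i *Installer) { i.isoOverride = isoURL }
}

// Install applies any options and then reports what it would do. Its
// signature stays the same no matter how many settings are added later.
func (i *Installer) Install(tinkerbellIP string, opts ...InstallOption) string {
	for _, opt := range opts {
		opt(i)
	}
	if i.isoOverride == "" {
		return fmt.Sprintf("installing on %s with the default ISO", tinkerbellIP)
	}
	return fmt.Sprintf("installing on %s with ISO %s", tinkerbellIP, i.isoOverride)
}

func main() {
	// Callers that do not need an override are unaffected.
	fmt.Println((&Installer{}).Install("10.0.0.2"))
	// Callers that do need one pass an option; the URL is a placeholder.
	fmt.Println((&Installer{}).Install("10.0.0.2", WithISOOverride("https://example.com/hook.iso")))
}
```

Existing call sites compile unchanged under this pattern, which is the benefit the reviewer points to.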

@@ -11,6 +11,9 @@ import (
"github.com/aws/eks-anywhere/pkg/constants"
)

// GofishProviderOption is the provider name for Redfish Provider in Rufio.
const GofishProviderOption = "gofish"
Member

Does this need to be exported?

Member Author

Good question. The test file uses this constant to test providerOptions. I do not want to maintain two copies of this constant value, hence I chose to export it.

Upstream CAPT/Tink recently added support for booting the hardware
through ISO. This change adds the necessary changes to our cluster spec
notably TinkerbellDataCenterConfig to be able to provision the hardware
using an ISO. ISO booting removes the dependency of the admin machine to
provision the cluster to be in the same L2 as the hardware. Additionally
offers static IPAM. The PR adds necessary changes required in our
TinkerbellmachineTemplate to be able to ISO boot. The default behavior
would still be to netboot. PR also adds required changes at the time of
Tinkerbell stack installation so that Smee has all the required flags
set to be able to serve ISO and handles toggling between the bootstrap
IP and the actual TinkerbelIP.

Signed-off-by: Rahul Ganesh <rahulgab@amazon.com>
@rahulbabu95
Member Author

/unhold

@rahulbabu95
Member Author

/override codecov/patch

@eks-distro-bot
Collaborator

@rahulbabu95: /override requires failed status contexts, check run or a prowjob name to operate on.
The following unknown contexts/checkruns were given:

  • codecov/patch

Only the following failed contexts/checkruns were expected:

  • eks-anywhere-cli-attribution-presubmit
  • eks-anywhere-cluster-controller-tooling-presubmit
  • eks-anywhere-e2e-presubmit
  • eks-anywhere-generate-files-presubmit
  • eks-anywhere-presubmit
  • tide

If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.

In response to this:

/override codecov/patch

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rahulbabu95 rahulbabu95 merged commit 21b4f95 into aws:main Feb 5, 2025
11 of 13 checks passed
@csplinter csplinter mentioned this pull request Feb 28, 2025
Labels
lgtm size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

4 participants