Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🌱 Cleanup separate unstructuredCachingClient #10692

Merged
merged 1 commit into from
May 30, 2024

Conversation

sbueringer
Copy link
Member

Signed-off-by: Stefan Büringer buringerst@vmware.com

What this PR does / why we need it:
Noticed that we still have the separate unstructuredCachingClient when reviewing Christians CRS PR.

This PR removes the separate client and instead configured the manager default client to cache all unstructured calls.

I've opened a separate PR (#10691) to check if we were still using uncached unstructured get/list calls anywhere (to see if thoes additional usages are safe).

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area PR is missing an area label labels May 28, 2024
@sbueringer
Copy link
Member Author

/test pull-cluster-api-e2e-main

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 28, 2024
@sbueringer
Copy link
Member Author

/hold

to check logs of #10691 e2e-min for uncached unstructured get/list occurences first

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 28, 2024
@k8s-ci-robot k8s-ci-robot requested review from chrischdi and elmiko May 28, 2024 13:26
@sbueringer sbueringer added the area/provider/core Issues or PRs related to the core provider label May 28, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-area PR is missing an area label label May 28, 2024
@sbueringer
Copy link
Member Author

sbueringer commented May 28, 2024

/hold cancel

Test results from #10691

To summarize. The only places where we used uncached get/list calls for Unstructured on main are:

  • MachinePool controller
  • Machine controller isDeleteNodeAllowed
  • Cluster webhook validateTopologyControlPlaneVersion

I think in all 3 cases it's okay to also use cached calls. So I think it's fine to just always cache unstructured.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 28, 2024
@sbueringer
Copy link
Member Author

(I'll take a look at the unit test failures)

Signed-off-by: Stefan Büringer buringerst@vmware.com
@sbueringer sbueringer force-pushed the pr-cache-unstructured branch from 1824136 to 4714c48 Compare May 28, 2024 18:06
@@ -338,7 +338,7 @@ func TestCloneConfigsAndGenerateMachine(t *testing.T) {
},
},
}
g.Expect(env.Create(ctx, genericInfrastructureMachineTemplate)).To(Succeed())
g.Expect(env.CreateAndWait(ctx, genericInfrastructureMachineTemplate)).To(Succeed())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I had to change this because now the tests are also using cached unstructured get/list calls (production code in KCP always did)

@sbueringer
Copy link
Member Author

/test pull-cluster-api-e2e-main

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented May 28, 2024

@sbueringer: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-apidiff-main 4714c48 link false /test pull-cluster-api-apidiff-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@enxebre
Copy link
Member

enxebre commented May 29, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 29, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: ff72e4c0092ffbea477e7b93a5a8d3427ce933b7

@@ -42,29 +42,26 @@ import (

// ClusterReconciler reconciles a Cluster object.
type ClusterReconciler struct {
Client client.Client
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add the fields here again (with //Deprecated and without using them) if you want.

This would allow folks one more minor release to not set these fields. Not sure if it makes much difference to be honest

Copy link
Member

@fabriziopandini fabriziopandini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Nice cleanup!
I have only one question: is there a way to check if before this change the regular Client was used for reading unstructured somewhere, and check if this code might be relying to the fact that the read was not cached?

Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 30, 2024
@k8s-ci-robot k8s-ci-robot merged commit b677359 into kubernetes-sigs:main May 30, 2024
21 of 22 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.8 milestone May 30, 2024
@sbueringer sbueringer deleted the pr-cache-unstructured branch May 31, 2024 09:40
@sbueringer
Copy link
Member Author

sbueringer commented May 31, 2024

/lgtm

Nice cleanup! I have only one question: is there a way to check if before this change the regular Client was used for reading unstructured somewhere, and check if this code might be relying to the fact that the read was not cached?

I basically did the following:

This way I identified the following cases: (see diff on #10691 for the first two)

  • MachinePool controller
  • Machine controller isDeleteNodeAllowed
  • Cluster webhook validateTopologyControlPlaneVersion (
    cp, err := external.Get(ctx, ctrlClient, oldCluster.Spec.ControlPlaneRef, oldCluster.Namespace)
    )

After that I looked manually at the cases and they all seem safe to me.

I think aparat from that we would have the following options.

To identify those cases:

  • I imagine it's possible to write a tool to identify in which cases mgr.GetClient() is used to do unstructured get/list call
    • But it's definitely not trivial as we just pass client.Client's around in our code base
    • Manually assessing every single Client.Get/List call in our entire code base would also work. But this would take a lot of time because we always have to go up the entire stack trace to find out which client is used.

To assess if these cases are problematic:

  • I don't think there is an automated way to deduct if the logic behind our get/list calls still works if the object is not coming live from the apiserver or is potentially <1-2ms outdated (which is usually the diff). (even the object we get from the apiserver could already be outdated immediately after we retrieved it)
    • So I think the problematic cases are something like: create a Machine, do a list call very soon after, create another Machine because the first one was not observed
  • I think the only good option there is to manually assess.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/core Issues or PRs related to the core provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants