Scorecard does not wait for finalisers to complete before removing operator #2094
Comments
Hi @oseoin, thank you for raising this. Could you please give us the following information?
I'm hitting a similar problem as well (stdout doesn't show anything).
Scorecard command (if I don't specify cr-manifest and csv-path it doesn't work):
GO111MODULE=on operator-sdk scorecard --config .osdk-scorecard.yaml --cr-manifest deploy/crds/ibm_v1alpha1_csiscaleoperator_cr.yaml --csv-path deploy/olm-catalog/ibm-spectrum-scale-csi-operator/0.0.1/ibm-spectrum-scale-csi-operator.v0.0.1.clusterserviceversion.yaml
scorecard:
init-timeout: 120
output: json
plugins:
- basic:
cr-manifest:
- "deploy/crds/ibm_v1alpha1_csiscaleoperator_cr.yaml"
- olm:
cr-manifest:
- "deploy/crds/ibm_v1alpha1_csiscaleoperator_cr.yaml"
csv-path: "deploy/olm-catalog/ibm-spectrum-scale-csi-operator/0.0.1/ibm-spectrum-scale-csi-operator.v0.0.1.clusterserviceversion.yaml"
I just did two configs of my operator and ran both versions through Travis, getting the following results:
Scorecard 100%
Scorecard 50%
I've also done a run with cluster-info before and after the scorecard run: https://travis-ci.org/mew2057/ibm-spectrum-scale-csi-operator/jobs/602927683
Hi @mew2057, thank you for the great info you provided. I would like to isolate the scenario to find out whether this is a scorecard issue or a finalizer issue with the Ansible operator. So, could you check it outside of the scorecard? I mean, could you install everything (make install), check that everything is running, and then delete the objects (make uninstall) to see if it gets stuck? See here a Makefile with the commands.
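In sketch form, that isolation check looks like the following (target names come from the linked Makefile; the namespace is the one this project uses):
make install                          # create namespace, CRD, RBAC, operator, and CR
kubectl get pods -n csiscaleoperator  # confirm the operator pod reaches Running
time make uninstall                   # watch whether deletion hangs on the finaliser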
Hi @mew2057, also, checking your logs I noticed that you are using Python 2.
So, try using Python 3. The following shows how to add it in your .travis file:
# Python 3.6 is the default Python when language: python.
# But when not language: python, Python 3 can not be used by default.
# There is pip (/usr/bin/pip), but not pip3.
# Because of it the global env is required.
env:
global:
- PATH=/opt/python/3.6.7/bin:$PATH
# Install python3 and utilities dependencies
addons:
apt:
packages:
- "python3"
- "python3-pip" I am guessing that it can be the root cause since because of the deprecation we upgrade it to use Python3. See the changelog.
Please, let us know if by using Python3 solve your scenario. |
Makefile
For the suggested make install / make uninstall check:
$ time make install
....... Creating namespace .......
kubectl create namespace csiscaleoperator
namespace/csiscaleoperator created
....... Applying CRDS and Operator .......
kubectl apply -f deploy/crds/ibm_v1alpha1_csiscaleoperator_crd.yaml -n csiscaleoperator
customresourcedefinition.apiextensions.k8s.io/csiscaleoperators.scale.ibm.com created
....... Applying Rules and Service Account .......
kubectl apply -f deploy/role.yaml -n csiscaleoperator
role.rbac.authorization.k8s.io/ibm-spectrum-scale-csi-operator created
clusterrole.rbac.authorization.k8s.io/ibm-spectrum-scale-csi-operator created
clusterrole.rbac.authorization.k8s.io/csi-nodeplugin created
clusterrole.rbac.authorization.k8s.io/external-attacher-runner created
clusterrole.rbac.authorization.k8s.io/external-provisioner-runner created
kubectl apply -f deploy/role_binding.yaml -n csiscaleoperator
rolebinding.rbac.authorization.k8s.io/ibm-spectrum-scale-csi-operator created
clusterrolebinding.rbac.authorization.k8s.io/ibm-spectrum-scale-csi-operator created
clusterrolebinding.rbac.authorization.k8s.io/csi-nodeplugin created
clusterrolebinding.rbac.authorization.k8s.io/external-provisioner-runner created
clusterrolebinding.rbac.authorization.k8s.io/external-attacher-runner created
kubectl apply -f deploy/service_account.yaml -n csiscaleoperator
serviceaccount/ibm-spectrum-scale-csi-operator created
serviceaccount/external-attacher-runner created
serviceaccount/csi-nodeplugin created
serviceaccount/external-provisioner-runner created
....... Applying Operator .......
kubectl apply -f deploy/operator.yaml -n csiscaleoperator
deployment.apps/ibm-spectrum-scale-csi-operator created
....... Creating the Cluster .......
kubectl apply -f deploy/crds/ibm_v1alpha1_csiscaleoperator_cr.yaml -n csiscaleoperator
csiscaleoperator.scale.ibm.com/csi-scale-operator created
real 0m2.436s
user 0m1.427s
sys 0m0.227s
$ time make uninstall
....... Uninstalling .......
....... Deleting CRDs.......
kubectl delete -f deploy/crds/ibm_v1alpha1_csiscaleoperator_crd.yaml -n csiscaleoperator
warning: deleting cluster-scoped resources, not scoped to the provided namespace
customresourcedefinition.apiextensions.k8s.io "csiscaleoperators.scale.ibm.com" deleted
....... Deleting Rules and Service Account .......
kubectl delete -f deploy/role.yaml -n csiscaleoperator
role.rbac.authorization.k8s.io "ibm-spectrum-scale-csi-operator" deleted
warning: deleting cluster-scoped resources, not scoped to the provided namespace
clusterrole.rbac.authorization.k8s.io "ibm-spectrum-scale-csi-operator" deleted
clusterrole.rbac.authorization.k8s.io "csi-nodeplugin" deleted
clusterrole.rbac.authorization.k8s.io "external-attacher-runner" deleted
clusterrole.rbac.authorization.k8s.io "external-provisioner-runner" deleted
kubectl delete -f deploy/role_binding.yaml -n csiscaleoperator
rolebinding.rbac.authorization.k8s.io "ibm-spectrum-scale-csi-operator" deleted
warning: deleting cluster-scoped resources, not scoped to the provided namespace
clusterrolebinding.rbac.authorization.k8s.io "ibm-spectrum-scale-csi-operator" deleted
clusterrolebinding.rbac.authorization.k8s.io "csi-nodeplugin" deleted
clusterrolebinding.rbac.authorization.k8s.io "external-provisioner-runner" deleted
clusterrolebinding.rbac.authorization.k8s.io "external-attacher-runner" deleted
kubectl delete -f deploy/service_account.yaml -n csiscaleoperator
serviceaccount "ibm-spectrum-scale-csi-operator" deleted
serviceaccount "external-attacher-runner" deleted
serviceaccount "csi-nodeplugin" deleted
serviceaccount "external-provisioner-runner" deleted
....... Deleting Operator .......
kubectl delete -f deploy/operator.yaml -n csiscaleoperator
deployment.apps "ibm-spectrum-scale-csi-operator" deleted
....... Deleting namespace csiscaleoperator.......
kubectl delete namespace csiscaleoperator
namespace "csiscaleoperator" deleted
real 1m14.311s
user 0m0.732s
sys 0m0.099s
Makefile (modified from example):
[root ibm-spectrum-scale-csi-operator]# cat Makefile
.DEFAULT_GOAL:=help
SHELL:=/bin/bash
NAMESPACE=csiscaleoperator
# Modified from operator framework sample.
##@ Application
install: ## Install all resources (CRs/CRDs, RBAC and Operator)
@echo ....... Creating namespace .......
- kubectl create namespace ${NAMESPACE}
@echo ....... Applying CRDS and Operator .......
- kubectl apply -f deploy/crds/ibm_v1alpha1_csiscaleoperator_crd.yaml -n ${NAMESPACE}
@echo ....... Applying Rules and Service Account .......
- kubectl apply -f deploy/role.yaml -n ${NAMESPACE}
- kubectl apply -f deploy/role_binding.yaml -n ${NAMESPACE}
- kubectl apply -f deploy/service_account.yaml -n ${NAMESPACE}
@echo ....... Applying Operator .......
- kubectl apply -f deploy/operator.yaml -n ${NAMESPACE}
@echo ....... Creating the Cluster .......
- kubectl apply -f deploy/crds/ibm_v1alpha1_csiscaleoperator_cr.yaml -n ${NAMESPACE}
uninstall: ## Uninstall all resources that were installed by make install
@echo ....... Uninstalling .......
@echo ....... Deleting CRDs.......
- kubectl delete -f deploy/crds/ibm_v1alpha1_csiscaleoperator_crd.yaml -n ${NAMESPACE}
@echo ....... Deleting Rules and Service Account .......
- kubectl delete -f deploy/role.yaml -n ${NAMESPACE}
- kubectl delete -f deploy/role_binding.yaml -n ${NAMESPACE}
- kubectl delete -f deploy/service_account.yaml -n ${NAMESPACE}
@echo ....... Deleting Operator .......
- kubectl delete -f deploy/operator.yaml -n ${NAMESPACE}
@echo ....... Deleting namespace ${NAMESPACE}.......
- kubectl delete namespace ${NAMESPACE}
.PHONY: help
help: ## Display this help
	@awk 'BEGIN {FS = ":.*##"; printf "\nUsage:\n make \033[36m<target>\033[0m\n"} /^[a-zA-Z0-9_-]+:.*?##/ { printf " \033[36m%-15s\033[0m %s\n", $$1, $$2 } /^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) } ' $(MAKEFILE_LIST)
Environment Change
I updated my config to match the recommendation and it still looks like I'm hitting the timeout on the resource cleanup: https://travis-ci.org/mew2057/ibm-spectrum-scale-csi-operator/jobs/603884581#L358
Thoughts
Is the timeout tripping on the delete because the init-timeout is not being properly loaded by the scorecard from the config file? What if I pass it explicitly on the command line?
$ operator-sdk scorecard --init-timeout 120 --config .osdk-scorecard.yaml --cr-manifest deploy/crds/ibm_v1alpha1_csiscaleoperator_cr.yaml --csv-path deploy/olm-catalog/ibm-spectrum-scale-csi-operator/0.0.1/ibm-spectrum-scale-csi-operator.v0.0.1.clusterserviceversion.yaml
INFO[0000] Using config file: .osdk-scorecard.yaml
WARN[0000] Plugin directory not found; skipping external plugins: stat scorecard: no such file or directory
WARN[0035] time="2019-10-28T06:02:14-07:00" level=info msg="a cleanup function failed with error: cleanup function failed: timed out waiting for the condition\n"
time="2019-10-28T06:02:29-07:00" level=info msg="a cleanup function failed with error: cleanup function failed: timed out waiting for the condition\n"
time="2019-10-28T06:02:29-07:00" level=error msg="Failed to cleanup resources: (a cleanup function failed; see stdout for more details)"
ERRO[0035] Plugin `olm-tests` failed with error (failed to create global resources: object is being deleted: customresourcedefinitions.apiextensions.k8s.io "csiscaleoperators.scale.ibm.com" already exists)
Basic Tests:
Spec Block Exists: 1/1
Status Block Exists: 1/1
Writing into CRs has an effect: 1/1
Failed Plugin: olm-tests:
Total Score: 50%
Apparently not.
Addendum
I ran make uninstall:
....... Uninstalling .......
....... Deleting CRDs.......
kubectl delete -f deploy/crds/ibm_v1alpha1_csiscaleoperator_crd.yaml -n csiscaleoperator
warning: deleting cluster-scoped resources, not scoped to the provided namespace
customresourcedefinition.apiextensions.k8s.io "csiscaleoperators.scale.ibm.com" deleted
It's been sitting there for about five minutes. I'm going to look at the scorecard source and try to recreate these conditions by hand so I can trace where the hang is occurring, because the Custom Resource is not showing any information.
Hi @oseoin, could you share your role.yaml file? Also, could you try using the same permissions that we have in the samples and let us know if that solves your issue? Note that if you edit the CRD when it gets stuck and remove the finalizer, it will be deleted. So it is possible that this is related to the permissions that allow the operator to remove the finalizer.
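For reference, a hedged example of manually clearing a stuck finaliser (resource and namespace names taken from the output earlier in this thread; clearing finalisers skips the cleanup they guard, so treat it as a diagnostic step only):
# Inspect the finalisers on the stuck CR
kubectl get csiscaleoperator csi-scale-operator -n csiscaleoperator \
  -o jsonpath='{.metadata.finalizers}'
# Remove them so the pending delete can proceed (diagnostic only)
kubectl patch csiscaleoperator csi-scale-operator -n csiscaleoperator \
  --type=merge -p '{"metadata":{"finalizers":[]}}'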
Is this related to #1513? I've tried a lot of different configs (bumped to 0.11.0, verified the role was present, and even dug through the scorecard logic) and I can't identify any other causes. Additionally, clearing the CR's finalizer will clear the CR and CRD 100% of the time after the scorecard fails.
Hi @camilamacedo86, apologies for the delay in getting back to you. I tested using |
I played around this morning and was able to recreate your error, I think:
jeffmc@doppio114 ~/projects/memcached-operator> operator-sdk scorecard -o text
I did this by artificially putting a sleep of 75 seconds in the finalizer function. This causes an issue where you can't remove the CR without manually clearing the finalizer. Adjusting the scorecard config to use an init-timeout greater than 75 seconds still fails:
INFO[0000] Using config file: /home/jeffmc/projects/memcached-operator/.osdk-scorecard.yaml
Based on my debugging so far, it appears the scorecard deletes resources in an order like this:
What I'm suspecting is that scorecard assumes that when it removes the CRD, any associated CRs will be removed with it. This is not happening when a finalizer is specified; you instead see the CRD deletion hang and the CR not being removed. It seems to me the CR should be removed prior to the CRD being removed, which works around this hang.
The
Deleting the CRD causes all CRs of that type to be deleted, but if the CR has a finalizer, then the operator still needs to be running. So we would need to make sure that our code waits until the CRDs from step 1 are actually deleted before proceeding to step 2.
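A minimal sketch of that "delete, then wait for finalizers" step using controller-runtime's client (names and types are illustrative, not the scorecard's actual cleanup code):
package cleanup

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteAndWait deletes obj and polls until the API server reports it gone,
// giving any finalizers time to run while the operator is still deployed.
func deleteAndWait(ctx context.Context, c client.Client, obj client.Object, timeout time.Duration) error {
	if err := c.Delete(ctx, obj); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	key := client.ObjectKeyFromObject(obj)
	return wait.PollImmediate(time.Second, timeout, func() (bool, error) {
		err := c.Get(ctx, key, obj)
		if apierrors.IsNotFound(err) {
			return true, nil // fully deleted; finalizers have completed
		}
		return false, nil // still terminating; keep polling
	})
}
The point of the ordering is that the CRs get deleted this way first, while the operator deployment still exists to process their finalizers; only then are the CRD and the operator resources removed.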
@joelanford I confirmed that order does indeed work, so the fix here would be to change scorecard to delete in that order it sounds like.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale
This is still very much a problem.
fyi, I'm picking this one back up; I should have a fix for this in a day or so. I'll make a change to scorecard's resource cleanup logic to remove the CRD resource first when doing its cleanup. I'll also change the deletion timeout from a hard-coded value to use the --init-timeout configuration setting; this will allow finalizers to complete before the operator deployment is deleted.
ok, the order of deletion was actually fine, but the initTimeout value was not being used in the deletion wait. I've added that; a PR is coming up soon.
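In sketch form, the fix amounts to threading the configured timeout into the deletion wait instead of a hard-coded constant (illustrative names, not the actual patch; this assumes a viper-style config lookup and reuses deleteAndWait from the sketch above, with "init-timeout" mirroring the key in .osdk-scorecard.yaml shown earlier):
package cleanup

import (
	"context"
	"fmt"
	"time"

	"github.com/spf13/viper"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// cleanupCR deletes the test CR and waits up to the configured init-timeout,
// so slow finalizers get the same budget as resource creation.
func cleanupCR(ctx context.Context, c client.Client, cr client.Object) error {
	initTimeout := time.Duration(viper.GetInt("init-timeout")) * time.Second
	if err := deleteAndWait(ctx, c, cr, initTimeout); err != nil {
		return fmt.Errorf("cleanup function failed: %v", err)
	}
	return nil
}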
Bug Report
What did you do?
Created a simple Ansible Operator with SDK v0.11.0 to reproduce an issue we are seeing on our Ansible operators where finalisers do not complete within a few seconds.
Using the following tasks:
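(The reporter's actual task snippet was not captured in this thread; a hypothetical stand-in, where a long sleep simulates cleanup that outlasts the scorecard's wait, might look like this:)
- name: Simulate slow external cleanup
  command: sleep 60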
Set a finaliser in watches:
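(The reporter's watches.yaml was also not captured; a hedged example of what an Ansible Operator finalizer entry looks like, with placeholder group/kind/playbook names rather than the reporter's actual values:)
- version: v1alpha1
  group: example.com
  kind: Example
  playbook: /opt/ansible/playbook.yml
  finalizer:
    name: finalizer.example.com
    vars:
      state: absent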
Used the following scorecard configuration:
What did you expect to see?
With the sleep task disabled I see the following as expected:
What did you see instead? Under which circumstances?
Instead I get an error that cleanup does not complete:
Watching the operator deployment, I can see that it gets removed before the finaliser completes.
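One way to observe this (a hedged example; the namespace placeholder is whatever namespace the scorecard deploys into):
# In a second terminal while the scorecard runs: the operator deployment
# disappears while the CR is still terminating on its finaliser.
kubectl get deployments -n <scorecard-namespace> -w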
Environment
operator-sdk version: "v0.11.0"
commit: "39c65c36159a9c249e5f3c178205cc6e86c16f8d"
go version:
go version: "go1.13.1 darwin/amd64"
Kubernetes version information:
Kubernetes cluster kind:
Docker desktop 2.1.4.0 on macOs 10.15
Are you writing your operator in ansible, helm, or go?
Ansible
Possible Solution
Scorecard should wait before removing the operator deployment if finalisers are set.