
Race condition for helper-pod when multiple pvc's are provisioned #154

Closed
stefan-kiss opened this issue Nov 11, 2020 · 9 comments
stefan-kiss commented Nov 11, 2020

Hello. I think I have uncovered a bug.
If you provision multiple PVs in rapid succession, the helper pod will only run for the first one.

By "first" I mean the helper pod that gets created first, which may or may not correspond to the PV that should be created.

How to reproduce, on a single-node Kubernetes cluster:

  1. Install the provisioner:

kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml

  2. Instead of creating one PVC and one pod, create 3 of each:

    pvcs.yaml:
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-path-pvc1
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 2Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-path-pvc2
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 2Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-path-pvc3
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 2Gi

pods.yaml:

---
apiVersion: v1
kind: Pod
metadata:
  name: volume-test01
  namespace: default
spec:
  containers:
  - name: volume-test
    image: nginx:stable-alpine
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: volv
      mountPath: /data
    ports:
    - containerPort: 80
  volumes:
  - name: volv
    persistentVolumeClaim:
      claimName: local-path-pvc1
---
apiVersion: v1
kind: Pod
metadata:
  name: volume-test02
  namespace: default
spec:
  containers:
  - name: volume-test
    image: nginx:stable-alpine
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: volv
      mountPath: /data
    ports:
    - containerPort: 80
  volumes:
  - name: volv
    persistentVolumeClaim:
      claimName: local-path-pvc2
---
apiVersion: v1
kind: Pod
metadata:
  name: volume-test03
  namespace: default
spec:
  containers:
  - name: volume-test
    image: nginx:stable-alpine
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: volv
      mountPath: /data
    ports:
    - containerPort: 80
  volumes:
  - name: volv
    persistentVolumeClaim:
      claimName: local-path-pvc3

Then check the /opt/local-path-provisioner path on the node:

[skiss-dev2 ~]$ ls -la /opt/local-path-provisioner
total 0
drwxr-xr-x  5 root root 222 Nov 11 08:56 .
drwxr-xr-x. 5 root root  65 Nov 11 08:50 ..
drwxr-xr-x  2 root root   6 Nov 11 08:56 pvc-919db3fc-6c88-446e-b77c-01e7ae260289_default_local-path-pvc1
drwxrwxrwx  2 root root   6 Nov 11 08:56 pvc-95bc6533-cd29-4f89-adaf-7b740e311969_default_local-path-pvc2
drwxr-xr-x  2 root root   6 Nov 11 08:56 pvc-c3e09e1e-19a6-44ec-9cf3-843fe24fe1b5_default_local-path-pvc3

As you can see, all 3 PVs are created, but only one has the correct set of permissions.

The provisioner logs:

I1111 08:56:10.202181       1 controller.go:1202] provision "default/local-path-pvc1" class "local-path": started
I1111 08:56:10.214691       1 controller.go:1202] provision "default/local-path-pvc2" class "local-path": started
time="2020-11-11T08:56:10Z" level=debug msg="config doesn't contain node master-worker-1487cf6df4b42f3c60ef, use DEFAULT_PATH_FOR_NON_LISTED_NODES instead"
time="2020-11-11T08:56:10Z" level=info msg="Creating volume pvc-919db3fc-6c88-446e-b77c-01e7ae260289 at master-worker-1487cf6df4b42f3c60ef:/opt/local-path-provisioner/pvc-919db3fc-6c88-446e-b77c-01e7ae260289_default_local-path-pvc1"
time="2020-11-11T08:56:10Z" level=info msg="create the helper pod helper-pod into local-path-storage"
I1111 08:56:10.221367       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"local-path-pvc1", UID:"919db3fc-6c88-446e-b77c-01e7ae260289", APIVersion:"v1", ResourceVersion:"26463", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/local-path-pvc1"
I1111 08:56:10.403174       1 controller.go:1202] provision "default/local-path-pvc3" class "local-path": started
time="2020-11-11T08:56:10Z" level=debug msg="config doesn't contain node master-worker-1487cf6df4b42f3c60ef, use DEFAULT_PATH_FOR_NON_LISTED_NODES instead"
time="2020-11-11T08:56:10Z" level=info msg="Creating volume pvc-95bc6533-cd29-4f89-adaf-7b740e311969 at master-worker-1487cf6df4b42f3c60ef:/opt/local-path-provisioner/pvc-95bc6533-cd29-4f89-adaf-7b740e311969_default_local-path-pvc2"
time="2020-11-11T08:56:10Z" level=info msg="create the helper pod helper-pod into local-path-storage"
I1111 08:56:10.410110       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"local-path-pvc2", UID:"95bc6533-cd29-4f89-adaf-7b740e311969", APIVersion:"v1", ResourceVersion:"26467", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/local-path-pvc2"
time="2020-11-11T08:56:10Z" level=debug msg="config doesn't contain node master-worker-1487cf6df4b42f3c60ef, use DEFAULT_PATH_FOR_NON_LISTED_NODES instead"
time="2020-11-11T08:56:10Z" level=info msg="Creating volume pvc-c3e09e1e-19a6-44ec-9cf3-843fe24fe1b5 at master-worker-1487cf6df4b42f3c60ef:/opt/local-path-provisioner/pvc-c3e09e1e-19a6-44ec-9cf3-843fe24fe1b5_default_local-path-pvc3"
time="2020-11-11T08:56:10Z" level=info msg="create the helper pod helper-pod into local-path-storage"
I1111 08:56:10.420310       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"local-path-pvc3", UID:"c3e09e1e-19a6-44ec-9cf3-843fe24fe1b5", APIVersion:"v1", ResourceVersion:"26472", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/local-path-pvc3"
time="2020-11-11T08:56:25Z" level=info msg="Volume pvc-c3e09e1e-19a6-44ec-9cf3-843fe24fe1b5 has been created on master-worker-1487cf6df4b42f3c60ef:/opt/local-path-provisioner/pvc-c3e09e1e-19a6-44ec-9cf3-843fe24fe1b5_default_local-path-pvc3"
time="2020-11-11T08:56:25Z" level=info msg="Volume pvc-919db3fc-6c88-446e-b77c-01e7ae260289 has been created on master-worker-1487cf6df4b42f3c60ef:/opt/local-path-provisioner/pvc-919db3fc-6c88-446e-b77c-01e7ae260289_default_local-path-pvc1"
time="2020-11-11T08:56:25Z" level=info msg="Volume pvc-95bc6533-cd29-4f89-adaf-7b740e311969 has been created on master-worker-1487cf6df4b42f3c60ef:/opt/local-path-provisioner/pvc-95bc6533-cd29-4f89-adaf-7b740e311969_default_local-path-pvc2"
time="2020-11-11T08:56:28Z" level=error msg="unable to delete the helper pod: pods \"helper-pod\" not found"
I1111 08:56:28.440603       1 controller.go:1284] provision "default/local-path-pvc1" class "local-path": volume "pvc-919db3fc-6c88-446e-b77c-01e7ae260289" provisioned
I1111 08:56:28.440662       1 controller.go:1301] provision "default/local-path-pvc1" class "local-path": succeeded
I1111 08:56:28.440687       1 volume_store.go:212] Trying to save persistentvolume "pvc-919db3fc-6c88-446e-b77c-01e7ae260289"
I1111 08:56:28.444032       1 controller.go:1284] provision "default/local-path-pvc3" class "local-path": volume "pvc-c3e09e1e-19a6-44ec-9cf3-843fe24fe1b5" provisioned
I1111 08:56:28.444058       1 controller.go:1301] provision "default/local-path-pvc3" class "local-path": succeeded
I1111 08:56:28.444067       1 volume_store.go:212] Trying to save persistentvolume "pvc-c3e09e1e-19a6-44ec-9cf3-843fe24fe1b5"
time="2020-11-11T08:56:28Z" level=error msg="unable to delete the helper pod: pods \"helper-pod\" not found"
I1111 08:56:28.444364       1 controller.go:1284] provision "default/local-path-pvc2" class "local-path": volume "pvc-95bc6533-cd29-4f89-adaf-7b740e311969" provisioned
I1111 08:56:28.444381       1 controller.go:1301] provision "default/local-path-pvc2" class "local-path": succeeded
I1111 08:56:28.444387       1 volume_store.go:212] Trying to save persistentvolume "pvc-95bc6533-cd29-4f89-adaf-7b740e311969"
I1111 08:56:28.456125       1 volume_store.go:219] persistentvolume "pvc-95bc6533-cd29-4f89-adaf-7b740e311969" saved
I1111 08:56:28.456388       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"local-path-pvc2", UID:"95bc6533-cd29-4f89-adaf-7b740e311969", APIVersion:"v1", ResourceVersion:"26467", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-95bc6533-cd29-4f89-adaf-7b740e311969
I1111 08:56:28.457477       1 volume_store.go:219] persistentvolume "pvc-c3e09e1e-19a6-44ec-9cf3-843fe24fe1b5" saved
I1111 08:56:28.457550       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"local-path-pvc3", UID:"c3e09e1e-19a6-44ec-9cf3-843fe24fe1b5", APIVersion:"v1", ResourceVersion:"26472", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-c3e09e1e-19a6-44ec-9cf3-843fe24fe1b5
I1111 08:56:28.459975       1 volume_store.go:219] persistentvolume "pvc-919db3fc-6c88-446e-b77c-01e7ae260289" saved
I1111 08:56:28.460001       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"local-path-pvc1", UID:"919db3fc-6c88-446e-b77c-01e7ae260289", APIVersion:"v1", ResourceVersion:"26463", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-919db3fc-6c88-446e-b77c-01e7ae260289

Note the two messages about being unable to delete the helper pod. That's because it was never created for those requests. However, there is no creation error message, because of this check:

	// If it already exists due to some previous errors, the pod will be cleaned up later automatically
	// https://github.com/rancher/local-path-provisioner/issues/27
	logrus.Infof("create the helper pod %s into %s", helperPod.Name, p.namespace)
	_, err = p.kubeClient.CoreV1().Pods(p.namespace).Create(helperPod)
	if err != nil && !k8serror.IsAlreadyExists(err) {
		return err
	}

From my understanding, all 3 PV provisioning requests are sent very close together, and since the helper pod always has the same name it cannot be created multiple times. The first request that gets through creates the pod; the rest fail silently. I'm not entirely sure why the path on the node exists in every case, since the helper pod never runs for those volumes.

However, it's pretty clear that only one helper pod runs at a time, so the custom provisioning code (such as the part that sets the permissions) runs only once.

xor22h commented Nov 11, 2020

I got two of them with drwxrwxrwx and one with drwxr-xr-x. But, yes, they are inconsistent.

yasker added a commit to yasker/local-path-provisioner that referenced this issue Nov 12, 2020
rancher#154

Signed-off-by: Sheng Yang <sheng.yang@rancher.com>
yasker added a commit to yasker/local-path-provisioner that referenced this issue Nov 12, 2020
To fix the race condition caused by reusing the helper pod name.

rancher#154

Signed-off-by: Sheng Yang <sheng.yang@rancher.com>
yasker added a commit to yasker/local-path-provisioner that referenced this issue Nov 12, 2020
To fix the race condition caused by reusing the helper pod name.

rancher#154

Signed-off-by: Sheng Yang <sheng.yang@rancher.com>
yasker added a commit that referenced this issue Nov 12, 2020
To fix the race condition caused by reusing the helper pod name.

#154

Signed-off-by: Sheng Yang <sheng.yang@rancher.com>

yasker commented Nov 12, 2020

@stefan-kiss Good catch!

Can you check whether the image yasker/local-path-provisioner:57801fc-amd64 fixes the issue in your environment? (Assuming you're using AMD64.) You'll need to replace the image in the original yaml file.


stefan-kiss commented Nov 12, 2020

Hello, yes, I tested it and I get correct results.

For my project (since I needed to move forward very fast), I am for now combining the helper pod name (string from config) with the actual PV name (from the function arguments), trimming the result to 63 characters and removing any trailing '-' to be sure I don't hit name/format problems.

But this approach (or any other that uses a unique pod name) leaves open the possibility of helper pods being left hanging. There are only a couple of scenarios where this could happen - for example, the provisioner pod getting killed while the helper pod is up - but it can still have an impact ...

I don't yet know how to fix this, but I'm thinking of replacing the deferred delete-pod function with a cleanup function that looks for all helper pods (by some label) and cleans them up if they are in 'Completed' or one of the failure statuses.
(well, anything != 'Running')
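The cleanup idea above could look roughly like this (a minimal sketch: the pod struct and podsToCleanUp are hypothetical stand-ins for pods already selected by a label via the real API, and the phases mirror the Kubernetes PodPhase values, where 'Completed' appears as Succeeded):

```go
package main

import "fmt"

// PodPhase mirrors the corev1.PodPhase string values.
type PodPhase string

const (
	Running   PodPhase = "Running"
	Succeeded PodPhase = "Succeeded"
	Failed    PodPhase = "Failed"
)

type pod struct {
	Name  string
	Phase PodPhase
}

// podsToCleanUp returns the names of every labeled helper pod whose phase
// is anything other than Running, so a periodic job can delete stragglers
// left behind if the provisioner was killed mid-provision.
func podsToCleanUp(pods []pod) []string {
	var stale []string
	for _, p := range pods {
		if p.Phase != Running {
			stale = append(stale, p.Name)
		}
	}
	return stale
}

func main() {
	pods := []pod{
		{"helper-pod-pvc-1", Succeeded},
		{"helper-pod-pvc-2", Running},
		{"helper-pod-pvc-3", Failed},
	}
	fmt.Println(podsToCleanUp(pods)) // prints: [helper-pod-pvc-1 helper-pod-pvc-3]
}
```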

P.S. Thank you for a very fast response!


yasker commented Nov 13, 2020

@stefan-kiss Labeling the helper pods and cleaning them up once completed sounds like a good idea. Though I probably won't have time to work on it until after Thanksgiving, due to the upcoming KubeCon and other projects. Feel free to raise a PR if you need it sooner.

@stefan-kiss

Sure, I'll try ... Thanks!


unki commented Nov 16, 2020

@yasker could you maybe already create a new release containing your fix #155, as v0.0.19?

Otherwise all new users will probably hit the same issue with the current v0.0.18 (or whichever version it was introduced in).
I was upgrading today from v0.0.14->v0.0.19 - without checking the issues here first - and 💥 ... :)

Big thanks in advance and for all your great work!


yasker commented Nov 19, 2020

I've just merged #156, which uses the PV name as the suffix for the helper pod name. I'll leave it there for a couple of days for testing, then I can release v0.0.19.

@sgoodliff

Hi,

I'm also hitting this, so I'd like to see v0.0.19 ship when it's ready.

Thanks!

yasker added a commit to yasker/local-path-provisioner that referenced this issue Nov 30, 2020
Fixed the helper pod race condition for multiple PVCs.

Issue: rancher#154

Signed-off-by: Sheng Yang <sheng.yang@rancher.com>
yasker added a commit that referenced this issue Nov 30, 2020
Fixed the helper pod race condition for multiple PVCs.

Issue: #154

Signed-off-by: Sheng Yang <sheng.yang@rancher.com>

yasker commented Nov 30, 2020

v0.0.19 has been released. https://github.com/rancher/local-path-provisioner/releases/tag/v0.0.19

Note that the labeled cleanup mechanism is not in this release, but since that behavior is the same as before, we can treat the labeled cleanup as an enhancement and close this bug for now. Feel free to file another issue for it.

yasker closed this as completed Nov 30, 2020
fsdrw08 pushed a commit to fsdrw08/local-path-provisioner that referenced this issue Aug 12, 2021
Fixed the helper pod race condition for multiple PVCs.

Issue: rancher#154

Signed-off-by: Sheng Yang <sheng.yang@rancher.com>