
Velero not re-creating volume using storage class with Retain policy #5506

Open
aschi1 opened this issue Oct 27, 2022 · 11 comments
Assignees
Labels
Needs info (waiting for information) · Needs investigation

Comments


aschi1 commented Oct 27, 2022

Hi,
We tried testing a DR scenario for our deployments and hit a problem with a PVC that uses a storage class with a Retain policy. We are on an AWS EKS cluster running Kubernetes 1.23, using the CSI drivers for disks and snapshotting, the Velero plugin for CSI, and features: EnableCSI set in Helm.

I tried searching through the issues but did not find anything similar, only one discussion about missing roles to access the KMS keys used for disk encryption; even after adding those permissions, our situation did not change.

Thank you for your help

What steps did you take and what happened:

  1. I created a deployment with a PVC
  2. The PVC has the label velero: "true" so that it is backed up by Velero
  3. I waited until the automatic backup was triggered via Velero
  4. I verified that I could see the snapshot of the disk in the AWS console
  5. I deleted my deployment together with the created PVC, the PV, and the EBS volume in the AWS console (we wanted to test a full DR scenario)
  6. After verifying everything was deleted, I executed velero restore create --from-backup velero-snapshot-every-hour-20221027124246
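The steps above can be sketched roughly as follows (hedged: the namespace, PVC, and backup names are taken from this report, while the label selector and availability guard are assumptions added for illustration):

```shell
# Sketch of the repro steps. Namespace, PVC, and backup names come from this
# report; the "app=alpine" selector is a hypothetical label for the workload,
# and the guard makes the sketch safe to run where the CLIs are missing.
NS=alpine
PVC=alpine-retain-pvc
BACKUP=velero-snapshot-every-hour-20221027124246

if command -v kubectl >/dev/null 2>&1 && command -v velero >/dev/null 2>&1; then
  # 2. PVC labeled velero=true so the scheduled backup picks it up
  kubectl label pvc "$PVC" -n "$NS" velero=true --overwrite

  # 5. Delete the workload and its claim (the EBS volume itself was deleted
  #    separately in the AWS console)
  kubectl delete deployment,pvc -n "$NS" -l app=alpine

  # 6. Restore from the scheduled backup
  velero restore create --from-backup "$BACKUP"
else
  echo "kubectl/velero not available; commands shown for reference only"
fi
```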

What did you expect to happen:

  • The PV and PVC are recreated in the cluster (this part works: I can see the resources in the cluster)
  • The EBS volume is recreated in the AWS console
    • Instead, the volume is not created in AWS and the pod is stuck in ContainerCreating with the error message below
AttachVolume.Attach failed for volume "pvc-xxx" : rpc error: code = Internal desc = Could not get volume with ID "vol-xxx": InvalidVolume.NotFound: The volume 'vol-xxx' does not exist.
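When the pod hits this error, a quick sanity check (a sketch; vol-xxx is the redacted volume ID from the message above, to be replaced with the real one) is to compare the volume ID the restored PV references against what actually exists in AWS:

```shell
# Ask AWS whether the volume the restored PV points at actually exists.
# VOL is the redacted volume ID from the error above; substitute the real one.
VOL="vol-xxx"

if command -v kubectl >/dev/null 2>&1 && command -v aws >/dev/null 2>&1; then
  # For CSI-provisioned PVs, the backing volume ID lives in spec.csi.volumeHandle
  kubectl get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.csi.volumeHandle}{"\n"}{end}'

  # InvalidVolume.NotFound here confirms the volume is gone on the AWS side
  aws ec2 describe-volumes --volume-ids "$VOL"
else
  echo "kubectl/aws not available; commands shown for reference only"
fi
```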

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, refer to velero debug --help

Anything else you would like to add:
At first we thought Velero might be expecting the EBS volume to stay in the AWS console since its policy is Retain, but we disproved this by creating a PVC with the manifest below. When we created a PVC like this manually, the volume was automatically created in the AWS console and the pod started with the data restored. Afterwards, we compared it to the manifest that Velero creates for the PVC after performing the restore, and they were trying to do the same thing.

When we tested the same behavior with a PVC using a storage class with a Delete policy, everything worked without issues. This leaves us wondering what the problem could be. We thought it might be missing permissions in AWS (we are using IRSA to pass the role to the Velero pod), but since disks with the Delete policy work, it does not seem like a permission problem.
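For reference, the "retain-sc" storage class referenced in the manifest below presumably looks something like this (a hedged sketch; the provisioner and parameters are assumptions, the relevant field is only reclaimPolicy: Retain):

```yaml
# Hypothetical definition of the "retain-sc" StorageClass used in this report.
# Provisioner and parameters are assumptions; only reclaimPolicy is the point.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: retain-sc
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
```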

PVC manifest used to recreate the volume manually

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alpine-retain-pvc
  namespace: alpine
  labels:
    velero: "true"
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: "retain-sc"
  dataSource:
    name: velero-alpine-retain-pvc-jcxgp
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io


Environment:

  • Velero version (use velero version):
Client:
	Version: v1.9.2
	Git commit: -
Server:
	Version: v1.9.2
  • Velero features (use velero client config get features):
features: EnableCSI
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.2", GitCommit:"5835544ca568b757a8ecae5c153f317e5736700e", GitTreeState:"clean", BuildDate:"2022-09-21T14:25:45Z", GoVersion:"go1.19.1", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.10-eks-15b7512", GitCommit:"cd6399691d9b1fed9ec20c9c5e82f5993c3f42cb", GitTreeState:"clean", BuildDate:"2022-08-31T19:17:01Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version: AWS EKS, version 1.23
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Linux

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@aschi1 aschi1 changed the title Velero not re-creating volume from storage class with Retain policy Velero not re-creating volume using storage class with Retain policy Oct 27, 2022
@Lyndon-Li
Contributor

This looks like an environmental problem where the PV could not be provisioned by the CSI provisioner. Could you try the verification here to make sure all the CSI functionality works correctly?

EKS 1.23 introduced CSI migration and some other security changes; if you are using KMS, there are additional steps to configure the policies on both the service role and the node group.
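Concretely, for encrypted EBS volumes the roles involved typically need a KMS statement along these lines (a sketch, not taken from this issue; the action list and key ARN are assumptions to check against the AWS documentation for your setup):

```json
{
  "Effect": "Allow",
  "Action": [
    "kms:CreateGrant",
    "kms:Decrypt",
    "kms:DescribeKey",
    "kms:GenerateDataKeyWithoutPlaintext",
    "kms:ReEncrypt*"
  ],
  "Resource": "arn:aws:kms:<region>:<account-id>:key/<key-id>"
}
```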

@Lyndon-Li Lyndon-Li self-assigned this Oct 28, 2022
@aschi1
Author

aschi1 commented Oct 31, 2022

Thank you for replying. I tested the use case you linked and everything worked without issues. It was quite similar to our use case with the reclaim policy on the PVC set to Delete. The problems arise when we try it with Retain and the volume has been deleted in the AWS console. Even in that case, when I perform the restore via manual resources (as in the verification you linked) it works OK. When I initiate it via velero restore create it does not work, so it seems like there might be some problem between Velero and the CSI driver?

Could you please link the changes you mentioned in EKS? Specifically, what other steps do we need to take? I tried to google but wasn't able to find anything that seemed to help. We are using a new cluster created on 1.23 and are not migrating any volumes from the old EBS storage class. Below is the current policy we are using for Velero.

{
    "Version" : "2012-10-17",
    "Statement" : [
      {
        "Effect" : "Allow",
        "Action" : [
          "ec2:DescribeVolumes",
          "ec2:DescribeSnapshots",
          "ec2:CreateTags",
          "ec2:CreateVolume",
          "ec2:CreateSnapshot",
          "ec2:DeleteSnapshot"
        ],
        "Resource" : "*"
      },
      {
        "Effect" : "Allow",
        "Action" : [
          "s3:GetObject",
          "s3:DeleteObject",
          "s3:PutObject",
          "s3:AbortMultipartUpload",
          "s3:ListMultipartUploadParts"
        ],
        "Resource" : [
          "arn:aws:s3:::${local.velero_bucket}/*"
        ]
      },
      {
        "Effect" : "Allow",
        "Action" : [
          "s3:ListBucket"
        ],
        "Resource" : [
          "arn:aws:s3:::${local.velero_bucket}"
        ]
      }
    ]
  }

@Lyndon-Li
Contributor

Could you run the commands below in your environment:
kubectl get po -n kube-system
kubectl get sc retain-sc -o yaml
kubectl get pv -o yaml

@raghavkaranth

Running into the exact same issue. Restores fail if the storage class used has its policy set to Retain. Deterministically reproducible. Kindly prioritize this issue/bug.

@Lyndon-Li
Contributor

@raghavkaranth To help us troubleshoot, could you run the commands below after the problem occurs and share the output with us:

kubectl get vsc -o yaml
kubectl get vs -n velero -o yaml
kubectl get vs -n <restored namespace> -o yaml

@waclawikj

waclawikj commented Apr 14, 2023

I have a similar problem, but I'm using restic instead of CSI snapshots. If PVC data is included in the backup (proper restic volumes), everything works fine and Velero re-creates the volume. But when only the PVCs and PVs are backed up, it doesn't: the restored PVC takes over the backed-up PV (making the original PVC "Lost"), or if the restore happens on another Kubernetes cluster, the volume can't be mounted to the pod. So I can't back up and restore just pods that have PVCs (without their data). Also kindly prioritize this issue/bug.

@Lyndon-Li
Contributor

@waclawikj
I think this is a different problem from the current issue.
For this particular problem with restic backup, the point is that Velero doesn't support overwriting existing objects, so for any kind of restore you need to delete the original objects first.

@Lyndon-Li
Contributor

@aschi1 @raghavkaranth For the original problem, could you help to collect the info mentioned above so that we can further troubleshoot?

@bingwei-hong-partior

Could the problem be due to this? It does not dynamically provision a new volume when the reclaim policy is not Delete.
[screenshot attachment]

@Lyndon-Li
Contributor

I suspect all the problems mentioned in this issue thread are related to overwriting existing items during restore, which is not supported by Velero.
Here are the error logs attached to the original issue from @aschi1:

{"level":"info","logSource":"pkg/restore/restore.go:1370","msg":"Restore of VolumeSnapshotClass, delete-vsc skipped: it already exists in the cluster and is the same as the backed up version","restore":"default/velero-snapshot-every-hour-20221027124246-20221027150447","time":"2022-10-27T13:04:48Z"}
{"level":"warning","logSource":"pkg/controller/restore_controller.go:511","msg":"Cluster resource restore warning: could not restore, VolumeSnapshotContent \"snapcontent-d2d629b8-9091-4cdc-b021-6f778445c563\" already exists. Warning: the in-cluster version is different than the backed-up version.","restore":"default/velero-snapshot-every-hour-20221027124246-20221027150447","time":"2022-10-27T13:04:49Z"}
{"level":"warning","logSource":"pkg/controller/restore_controller.go:515","msg":"Namespace alpine, resource restore warning: could not restore, VolumeSnapshot \"velero-alpine-retain-pvc-jcxgp\" already exists. Warning: the in-cluster version is different than the backed-up version.","restore":"default/velero-snapshot-every-hour-20221027124246-20221027150447","time":"2022-10-27T13:04:49Z"}

For a CSI snapshot restore, besides deleting the PVC/PV, we also need to guarantee that the VS/VSC/snapshot class either doesn't exist or contains the correct info.
Normally this should not happen, because Velero deletes the VS/VSC during the backup. We need further info (as a first step, the info mentioned in #5506 (comment)) to troubleshoot what happened in the environment.
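A pre-restore cleanup along these lines (a hedged sketch; the names below are the redacted or sample ones appearing earlier in this thread, to be substituted with real ones) would avoid the "already exists" conflicts shown in the logs:

```shell
# Remove leftover objects from a previous restore attempt so Velero does not
# hit "already exists" conflicts. All names are placeholders/samples taken
# from this thread; substitute the real ones before running.
NS=alpine
PVC=alpine-retain-pvc
PV=pvc-xxx
VS=velero-alpine-retain-pvc-jcxgp
VSC=snapcontent-d2d629b8-9091-4cdc-b021-6f778445c563

if command -v kubectl >/dev/null 2>&1; then
  kubectl delete pvc "$PVC" -n "$NS" --ignore-not-found
  kubectl delete pv "$PV" --ignore-not-found
  kubectl delete volumesnapshot "$VS" -n "$NS" --ignore-not-found
  kubectl delete volumesnapshotcontent "$VSC" --ignore-not-found
else
  echo "kubectl not available; commands shown for reference only"
fi
```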
