
Velero not re-creating volume using storage class with Retain policy #5506

Open
aschi1 opened this issue Oct 27, 2022 · 11 comments
Assignees
Labels
Needs info (waiting for information) · Needs investigation

Comments


aschi1 commented Oct 27, 2022

Hi,
We tried testing a DR scenario for our deployments and hit a problem with a PVC that uses a storage class with a Retain policy. We are on an AWS EKS cluster running Kubernetes 1.23, using the CSI drivers for disks and snapshotting, the Velero plugin for CSI, and features: EnableCSI set in Helm.

I tried searching through the issues but did not find anything similar, only one discussion about missing roles to access the KMS keys used for disk encryption; even after adding those permissions, our situation did not change.

Thank you for your help

What steps did you take and what happened:

  1. I created a deployment with a PVC
  2. The PVC has the label velero: "true" so that it is backed up by Velero
  3. I waited until the automatic backup was triggered via Velero
  4. I verified that I could see the snapshot of the disk in the AWS console
  5. I deleted my deployment together with the created PVC, the PV, and the EBS volume in the AWS console (we wanted to test a full DR scenario)
  6. After verifying everything was deleted, I executed velero restore create --from-backup velero-snapshot-every-hour-20221027124246
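The steps above can be sketched roughly as follows (hedged: the namespace, PVC, and backup names are taken from this report, while the label selector and availability guard are assumptions added for illustration):

```shell
# Sketch of the repro steps. Namespace, PVC, and backup names come from this
# report; the "app=alpine" selector is a hypothetical label for the workload,
# and the guard makes the sketch safe to run where the CLIs are missing.
NS=alpine
PVC=alpine-retain-pvc
BACKUP=velero-snapshot-every-hour-20221027124246

if command -v kubectl >/dev/null 2>&1 && command -v velero >/dev/null 2>&1; then
  # 2. PVC labeled velero=true so the scheduled backup picks it up
  kubectl label pvc "$PVC" -n "$NS" velero=true --overwrite

  # 5. Delete the workload and its claim (the EBS volume itself was deleted
  #    separately in the AWS console)
  kubectl delete deployment,pvc -n "$NS" -l app=alpine

  # 6. Restore from the scheduled backup
  velero restore create --from-backup "$BACKUP"
else
  echo "kubectl/velero not available; commands shown for reference only"
fi
```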

What did you expect to happen:

  • The PV and PVC are recreated in the cluster (this part works: I can see the resources in the cluster)
  • The EBS volume is recreated in the AWS console
    • Instead, the volume is not created in AWS and the pod is stuck in ContainerCreating with the error message below
AttachVolume.Attach failed for volume "pvc-xxx" : rpc error: code = Internal desc = Could not get volume with ID "vol-xxx": InvalidVolume.NotFound: The volume 'vol-xxx' does not exist.
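When the pod hits this error, a quick sanity check (a sketch; vol-xxx is the redacted volume ID from the message above, to be replaced with the real one) is to compare the volume ID the restored PV references against what actually exists in AWS:

```shell
# Ask AWS whether the volume the restored PV points at actually exists.
# VOL is the redacted volume ID from the error above; substitute the real one.
VOL="vol-xxx"

if command -v kubectl >/dev/null 2>&1 && command -v aws >/dev/null 2>&1; then
  # For CSI-provisioned PVs, the backing volume ID lives in spec.csi.volumeHandle
  kubectl get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.csi.volumeHandle}{"\n"}{end}'

  # InvalidVolume.NotFound here confirms the volume is gone on the AWS side
  aws ec2 describe-volumes --volume-ids "$VOL"
else
  echo "kubectl/aws not available; commands shown for reference only"
fi
```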

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, refer to velero debug --help

Anything else you would like to add:
At first we thought Velero might be expecting the EBS volume to stay in the AWS console since its policy is Retain, but we disproved this by creating a PVC with the manifest below. When we created a PVC like this manually, the volume was automatically created in the AWS console and the pod started with the data restored. Afterwards, we compared it to the manifest that Velero creates for the PVC after performing the restore, and they were trying to do the same thing.

When we tested the same behavior with a PVC using a storage class with a Delete policy, everything worked without issues. This leaves us wondering what the problem could be. We thought it might be missing permissions in AWS (we are using IRSA to pass the role to the Velero pod), but since disks with the Delete policy work, it does not seem like a permission problem.
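For reference, the "retain-sc" storage class referenced in the manifest below presumably looks something like this (a hedged sketch; the provisioner and parameters are assumptions, the relevant field is only reclaimPolicy: Retain):

```yaml
# Hypothetical definition of the "retain-sc" StorageClass used in this report.
# Provisioner and parameters are assumptions; only reclaimPolicy is the point.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: retain-sc
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
```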

PVC manifest used to recreate the volume manually

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alpine-retain-pvc
  namespace: alpine
  labels:
    velero: "true"
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: "retain-sc"
  dataSource:
    name: velero-alpine-retain-pvc-jcxgp
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io


Environment:

  • Velero version (use velero version):
Client:
	Version: v1.9.2
	Git commit: -
Server:
	Version: v1.9.2
  • Velero features (use velero client config get features):
features: EnableCSI
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.2", GitCommit:"5835544ca568b757a8ecae5c153f317e5736700e", GitTreeState:"clean", BuildDate:"2022-09-21T14:25:45Z", GoVersion:"go1.19.1", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.10-eks-15b7512", GitCommit:"cd6399691d9b1fed9ec20c9c5e82f5993c3f42cb", GitTreeState:"clean", BuildDate:"2022-08-31T19:17:01Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version: AWS EKS, version 1.23
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Linux

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@aschi1 aschi1 changed the title Velero not re-creating volume from storage class with Retain policy Velero not re-creating volume using storage class with Retain policy Oct 27, 2022
@Lyndon-Li
Contributor

This looks like an environmental problem where the PV could not be provisioned by the CSI provisioner. Could you try the verification here to make sure all the CSI functionality works correctly?

EKS 1.23 introduced CSI migration and some other security changes; if you are using KMS, there are additional steps to configure the policies on both the service role and the node group.
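Concretely, for encrypted EBS volumes the roles involved typically need a KMS statement along these lines (a sketch, not taken from this issue; the action list and key ARN are assumptions to check against the AWS documentation for your setup):

```json
{
  "Effect": "Allow",
  "Action": [
    "kms:CreateGrant",
    "kms:Decrypt",
    "kms:DescribeKey",
    "kms:GenerateDataKeyWithoutPlaintext",
    "kms:ReEncrypt*"
  ],
  "Resource": "arn:aws:kms:<region>:<account-id>:key/<key-id>"
}
```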

@Lyndon-Li Lyndon-Li self-assigned this Oct 28, 2022
@aschi1
Author

aschi1 commented Oct 31, 2022

Thank you for replying. I tested the use case you linked and everything worked without issues. It was quite similar to our use case with the reclaim policy on the PVC set to Delete. The problems arise when we try it with Retain and the volume has been deleted in the AWS console. Even in that case, when I perform the restore via manual resources (as in the verification you linked) it works OK. When I initiate it via velero restore create it does not work, so it seems like there might be some problem between Velero and the CSI driver?

Could you please link the changes you mentioned in EKS? Specifically, what other steps do we need to take? I tried to google but wasn't able to find anything that seemed to help. We are using a new cluster created on 1.23 and are not migrating any volumes from the old EBS storage class. Below is the current policy we are using for Velero.

{
    "Version" : "2012-10-17",
    "Statement" : [
      {
        "Effect" : "Allow",
        "Action" : [
          "ec2:DescribeVolumes",
          "ec2:DescribeSnapshots",
          "ec2:CreateTags",
          "ec2:CreateVolume",
          "ec2:CreateSnapshot",
          "ec2:DeleteSnapshot"
        ],
        "Resource" : "*"
      },
      {
        "Effect" : "Allow",
        "Action" : [
          "s3:GetObject",
          "s3:DeleteObject",
          "s3:PutObject",
          "s3:AbortMultipartUpload",
          "s3:ListMultipartUploadParts"
        ],
        "Resource" : [
          "arn:aws:s3:::${local.velero_bucket}/*"
        ]
      },
      {
        "Effect" : "Allow",
        "Action" : [
          "s3:ListBucket"
        ],
        "Resource" : [
          "arn:aws:s3:::${local.velero_bucket}"
        ]
      }
    ]
  }

@Lyndon-Li
Contributor

Could you run the commands below in your environment:
kubectl get po -n kube-system
kubectl get sc retain-sc -o yaml
kubectl get pv -o yaml

@raghavkaranth

Running into the exact same issue. Restores fail if the storage class used has its policy set to Retain. Deterministically reproducible. Kindly prioritize this issue/bug.

@Lyndon-Li
Contributor

@raghavkaranth To help us troubleshoot, could you run the commands below after the problem occurs and share the output with us:

kubectl get vsc -o yaml
kubectl get vs -n velero -o yaml
kubectl get vs -n <restored namespace> -o yaml

@waclawikj

waclawikj commented Apr 14, 2023

I have a similar problem, but I'm using restic instead of CSI snapshots. If PVC data is included in the backup (proper restic volumes), everything works fine and Velero re-creates the volume. But when only the PVCs and PVs are backed up, it doesn't: the restored PVC takes over the backed-up PV (making the original PVC "Lost"), or if the restore happens on another Kubernetes cluster, the volume can't be mounted to the pod. So I can't back up and restore just pods that have PVCs (without their data). Also kindly prioritize this issue/bug.

@Lyndon-Li
Contributor

@waclawikj
I think this is a different problem from the current issue.
For this particular problem with restic backup, the point is that Velero doesn't support overwriting existing objects, so for any kind of restore you need to delete the original objects first.

@Lyndon-Li
Contributor

@aschi1 @raghavkaranth For the original problem, could you help to collect the info mentioned above so that we can further troubleshoot?

@bingwei-hong-partior

Could the problem be due to this? It does not dynamically provision a new volume when the reclaim policy is not Delete.
[screenshot attachment]

@Lyndon-Li
Contributor

I suspect all the problems mentioned in this issue thread are related to overwriting existing items during restore, which is not supported by Velero.
Here are the error logs attached to the original issue from @aschi1:

{"level":"info","logSource":"pkg/restore/restore.go:1370","msg":"Restore of VolumeSnapshotClass, delete-vsc skipped: it already exists in the cluster and is the same as the backed up version","restore":"default/velero-snapshot-every-hour-20221027124246-20221027150447","time":"2022-10-27T13:04:48Z"}
{"level":"warning","logSource":"pkg/controller/restore_controller.go:511","msg":"Cluster resource restore warning: could not restore, VolumeSnapshotContent \"snapcontent-d2d629b8-9091-4cdc-b021-6f778445c563\" already exists. Warning: the in-cluster version is different than the backed-up version.","restore":"default/velero-snapshot-every-hour-20221027124246-20221027150447","time":"2022-10-27T13:04:49Z"}
{"level":"warning","logSource":"pkg/controller/restore_controller.go:515","msg":"Namespace alpine, resource restore warning: could not restore, VolumeSnapshot \"velero-alpine-retain-pvc-jcxgp\" already exists. Warning: the in-cluster version is different than the backed-up version.","restore":"default/velero-snapshot-every-hour-20221027124246-20221027150447","time":"2022-10-27T13:04:49Z"}

For a CSI snapshot restore, besides deleting the PVC/PV, we also need to guarantee that the VS/VSC/snapshot class either doesn't exist or contains the correct info.
Normally this should not happen, because Velero deletes the VS/VSC during the backup. We need further info (as a first step, the info mentioned in #5506 (comment)) to troubleshoot what happened in the environment.
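A pre-restore cleanup along these lines (a hedged sketch; the names below are the redacted or sample ones appearing earlier in this thread, to be substituted with real ones) would avoid the "already exists" conflicts shown in the logs:

```shell
# Remove leftover objects from a previous restore attempt so Velero does not
# hit "already exists" conflicts. All names are placeholders/samples taken
# from this thread; substitute the real ones before running.
NS=alpine
PVC=alpine-retain-pvc
PV=pvc-xxx
VS=velero-alpine-retain-pvc-jcxgp
VSC=snapcontent-d2d629b8-9091-4cdc-b021-6f778445c563

if command -v kubectl >/dev/null 2>&1; then
  kubectl delete pvc "$PVC" -n "$NS" --ignore-not-found
  kubectl delete pv "$PV" --ignore-not-found
  kubectl delete volumesnapshot "$VS" -n "$NS" --ignore-not-found
  kubectl delete volumesnapshotcontent "$VSC" --ignore-not-found
else
  echo "kubectl not available; commands shown for reference only"
fi
```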
