
Addressing Log Errors After Move to containerd #56

Open
artntek opened this issue Feb 19, 2025 · 4 comments

artntek commented Feb 19, 2025

No description provided.

@artntek artntek self-assigned this Feb 19, 2025

artntek commented Feb 19, 2025

OBSERVED

On k8s-dev-node-1, tail -f /var/log/syslog includes many entries like:

Feb 19 18:09:29 k8s-dev-node-1 kubelet[2006661]: E0219 18:09:29.257526 2006661 cri_stats_provider.go:669] "Unable to fetch container log stats" err="failed to get fsstats for \"/var/log/pods/ceph-csi-rbd_ceph-csi-rbd-csi-cephfsplugin-t9qsf_687a526a-d9db-4c81-917b-4aefef82fc7f/csi-rbdplugin/6.log\": no such file or directory" containerName="csi-rbdplugin"
Feb 19 18:09:29 k8s-dev-node-1 kubelet[2006661]: E0219 18:09:29.381692 2006661 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"0d450624-8e1b-499b-8184-64c7cd2c2437\" found, but error not a directory occurred when trying to remove the volumes dir" numErrs=32

CAUSE

Log locations have moved: docker kept container logs in /var/lib/docker/containers, but containerd no longer does (the containerd CRI runtime writes them directly under /var/log/pods).

brooke@k8s-dev-node-1:~$ sudo ls -la /var/lib/docker/containers/
total 0
drwxr-xr-x 2 root root  6 Feb 18 21:01 .
drwx--x--- 3 root root 24 Feb 18 21:01 ..
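
Meanwhile, the logs containerd currently writes should be regular files directly under /var/log/pods, not symlinks into /var/lib/docker. A quick check (a sketch, assuming the default containerd CRI log layout):

## Current (containerd-written) container logs should be regular files, not symlinks
$ find /var/log/pods -name '*.log' ! -type l | head -n 5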

However, existing symlinks from /var/log/pods/... still point to /var/lib/docker/containers/..., and those now-broken links are what's causing the log errors.

These are the broken symlinks:

brooke@k8s-dev-node-1:~$ find /var/log/pods -type l ! -exec test -e {} \; -print

/var/log/pods/kube-system_kube-proxy-wlwnx_688267e9-8631-436c-b45b-b4431534737f/kube-proxy/6.log
/var/log/pods/kube-system_kube-proxy-wlwnx_688267e9-8631-436c-b45b-b4431534737f/kube-proxy/7.log
/var/log/pods/kube-system_calico-node-6kfdj_ac07af65-25e7-4b53-933a-71dd5d409ee6/upgrade-ipam/7.log
/var/log/pods/kube-system_calico-node-6kfdj_ac07af65-25e7-4b53-933a-71dd5d409ee6/install-cni/0.log
/var/log/pods/kube-system_calico-node-6kfdj_ac07af65-25e7-4b53-933a-71dd5d409ee6/flexvol-driver/0.log
/var/log/pods/kube-system_calico-node-6kfdj_ac07af65-25e7-4b53-933a-71dd5d409ee6/calico-node/7.log
/var/log/pods/kube-system_calico-node-6kfdj_ac07af65-25e7-4b53-933a-71dd5d409ee6/calico-node/8.log
/var/log/pods/ceph-csi-rbd_ceph-csi-rbd-csi-cephfsplugin-t9qsf_687a526a-d9db-4c81-917b-4aefef82fc7f/driver-registrar/6.log
/var/log/pods/ceph-csi-rbd_ceph-csi-rbd-csi-cephfsplugin-t9qsf_687a526a-d9db-4c81-917b-4aefef82fc7f/driver-registrar/7.log
/var/log/pods/ceph-csi-rbd_ceph-csi-rbd-csi-cephfsplugin-t9qsf_687a526a-d9db-4c81-917b-4aefef82fc7f/csi-rbdplugin/6.log
/var/log/pods/ceph-csi-rbd_ceph-csi-rbd-csi-cephfsplugin-t9qsf_687a526a-d9db-4c81-917b-4aefef82fc7f/csi-rbdplugin/7.log
/var/log/pods/ceph-csi-rbd_ceph-csi-rbd-csi-cephfsplugin-t9qsf_687a526a-d9db-4c81-917b-4aefef82fc7f/liveness-prometheus/6.log
/var/log/pods/ceph-csi-rbd_ceph-csi-rbd-csi-cephfsplugin-t9qsf_687a526a-d9db-4c81-917b-4aefef82fc7f/liveness-prometheus/7.log
/var/log/pods/velero_node-agent-qxqsf_fd501ba9-6855-4084-97ac-acfc84b00c75/node-agent/6.log
/var/log/pods/velero_node-agent-qxqsf_fd501ba9-6855-4084-97ac-acfc84b00c75/node-agent/7.log
/var/log/pods/brooke_fluentbitbrooke-fluent-bit-8lx7d_0f80fda3-05ba-4991-b72d-b2d99a679402/fluent-bit/6.log
/var/log/pods/brooke_fluentbitbrooke-fluent-bit-8lx7d_0f80fda3-05ba-4991-b72d-b2d99a679402/fluent-bit/7.log
/var/log/pods/ceph-csi-cephfs_ceph-csi-cephfs-csi-cephfsplugin-zrk44_9e19da0d-8fc9-4d66-9208-870dab986568/driver-registrar/6.log
/var/log/pods/ceph-csi-cephfs_ceph-csi-cephfs-csi-cephfsplugin-zrk44_9e19da0d-8fc9-4d66-9208-870dab986568/driver-registrar/7.log
/var/log/pods/ceph-csi-cephfs_ceph-csi-cephfs-csi-cephfsplugin-zrk44_9e19da0d-8fc9-4d66-9208-870dab986568/csi-cephfsplugin/6.log
/var/log/pods/ceph-csi-cephfs_ceph-csi-cephfs-csi-cephfsplugin-zrk44_9e19da0d-8fc9-4d66-9208-870dab986568/csi-cephfsplugin/7.log
/var/log/pods/ceph-csi-cephfs_ceph-csi-cephfs-csi-cephfsplugin-zrk44_9e19da0d-8fc9-4d66-9208-870dab986568/liveness-prometheus/6.log
/var/log/pods/ceph-csi-cephfs_ceph-csi-cephfs-csi-cephfsplugin-zrk44_9e19da0d-8fc9-4d66-9208-870dab986568/liveness-prometheus/7.log

SOLUTION

Remove these broken symlinks:

## List them as a test
$ for link in $(find /var/log/pods -type l ! -exec test -e {} \; -print); do ls -la "$link"; done

## Remove them
$ for link in $(find /var/log/pods -type l ! -exec test -e {} \; -print); do sudo rm "$link"; done
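
Alternatively, a one-liner that does the same thing (assumes GNU find, as on Ubuntu; -xtype l matches symlinks whose target is missing):

## Print and delete dangling symlinks in one pass
$ sudo find /var/log/pods -xtype l -print -delete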


artntek commented Feb 19, 2025

OBSERVED

On k8s-dev-node-1, tail -f /var/log/syslog shows entries like this every 2 seconds:

Feb 19 22:32:13 k8s-dev-node-1 kubelet[2006661]: E0219 22:32:13.350197 2006661 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"3e251619-7e5d-4b39-8f87-ffdbc08364f1\" found, but error not a directory occurred when trying to remove the volumes dir" numErrs=30

CAUSE

(see this GH issue)

kubelet is unable to remove the orphaned pod directory, because there's still a file in it; e.g.:

brooke@k8s-dev-node-1:~$ sudo ls -la /var/lib/kubelet/pods/3e251619-7e5d-4b39-8f87-ffdbc08364f1/volumes/kubernetes.io~csi/cephfs-metadig-pv
total 4
drwxr-x--- 2 root root  27 Apr 15  2024 .
drwxr-x--- 3 root root  31 Apr 12  2024 ..
-rw-r--r-- 1 root root 253 Feb 18 20:58 vol_data.json

SOLUTION

If you delete the orphaned vol_data.json file, kubelet is immediately able to delete the entire subtree under and including that pod ID (e.g. /var/lib/kubelet/pods/3e251619-7e5d-4b39-8f87-ffdbc08364f1). However, it then moves on to the next orphaned pod, and the same errors show up for that one. It's therefore necessary to loop over the syslog output, as below (interrupt with Ctrl-C once the orphaned-pod errors stop appearing):

while true; do \
    sudo tail -n10 /var/log/syslog | \
    grep kubelet | grep -Eo 'orphaned pod \\"([a-z0-9]+-?)*\\"' | \
    awk '{ print $3 }' | tr -d '\\' | uniq | \
    xargs -I % sh -c 'echo "deleting /var/lib/kubelet/pods/%"; sudo rm -rf /var/lib/kubelet/pods/%;'; \
    sleep 2; \
done
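
A read-only variant to preview what would be removed (a sketch, assuming the same kubelet log format as above; it only lists the orphaned pod UIDs and deletes nothing):

## List the orphaned pod UIDs kubelet is currently complaining about
$ sudo grep kubelet /var/log/syslog | \
    grep -Eo 'orphaned pod \\"([a-z0-9]+-?)*\\"' | \
    tr -d '\\"' | awk '{ print $3 }' | sort -u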


artntek commented Feb 20, 2025

OBSERVED

Feb 20 00:14:19 docker-dev-ucsb-1 systemd[1]: Configuration file /run/systemd/system/netplan-ovs-cleanup.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.

SOLUTION

sudo chmod o+r /run/systemd/system/netplan-ovs-cleanup.service
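
To verify the change (note that /run is a tmpfs, so this unit file is likely regenerated by netplan at boot and the warning may reappear after a reboot):

## Unit file should now be world-readable (o+r)
ls -l /run/systemd/system/netplan-ovs-cleanup.service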


artntek commented Feb 20, 2025

To do (apply the fixes above on each host; see the sketch after this list):

  • k8s-dev-ctrl-1
  • k8s-dev-node-1
  • k8s-dev-node-2
  • k8s-dev-node-3
  • k8s-dev-node-4
  • k8s-dev-node-5
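
A possible way to roll out the broken-symlink cleanup across the hosts above (a sketch, assuming passwordless ssh and sudo on each host, and GNU find; the orphaned-pod cleanup loop is interactive and still needs to be run per node):

## Remove dangling /var/log/pods symlinks on every host in the to-do list
for host in k8s-dev-ctrl-1 k8s-dev-node-1 k8s-dev-node-2 k8s-dev-node-3 k8s-dev-node-4 k8s-dev-node-5; do
    echo "== ${host} =="
    ssh "${host}" 'sudo find /var/log/pods -xtype l -print -delete'
done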
