
NodeHealthy is not changed in the majority-failure case #8129

Closed
youngbo89 opened this issue Feb 19, 2023 · 5 comments · Fixed by #9864
Assignees: youngbo89
Labels:
  • area/control-plane: Issues or PRs related to control-plane lifecycle management
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/active: Indicates that an issue or PR is actively being worked on by a contributor.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

youngbo89 commented Feb 19, 2023

What steps did you take and what happened:

  1. Create a cluster with kcp.replicas=3
  2. Wait for provisioning to complete
  3. Shut down control-plane machine #1
  4. Shut down control-plane machine #2

What did you expect to happen:
The NodeHealthy condition of the second shut-down machine should become Unknown, just as it did for the first machine. Instead it remains True, as the table below shows (a sketch of the summarization logic involved follows the table).

Machine condition             First shut-down machine   Second shut-down machine
APIServerPodHealthy           Unknown                   Unknown
ControllerManagerPodHealthy   Unknown                   Unknown
EtcdMemberHealthy             Unknown                   Unknown
EtcdPodHealthy                Unknown                   Unknown
NodeHealthy                   Unknown                   True   (unexpected)
SchedulerPodHealthy           Unknown                   Unknown
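
For context on why Unknown is the expected value: NodeHealthy is derived from the conditions on the workload cluster's Node object, and after the first shutdown the node lifecycle controller marks those conditions Unknown (see the NodeConditionsFailed message on the first machine below). The snippet below is a minimal, self-contained sketch of that kind of summarization in Go, not the actual Cluster API implementation (summarizeNodeHealth is a made-up name). The likely reason the summary never flips for the second machine is that once two of three control-plane machines are down, etcd loses quorum and the workload API server becomes unreachable, so the Node object cannot be read at all; the other conditions' messages ("Failed to get pod status", "could not establish a connection to any etcd node") point the same way.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// summarizeNodeHealth is an illustrative stand-in (not upstream code) for the
// logic that folds a Node's conditions into a machine-level NodeHealthy status:
// any Unknown watched condition degrades the summary to Unknown, and a clearly
// bad condition makes it False.
func summarizeNodeHealth(node *corev1.Node) corev1.ConditionStatus {
	summary := corev1.ConditionTrue
	for _, c := range node.Status.Conditions {
		switch c.Type {
		case corev1.NodeMemoryPressure, corev1.NodeDiskPressure, corev1.NodePIDPressure:
			if c.Status == corev1.ConditionUnknown {
				summary = corev1.ConditionUnknown
			} else if c.Status == corev1.ConditionTrue {
				return corev1.ConditionFalse // pressure reported: unhealthy
			}
		case corev1.NodeReady:
			if c.Status == corev1.ConditionUnknown {
				summary = corev1.ConditionUnknown
			} else if c.Status == corev1.ConditionFalse {
				return corev1.ConditionFalse
			}
		}
	}
	return summary
}

func main() {
	// After the first shutdown the node lifecycle controller has marked the
	// node's conditions Unknown, so the summary correctly becomes Unknown.
	down := &corev1.Node{Status: corev1.NodeStatus{Conditions: []corev1.NodeCondition{
		{Type: corev1.NodeReady, Status: corev1.ConditionUnknown},
		{Type: corev1.NodeMemoryPressure, Status: corev1.ConditionUnknown},
	}}}
	fmt.Println("NodeHealthy should be:", summarizeNodeHealth(down))
	// This logic can only run if the Node can be read; once the workload API
	// server is unreachable, no new summary is computed, which matches the
	// stale NodeHealthy=True on the second machine.
}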
Machine Object Details (YAML)

First Machine

apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  annotations:
    capi.vks.linecorp.com/cni-plugin: calico-ipip
    capi.vks.linecorp.com/control-plane-role: "true"
    capi.vks.linecorp.com/etcd-role: "true"
    capi.vks.linecorp.com/node-pool-id: p-vksmantest.yb0219-nodetest2-cp
    controlplane.cluster.x-k8s.io/kubeadm-cluster-configuration: '{"etcd":{},"networking":{},"apiServer":{},"controllerManager":{"extraArgs":{"bind-address":"0.0.0.0"}},"scheduler":{"extraArgs":{"bind-address":"0.0.0.0"}},"dns":{},"imageRepository":"registry.k8s.io"}'
  creationTimestamp: "2023-02-19T08:17:15Z"
  finalizers:
  - machine.cluster.x-k8s.io
  generation: 3
  labels:
    cluster.x-k8s.io/cluster-name: yb0219-nodetest2
    cluster.x-k8s.io/control-plane: ""
    cluster.x-k8s.io/control-plane-name: yb0219-nodetest2-cp
  name: yb0219-nodetest2-cp-qwcb6
  namespace: p-vksmantest
  ownerReferences:
  - apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: KubeadmControlPlane
    name: yb0219-nodetest2-cp
    uid: ae52d794-13dc-4892-98cd-d52a250a787a
  resourceVersion: "12038424"
  uid: eef6869a-b037-4ec3-b4e6-efca3482bdf3
spec:
  bootstrap:
    configRef:
      apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
      kind: KubeadmConfig
      name: yb0219-nodetest2-cp-bsjjq
      namespace: p-vksmantest
      uid: 24def3f1-5535-41c0-9f8b-d8a82b7f0e2e
    dataSecretName: yb0219-nodetest2-cp-bsjjq
  clusterName: yb0219-nodetest2
  infrastructureRef:
    apiVersion: verda.lico.linecorp.com/v1alpha1
    kind: VerdaMachine
    name: yb0219-nodetest2-cp-grss2
    namespace: p-vksmantest
    uid: 86bba56d-81c9-496e-a2f7-fa3d9591a9a0
  nodeDeletionTimeout: 10s
  providerID: verda://411e4e42-d066-479d-a910-594b40b367bc
  version: v1.25.4
status:
  addresses:
  - address: 10.241.133.151
    type: InternalIP
  bootstrapReady: true
  certificatesExpiryDate: "2024-02-19T08:19:45Z"
  conditions:
  - lastTransitionTime: "2023-02-19T08:20:40Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-02-19T09:00:00Z"
    message: Node is unreachable
    reason: PodInspectionFailed
    status: Unknown
    type: APIServerPodHealthy
  - lastTransitionTime: "2023-02-19T08:17:15Z"
    status: "True"
    type: BootstrapReady
  - lastTransitionTime: "2023-02-19T09:00:00Z"
    message: Node is unreachable
    reason: PodInspectionFailed
    status: Unknown
    type: ControllerManagerPodHealthy
  - lastTransitionTime: "2023-02-19T08:59:49Z"
    message: 'Failed to connect to the etcd pod on the yb0219-nodetest2-cp-grss2 node:
      could not establish a connection to any etcd node: unable to create etcd client:
      context deadline exceeded'
    reason: MemberInspectionFailed
    status: Unknown
    type: EtcdMemberHealthy
  - lastTransitionTime: "2023-02-19T09:00:00Z"
    message: Node is unreachable
    reason: PodInspectionFailed
    status: Unknown
    type: EtcdPodHealthy
  - lastTransitionTime: "2023-02-19T08:20:40Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2023-02-19T08:59:38Z"
    message: 'Node condition MemoryPressure is Unknown. Node condition DiskPressure
      is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown. '
    reason: NodeConditionsFailed
    status: Unknown
    type: NodeHealthy
  - lastTransitionTime: "2023-02-19T09:00:00Z"
    message: Node is unreachable
    reason: PodInspectionFailed
    status: Unknown
    type: SchedulerPodHealthy
  infrastructureReady: true
  lastUpdated: "2023-02-19T08:20:41Z"
  nodeInfo:
    architecture: amd64
    bootID: 09aaeadf-18f8-456e-a3ee-e541cfbd0ace
    containerRuntimeVersion: containerd://1.6.6
    kernelVersion: 5.4.206-200.el7.x86_64
    kubeProxyVersion: v1.25.4
    kubeletVersion: v1.25.4
    machineID: 662a1f015c715ff51e65ea91c3b57903
    operatingSystem: linux
    osImage: CentOS Linux 7 (Core)
    systemUUID: 411e4e42-d066-479d-a910-594b40b367bc
  nodeRef:
    apiVersion: v1
    kind: Node
    name: yb0219-nodetest2-cp-grss2
    uid: 3c05508a-cd6a-43ed-8214-d4dc170cfd04
  observedGeneration: 3
  phase: Running

Second Machine

apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  annotations:
    capi.vks.linecorp.com/cni-plugin: calico-ipip
    capi.vks.linecorp.com/control-plane-role: "true"
    capi.vks.linecorp.com/etcd-role: "true"
    capi.vks.linecorp.com/node-pool-id: p-vksmantest.yb0219-nodetest2-cp
    controlplane.cluster.x-k8s.io/kubeadm-cluster-configuration: '{"etcd":{},"networking":{},"apiServer":{},"controllerManager":{"extraArgs":{"bind-address":"0.0.0.0"}},"scheduler":{"extraArgs":{"bind-address":"0.0.0.0"}},"dns":{},"imageRepository":"registry.k8s.io"}'
  creationTimestamp: "2023-02-19T08:20:43Z"
  finalizers:
  - machine.cluster.x-k8s.io
  generation: 3
  labels:
    cluster.x-k8s.io/cluster-name: yb0219-nodetest2
    cluster.x-k8s.io/control-plane: ""
    cluster.x-k8s.io/control-plane-name: yb0219-nodetest2-cp
  name: yb0219-nodetest2-cp-8jp4d
  namespace: p-vksmantest
  ownerReferences:
  - apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: KubeadmControlPlane
    name: yb0219-nodetest2-cp
    uid: ae52d794-13dc-4892-98cd-d52a250a787a
  resourceVersion: "12040745"
  uid: eb2bf040-f0a0-4bac-8dce-e3ceb7193608
spec:
  bootstrap:
    configRef:
      apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
      kind: KubeadmConfig
      name: yb0219-nodetest2-cp-mk5dq
      namespace: p-vksmantest
      uid: 51ced914-b334-42cb-b090-e624a6a4cedc
    dataSecretName: yb0219-nodetest2-cp-mk5dq
  clusterName: yb0219-nodetest2
  infrastructureRef:
    apiVersion: verda.lico.linecorp.com/v1alpha1
    kind: VerdaMachine
    name: yb0219-nodetest2-cp-p6wbn
    namespace: p-vksmantest
    uid: 81b4070e-d7cb-4978-be4c-c8c00790efa7
  nodeDeletionTimeout: 10s
  providerID: verda://ebb23846-63f5-40a3-9c4d-def1d0d8e114
  version: v1.25.4
status:
  addresses:
  - address: 10.241.133.154
    type: InternalIP
  bootstrapReady: true
  certificatesExpiryDate: "2024-02-19T08:24:00Z"
  conditions:
  - lastTransitionTime: "2023-02-19T08:24:24Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-02-19T09:05:21Z"
    message: Failed to get pod status
    reason: PodInspectionFailed
    status: Unknown
    type: APIServerPodHealthy
  - lastTransitionTime: "2023-02-19T08:20:43Z"
    status: "True"
    type: BootstrapReady
  - lastTransitionTime: "2023-02-19T09:05:21Z"
    message: Failed to get pod status
    reason: PodInspectionFailed
    status: Unknown
    type: ControllerManagerPodHealthy
  - lastTransitionTime: "2023-02-19T09:05:21Z"
    message: 'Failed to connect to the etcd pod on the yb0219-nodetest2-cp-p6wbn node:
      could not establish a connection to any etcd node: unable to create etcd client:
      context deadline exceeded'
    reason: MemberInspectionFailed
    status: Unknown
    type: EtcdMemberHealthy
  - lastTransitionTime: "2023-02-19T09:05:21Z"
    message: Failed to get pod status
    reason: PodInspectionFailed
    status: Unknown
    type: EtcdPodHealthy
  - lastTransitionTime: "2023-02-19T08:24:24Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2023-02-19T08:24:25Z"
    status: "True"
    type: NodeHealthy
  - lastTransitionTime: "2023-02-19T09:05:21Z"
    message: Failed to get pod status
    reason: PodInspectionFailed
    status: Unknown
    type: SchedulerPodHealthy
  infrastructureReady: true
  lastUpdated: "2023-02-19T08:24:24Z"
  nodeInfo:
    architecture: amd64
    bootID: c555ac34-866c-4e00-b81f-3775dfb3d230
    containerRuntimeVersion: containerd://1.6.6
    kernelVersion: 5.4.206-200.el7.x86_64
    kubeProxyVersion: v1.25.4
    kubeletVersion: v1.25.4
    machineID: 662a1f015c715ff51e65ea91c3b57903
    operatingSystem: linux
    osImage: CentOS Linux 7 (Core)
    systemUUID: ebb23846-63f5-40a3-9c4d-def1d0d8e114
  nodeRef:
    apiVersion: v1
    kind: Node
    name: yb0219-nodetest2-cp-p6wbn
    uid: fbd451bd-7b57-46d5-a7ca-6849ce135bc3
  observedGeneration: 3
  phase: Running
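
For reference, the two condition snapshots above come from Machine objects on the management cluster, which stays reachable throughout. The following is a hedged sketch (it assumes the sigs.k8s.io/cluster-api v1beta1 API and its util/conditions helpers are on the module path, and it hard-codes the namespace and machine name from this report) for watching how NodeHealthy does, or in this case does not, transition while the control-plane machines are shut down:

package main

import (
	"context"
	"fmt"
	"os"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/types"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	// Register the Cluster API types so the client can decode Machine objects.
	scheme := runtime.NewScheme()
	if err := clusterv1.AddToScheme(scheme); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Talk to the management cluster via the usual kubeconfig resolution.
	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{Scheme: scheme})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Namespace and name of the second (still NodeHealthy=True) machine above.
	m := &clusterv1.Machine{}
	key := types.NamespacedName{Namespace: "p-vksmantest", Name: "yb0219-nodetest2-cp-8jp4d"}
	if err := c.Get(context.Background(), key, m); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// MachineNodeHealthyCondition is the "NodeHealthy" condition type.
	if cond := conditions.Get(m, clusterv1.MachineNodeHealthyCondition); cond != nil {
		fmt.Printf("NodeHealthy: %s (reason=%q, lastTransition=%s)\n",
			cond.Status, cond.Reason, cond.LastTransitionTime)
	} else {
		fmt.Println("NodeHealthy condition not set")
	}
}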


Environment:

  • Cluster API version: v1.3.2
  • Kubernetes version:
    • management cluster: v1.24
    • workload cluster: v1.25

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 19, 2023
@fabriziopandini (Member) commented:

/triage accepted
/help

@k8s-ci-robot (Contributor) commented:

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/triage accepted
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 23, 2023
@sbueringer (Member) commented:

/area control-plane

@k8s-ci-robot k8s-ci-robot added the area/control-plane Issues or PRs related to control-plane lifecycle management label Feb 24, 2023
@youngbo89 (Author) commented:

/lifecycle active

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Mar 12, 2023
@fabriziopandini (Member) commented:

/assign @youngbo89
