
NodeHealthy is not changed in the majority-failure case #8129

Closed
youngbo89 opened this issue Feb 19, 2023 · 5 comments · Fixed by #9864
Assignees: youngbo89
Labels:
  • area/control-plane: Issues or PRs related to control-plane lifecycle management
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/active: Indicates that an issue or PR is actively being worked on by a contributor.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

youngbo89 commented Feb 19, 2023

What steps did you take and what happened:

  1. Create a cluster with kcp.replicas=3
  2. Wait for provisioning to complete
  3. Shut down control-plane machine #1
  4. Shut down control-plane machine #2

What did you expect to happen:
The NodeHealthy condition of the second shut-down machine should become Unknown, just as it did for the first machine. Instead it remains True, as the table below shows (a sketch of the summarization logic involved follows the table).

Machine condition             First shut-down machine   Second shut-down machine
APIServerPodHealthy           Unknown                   Unknown
ControllerManagerPodHealthy   Unknown                   Unknown
EtcdMemberHealthy             Unknown                   Unknown
EtcdPodHealthy                Unknown                   Unknown
NodeHealthy                   Unknown                   True   (unexpected)
SchedulerPodHealthy           Unknown                   Unknown
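
For context on why Unknown is the expected value: NodeHealthy is derived from the conditions on the workload cluster's Node object, and after the first shutdown the node lifecycle controller marks those conditions Unknown (see the NodeConditionsFailed message on the first machine below). The snippet below is a minimal, self-contained sketch of that kind of summarization in Go, not the actual Cluster API implementation (summarizeNodeHealth is a made-up name). The likely reason the summary never flips for the second machine is that once two of three control-plane machines are down, etcd loses quorum and the workload API server becomes unreachable, so the Node object cannot be read at all; the other conditions' messages ("Failed to get pod status", "could not establish a connection to any etcd node") point the same way.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// summarizeNodeHealth is an illustrative stand-in (not upstream code) for the
// logic that folds a Node's conditions into a machine-level NodeHealthy status:
// any Unknown watched condition degrades the summary to Unknown, and a clearly
// bad condition makes it False.
func summarizeNodeHealth(node *corev1.Node) corev1.ConditionStatus {
	summary := corev1.ConditionTrue
	for _, c := range node.Status.Conditions {
		switch c.Type {
		case corev1.NodeMemoryPressure, corev1.NodeDiskPressure, corev1.NodePIDPressure:
			if c.Status == corev1.ConditionUnknown {
				summary = corev1.ConditionUnknown
			} else if c.Status == corev1.ConditionTrue {
				return corev1.ConditionFalse // pressure reported: unhealthy
			}
		case corev1.NodeReady:
			if c.Status == corev1.ConditionUnknown {
				summary = corev1.ConditionUnknown
			} else if c.Status == corev1.ConditionFalse {
				return corev1.ConditionFalse
			}
		}
	}
	return summary
}

func main() {
	// After the first shutdown the node lifecycle controller has marked the
	// node's conditions Unknown, so the summary correctly becomes Unknown.
	down := &corev1.Node{Status: corev1.NodeStatus{Conditions: []corev1.NodeCondition{
		{Type: corev1.NodeReady, Status: corev1.ConditionUnknown},
		{Type: corev1.NodeMemoryPressure, Status: corev1.ConditionUnknown},
	}}}
	fmt.Println("NodeHealthy should be:", summarizeNodeHealth(down))
	// This logic can only run if the Node can be read; once the workload API
	// server is unreachable, no new summary is computed, which matches the
	// stale NodeHealthy=True on the second machine.
}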
Machine Object Details (YAML)

First Machine

apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  annotations:
    capi.vks.linecorp.com/cni-plugin: calico-ipip
    capi.vks.linecorp.com/control-plane-role: "true"
    capi.vks.linecorp.com/etcd-role: "true"
    capi.vks.linecorp.com/node-pool-id: p-vksmantest.yb0219-nodetest2-cp
    controlplane.cluster.x-k8s.io/kubeadm-cluster-configuration: '{"etcd":{},"networking":{},"apiServer":{},"controllerManager":{"extraArgs":{"bind-address":"0.0.0.0"}},"scheduler":{"extraArgs":{"bind-address":"0.0.0.0"}},"dns":{},"imageRepository":"registry.k8s.io"}'
  creationTimestamp: "2023-02-19T08:17:15Z"
  finalizers:
  - machine.cluster.x-k8s.io
  generation: 3
  labels:
    cluster.x-k8s.io/cluster-name: yb0219-nodetest2
    cluster.x-k8s.io/control-plane: ""
    cluster.x-k8s.io/control-plane-name: yb0219-nodetest2-cp
  name: yb0219-nodetest2-cp-qwcb6
  namespace: p-vksmantest
  ownerReferences:
  - apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: KubeadmControlPlane
    name: yb0219-nodetest2-cp
    uid: ae52d794-13dc-4892-98cd-d52a250a787a
  resourceVersion: "12038424"
  uid: eef6869a-b037-4ec3-b4e6-efca3482bdf3
spec:
  bootstrap:
    configRef:
      apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
      kind: KubeadmConfig
      name: yb0219-nodetest2-cp-bsjjq
      namespace: p-vksmantest
      uid: 24def3f1-5535-41c0-9f8b-d8a82b7f0e2e
    dataSecretName: yb0219-nodetest2-cp-bsjjq
  clusterName: yb0219-nodetest2
  infrastructureRef:
    apiVersion: verda.lico.linecorp.com/v1alpha1
    kind: VerdaMachine
    name: yb0219-nodetest2-cp-grss2
    namespace: p-vksmantest
    uid: 86bba56d-81c9-496e-a2f7-fa3d9591a9a0
  nodeDeletionTimeout: 10s
  providerID: verda://411e4e42-d066-479d-a910-594b40b367bc
  version: v1.25.4
status:
  addresses:
  - address: 10.241.133.151
    type: InternalIP
  bootstrapReady: true
  certificatesExpiryDate: "2024-02-19T08:19:45Z"
  conditions:
  - lastTransitionTime: "2023-02-19T08:20:40Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-02-19T09:00:00Z"
    message: Node is unreachable
    reason: PodInspectionFailed
    status: Unknown
    type: APIServerPodHealthy
  - lastTransitionTime: "2023-02-19T08:17:15Z"
    status: "True"
    type: BootstrapReady
  - lastTransitionTime: "2023-02-19T09:00:00Z"
    message: Node is unreachable
    reason: PodInspectionFailed
    status: Unknown
    type: ControllerManagerPodHealthy
  - lastTransitionTime: "2023-02-19T08:59:49Z"
    message: 'Failed to connect to the etcd pod on the yb0219-nodetest2-cp-grss2 node:
      could not establish a connection to any etcd node: unable to create etcd client:
      context deadline exceeded'
    reason: MemberInspectionFailed
    status: Unknown
    type: EtcdMemberHealthy
  - lastTransitionTime: "2023-02-19T09:00:00Z"
    message: Node is unreachable
    reason: PodInspectionFailed
    status: Unknown
    type: EtcdPodHealthy
  - lastTransitionTime: "2023-02-19T08:20:40Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2023-02-19T08:59:38Z"
    message: 'Node condition MemoryPressure is Unknown. Node condition DiskPressure
      is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown. '
    reason: NodeConditionsFailed
    status: Unknown
    type: NodeHealthy
  - lastTransitionTime: "2023-02-19T09:00:00Z"
    message: Node is unreachable
    reason: PodInspectionFailed
    status: Unknown
    type: SchedulerPodHealthy
  infrastructureReady: true
  lastUpdated: "2023-02-19T08:20:41Z"
  nodeInfo:
    architecture: amd64
    bootID: 09aaeadf-18f8-456e-a3ee-e541cfbd0ace
    containerRuntimeVersion: containerd://1.6.6
    kernelVersion: 5.4.206-200.el7.x86_64
    kubeProxyVersion: v1.25.4
    kubeletVersion: v1.25.4
    machineID: 662a1f015c715ff51e65ea91c3b57903
    operatingSystem: linux
    osImage: CentOS Linux 7 (Core)
    systemUUID: 411e4e42-d066-479d-a910-594b40b367bc
  nodeRef:
    apiVersion: v1
    kind: Node
    name: yb0219-nodetest2-cp-grss2
    uid: 3c05508a-cd6a-43ed-8214-d4dc170cfd04
  observedGeneration: 3
  phase: Running

Second Machine

apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  annotations:
    capi.vks.linecorp.com/cni-plugin: calico-ipip
    capi.vks.linecorp.com/control-plane-role: "true"
    capi.vks.linecorp.com/etcd-role: "true"
    capi.vks.linecorp.com/node-pool-id: p-vksmantest.yb0219-nodetest2-cp
    controlplane.cluster.x-k8s.io/kubeadm-cluster-configuration: '{"etcd":{},"networking":{},"apiServer":{},"controllerManager":{"extraArgs":{"bind-address":"0.0.0.0"}},"scheduler":{"extraArgs":{"bind-address":"0.0.0.0"}},"dns":{},"imageRepository":"registry.k8s.io"}'
  creationTimestamp: "2023-02-19T08:20:43Z"
  finalizers:
  - machine.cluster.x-k8s.io
  generation: 3
  labels:
    cluster.x-k8s.io/cluster-name: yb0219-nodetest2
    cluster.x-k8s.io/control-plane: ""
    cluster.x-k8s.io/control-plane-name: yb0219-nodetest2-cp
  name: yb0219-nodetest2-cp-8jp4d
  namespace: p-vksmantest
  ownerReferences:
  - apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: KubeadmControlPlane
    name: yb0219-nodetest2-cp
    uid: ae52d794-13dc-4892-98cd-d52a250a787a
  resourceVersion: "12040745"
  uid: eb2bf040-f0a0-4bac-8dce-e3ceb7193608
spec:
  bootstrap:
    configRef:
      apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
      kind: KubeadmConfig
      name: yb0219-nodetest2-cp-mk5dq
      namespace: p-vksmantest
      uid: 51ced914-b334-42cb-b090-e624a6a4cedc
    dataSecretName: yb0219-nodetest2-cp-mk5dq
  clusterName: yb0219-nodetest2
  infrastructureRef:
    apiVersion: verda.lico.linecorp.com/v1alpha1
    kind: VerdaMachine
    name: yb0219-nodetest2-cp-p6wbn
    namespace: p-vksmantest
    uid: 81b4070e-d7cb-4978-be4c-c8c00790efa7
  nodeDeletionTimeout: 10s
  providerID: verda://ebb23846-63f5-40a3-9c4d-def1d0d8e114
  version: v1.25.4
status:
  addresses:
  - address: 10.241.133.154
    type: InternalIP
  bootstrapReady: true
  certificatesExpiryDate: "2024-02-19T08:24:00Z"
  conditions:
  - lastTransitionTime: "2023-02-19T08:24:24Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-02-19T09:05:21Z"
    message: Failed to get pod status
    reason: PodInspectionFailed
    status: Unknown
    type: APIServerPodHealthy
  - lastTransitionTime: "2023-02-19T08:20:43Z"
    status: "True"
    type: BootstrapReady
  - lastTransitionTime: "2023-02-19T09:05:21Z"
    message: Failed to get pod status
    reason: PodInspectionFailed
    status: Unknown
    type: ControllerManagerPodHealthy
  - lastTransitionTime: "2023-02-19T09:05:21Z"
    message: 'Failed to connect to the etcd pod on the yb0219-nodetest2-cp-p6wbn node:
      could not establish a connection to any etcd node: unable to create etcd client:
      context deadline exceeded'
    reason: MemberInspectionFailed
    status: Unknown
    type: EtcdMemberHealthy
  - lastTransitionTime: "2023-02-19T09:05:21Z"
    message: Failed to get pod status
    reason: PodInspectionFailed
    status: Unknown
    type: EtcdPodHealthy
  - lastTransitionTime: "2023-02-19T08:24:24Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2023-02-19T08:24:25Z"
    status: "True"
    type: NodeHealthy
  - lastTransitionTime: "2023-02-19T09:05:21Z"
    message: Failed to get pod status
    reason: PodInspectionFailed
    status: Unknown
    type: SchedulerPodHealthy
  infrastructureReady: true
  lastUpdated: "2023-02-19T08:24:24Z"
  nodeInfo:
    architecture: amd64
    bootID: c555ac34-866c-4e00-b81f-3775dfb3d230
    containerRuntimeVersion: containerd://1.6.6
    kernelVersion: 5.4.206-200.el7.x86_64
    kubeProxyVersion: v1.25.4
    kubeletVersion: v1.25.4
    machineID: 662a1f015c715ff51e65ea91c3b57903
    operatingSystem: linux
    osImage: CentOS Linux 7 (Core)
    systemUUID: ebb23846-63f5-40a3-9c4d-def1d0d8e114
  nodeRef:
    apiVersion: v1
    kind: Node
    name: yb0219-nodetest2-cp-p6wbn
    uid: fbd451bd-7b57-46d5-a7ca-6849ce135bc3
  observedGeneration: 3
  phase: Running
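
For reference, the two condition snapshots above come from Machine objects on the management cluster, which stays reachable throughout. The following is a hedged sketch (it assumes the sigs.k8s.io/cluster-api v1beta1 API and its util/conditions helpers are on the module path, and it hard-codes the namespace and machine name from this report) for watching how NodeHealthy does, or in this case does not, transition while the control-plane machines are shut down:

package main

import (
	"context"
	"fmt"
	"os"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/types"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	// Register the Cluster API types so the client can decode Machine objects.
	scheme := runtime.NewScheme()
	if err := clusterv1.AddToScheme(scheme); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Talk to the management cluster via the usual kubeconfig resolution.
	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{Scheme: scheme})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Namespace and name of the second (still NodeHealthy=True) machine above.
	m := &clusterv1.Machine{}
	key := types.NamespacedName{Namespace: "p-vksmantest", Name: "yb0219-nodetest2-cp-8jp4d"}
	if err := c.Get(context.Background(), key, m); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// MachineNodeHealthyCondition is the "NodeHealthy" condition type.
	if cond := conditions.Get(m, clusterv1.MachineNodeHealthyCondition); cond != nil {
		fmt.Printf("NodeHealthy: %s (reason=%q, lastTransition=%s)\n",
			cond.Status, cond.Reason, cond.LastTransitionTime)
	} else {
		fmt.Println("NodeHealthy condition not set")
	}
}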


Environment:

  • Cluster API version: v1.3.2
  • Kubernetes version:
    • management cluster: v1.24
    • workload cluster: v1.25

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 19, 2023
@fabriziopandini (Member) commented:

/triage accepted
/help

@k8s-ci-robot (Contributor) commented:

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/triage accepted
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 23, 2023
@sbueringer (Member) commented:

/area control-plane

@k8s-ci-robot k8s-ci-robot added the area/control-plane Issues or PRs related to control-plane lifecycle management label Feb 24, 2023
@youngbo89 (Author) commented:

/lifecycle active

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Mar 12, 2023
@fabriziopandini (Member) commented:

/assign @youngbo89
