
Reconsider design of restart events issued when rolling Kafka Pods #10958

Open
scholzj opened this issue Dec 15, 2024 · 9 comments

@scholzj
Member

scholzj commented Dec 15, 2024

Currently, when the Kafka pods are rolled, we issue Kubernetes Events describing the reason for the restart. This is done only for Kafka, Connect, and MM2 node restarts. The events are issued with the Pods as the main objects.

This approach has several issues:

  • Pods have many events when they are restarted. So when the restart reason event is issued by the Strimzi Cluster Operator, it is easily lost among them.
  • Very often, the restart reason is Pod has old revision, which means that the Pod definition has changed -> but the root cause of the change could be, for example, an updated listener certificate or something similar.

I think we should consider the future of the events used by the operator. Two options for how to deal with them come to my mind:

  1. We can issue them with the custom resource as the main object they reference (i.e. the regarding field). That would make it easier to find the events, as the custom resource will have only our events and not the events related to the Pod lifecycle. The Pod might be referenced as the related resource if needed. Issuing the events to the custom resource might also make it easier to consider other situations when we might want to issue events. (A rough sketch of such an event is shown after this list.)
  2. We can simply remove them. There seem to be some users using them, but I think it is a relatively small number. So removing them might help simplify our codebase and testing.
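
As a rough illustration of option 1 (not an agreed design; the resource names and the reason value below are hypothetical), the event could reference the custom resource in regarding and the restarted Pod in related:

apiVersion: events.k8s.io/v1
kind: Event
metadata:
  name: strimzi-event-abc123        # illustrative name
  namespace: kafka
action: StrimziInitiatedPodRestart
reportingController: strimzi.io/cluster-operator
regarding:                          # the custom resource becomes the main object
  apiVersion: kafka.strimzi.io/v1beta2
  kind: Kafka
  name: my-cluster
  namespace: kafka
related:                            # the restarted Pod, kept as a secondary reference
  apiVersion: v1
  kind: Pod
  name: my-cluster-kafka-0
  namespace: kafka
reason: PodHasOldRevision
type: Normal
note: Pod my-cluster-kafka-0 was restarted because it had an old revision.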
@scholzj scholzj changed the title Reconsider design restart events issued when rolling Kafka Pods Reconsider design of restart events issued when rolling Kafka Pods Dec 15, 2024
@ppatierno
Member

I have never used them directly, but my guess is that Kube events are useful for some users (as you said, we have some users of them, even if not that many). Even the idea of integrating self-healing could rely on events in the future.
My take on this is to go with option 1. I agree that our events could be "missed" in between Pod lifecycle events. My question is about which custom resource you are referring to. Our Pods are related to the StrimziPodSet resource, so are you referring to it and then using related to specify the specific Pod?

@scholzj
Member Author

scholzj commented Dec 17, 2024

I was thinking of the Kafka, KafkaConnect, KafkaMirrorMaker2, etc. resources.

@katheris
Contributor

katheris commented Jan 2, 2025

I would also vote for option 1 of having them use the custom resource as the main object they reference. I think it is useful to be able to see the restart as an event rather than having to always look through the operator logs.

@im-konge
Member

im-konge commented Jan 9, 2025

Triaged on 9.1.2025: We should go with option 1 and keep this issue open.

@katheris
Contributor

I've started taking a look at this and as a result have changed my mind on the approach. The suggestion was to change the regarding field to reference the custom resource instead of the Pod. If we do this, we need to consider how we represent the Pod that was restarted - do we include it in the message or in one of the other fields?

The fields for the Event API are in the Kubernetes docs

Today we use these fields:

  • action - hard-coded to StrimziInitiatedPodRestart
  • reportingController - hard-coded to strimzi.io/cluster-operator
  • reportingInstance - the pod name of the cluster operator pod
  • regarding - pod reference
  • reason - reason for restarting
  • type - Normal or Warning
  • eventTime - time the event occurred
  • note - note related to the reason
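
Put together, an event emitted today might look roughly like this (the Pod name, reason value, and timestamp are illustrative, not taken from a real cluster):

apiVersion: events.k8s.io/v1
kind: Event
metadata:
  name: strimzi-event-xyz789                        # illustrative name
  namespace: kafka
action: StrimziInitiatedPodRestart
reportingController: strimzi.io/cluster-operator
reportingInstance: strimzi-cluster-operator-abc12   # operator pod name
regarding:
  apiVersion: v1
  kind: Pod
  name: my-cluster-kafka-0                          # the restarted pod
  namespace: kafka
reason: PodHasOldRevision
type: Normal
eventTime: "2025-01-30T10:15:00.000000Z"
note: Pod needs to be restarted, because it has an old revision.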

Given the fields above, it's possible to filter events. For example, I was able to use the command:

kubectl get events -n kafka --field-selector reportingComponent=strimzi.io/cluster-operator

I am unsure why the field comes back as reportingComponent rather than reportingController (presumably because kubectl serves events through the core/v1 API, where the events.k8s.io/v1 reportingController field appears as reportingComponent), but regardless it is possible for users to view only events from Strimzi, even via the command line. We also state this in the docs.

Although changing the regarding field might help users who are unfamiliar with events to find the Strimzi events, given the range of fields Kubernetes already provides I don't think it makes sense to move away from what we have today. Instead, we should encourage users to use a field selector to filter for the events that are relevant to them. The events have to be collected anyway, since they are only accessible live for an hour, so it seems like only a small step to then use field selectors on the resulting events.
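
For instance, the selector can be narrowed further to a single Pod (my-cluster-kafka-0 is a hypothetical broker Pod name; the regarding reference is served as involvedObject on core/v1 Events, so that is the name used in the field selector):

kubectl get events -n kafka --field-selector reportingComponent=strimzi.io/cluster-operator,involvedObject.name=my-cluster-kafka-0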

@ppatierno
Member

If we do this, we need to consider how we represent the Pod that was restarted

I guess that it was proposed to use the related field for the pod and the regarding for the Kafka custom resource.
Reading the Kubernetes API doc, though, it gives an example like this about the ReplicaSet controller:

regarding contains the object this Event is about. In most cases it's an Object reporting controller implements, e.g. ReplicaSetController implements ReplicaSets and this event is emitted because it acts on some changes in a ReplicaSet object.

And putting Kafka in the regarding field would suggest a change in the Kafka custom resource, which is not actually true.

Maybe we could stick with regarding pointing at the Pod and use related for the Kafka (or any other) custom resource?
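
A minimal sketch of that shape, showing only the two reference fields (the names are illustrative):

regarding:                          # unchanged: the Pod that was restarted
  apiVersion: v1
  kind: Pod
  name: my-cluster-kafka-0
  namespace: kafka
related:                            # new: link back to the owning custom resource
  apiVersion: kafka.strimzi.io/v1beta2
  kind: Kafka
  name: my-cluster
  namespace: kafka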

@scholzj
Member Author

scholzj commented Feb 10, 2025

If we do this, we need to consider how we represent the Pod that was restarted - do we include it in the message or in one of the other fields? [...] Instead, we should encourage users to use a field selector to filter for the events that are relevant to them.

The Event API also has, for example, the related object where you can put the Pods. You could also include the identification in the message.

Indeed, not all of the fields are filterable. But for me this is mainly a question of where people are actually looking -> Are people really interested in the Pod level? Are they filtering for events of typically at least 6 different Pods (3 controllers + 3 brokers)? Or are they more interested in what is happening with the operand itself, which might include events for all Pods and more?
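
If the events were instead issued with the custom resource in regarding (option 1), a single filter along these lines could pull everything for one cluster (my-cluster is a hypothetical Kafka resource name; regarding is served as involvedObject on core/v1 Events):

kubectl get events -n kafka --field-selector involvedObject.kind=Kafka,involvedObject.name=my-cluster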


@fvaleri
Contributor

fvaleri commented Feb 11, 2025

Just to mention, with the UTO proposal we decided to get rid of all KafkaTopic events and no one complained. The reasoning was that most people use the resource status, operator logs and metrics.

@scholzj
Member Author

scholzj commented Feb 11, 2025

@fvaleri Yeah, I know. But for the broker pods, there are some people using them. Likely not many, but we get questions, complaints, and so on about them. So they are not completely unused.

That said, the main complaint is that most of the restart events are Pod revision changed, which is not a good user-facing restart reason, and of course the possible change I suggested here does absolutely nothing about that.
