
Reconsider design of restart events issued when rolling Kafka Pods #10958

Open
scholzj opened this issue Dec 15, 2024 · 9 comments

@scholzj
Member

scholzj commented Dec 15, 2024

Currently, when the Kafka pods are rolled, we issue Kubernetes Events describing the reason for the restart. This is done only for Kafka, Connect, and MM2 node restarts. The events are issued with the Pods as the main objects.

This approach has several issues:

  • Pods have many events when they are restarted. So when the restart reason event is issued by the Strimzi Cluster Operator, it is easily lost among them.
  • Very often, the restart reason is Pod has old revision, which means that the Pod definition has changed -> but the root cause of the change could be, for example, an updated listener certificate or something similar.

I think we should consider the future of the events used by the operator. Two options for how to deal with them come to my mind:

  1. We can issue them with the custom resource as the main object they reference (i.e. the regarding field). That would make it easier to find the events, as the custom resource will have only our events and not the events related to the Pod lifecycle. The Pod might be referenced as the related resource if needed. Issuing the events to the custom resource might also make it easier to consider other situations when we might want to issue events. (A rough sketch of such an event is shown after this list.)
  2. We can simply remove them. There seem to be some users using them, but I think it is a relatively small number. So removing them might help simplify our codebase and testing.
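
As a rough illustration of option 1 (not an agreed design; the resource names and the reason value below are hypothetical), the event could reference the custom resource in regarding and the restarted Pod in related:

apiVersion: events.k8s.io/v1
kind: Event
metadata:
  name: strimzi-event-abc123        # illustrative name
  namespace: kafka
action: StrimziInitiatedPodRestart
reportingController: strimzi.io/cluster-operator
regarding:                          # the custom resource becomes the main object
  apiVersion: kafka.strimzi.io/v1beta2
  kind: Kafka
  name: my-cluster
  namespace: kafka
related:                            # the restarted Pod, kept as a secondary reference
  apiVersion: v1
  kind: Pod
  name: my-cluster-kafka-0
  namespace: kafka
reason: PodHasOldRevision
type: Normal
note: Pod my-cluster-kafka-0 was restarted because it had an old revision.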
@scholzj scholzj changed the title Reconsider design restart events issued when rolling Kafka Pods Reconsider design of restart events issued when rolling Kafka Pods Dec 15, 2024
@ppatierno
Member

I have never used them directly, but my guess is that Kube events are useful for some users (as you said, we have some users of them, even if not that many). Even the idea of integrating self-healing could rely on events in the future.
My take on this is to go with option 1. I agree that our events could be "missed" in between Pod lifecycle events. My question is about which custom resource you are referring to. Our Pods are related to the StrimziPodSet resource, so are you referring to it and then using related to specify the specific Pod?

@scholzj
Member Author

scholzj commented Dec 17, 2024

I was thinking of the Kafka, KafkaConnect, KafkaMirrorMaker2, etc. resources.

@katheris
Contributor

katheris commented Jan 2, 2025

I would also vote for option 1 of having them use the custom resource as the main object they reference. I think it is useful to be able to see the restart as an event rather than having to always look through the operator logs.

@im-konge
Member

im-konge commented Jan 9, 2025

Triaged on 9.1.2025: We should go with option 1 and keep this issue open.

@katheris
Contributor

I've started taking a look at this and as a result have changed my mind on the approach. The suggestion was to change the regarding field to reference the custom resource instead of the Pod. If we do this, we need to consider how we represent the Pod that was restarted - do we include it in the message or in one of the other fields?

The fields for the Event API are in the Kubernetes docs

Today we use these fields:

  • action - hard-coded to StrimziInitiatedPodRestart
  • reportingController - hard-coded to strimzi.io/cluster-operator
  • reportingInstance - the pod name of the cluster operator pod
  • regarding - pod reference
  • reason - reason for restarting
  • type - Normal or Warning
  • eventTime - time the event occurred
  • note - note related to the reason
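
Put together, an event emitted today might look roughly like this (the Pod name, reason value, and timestamp are illustrative, not taken from a real cluster):

apiVersion: events.k8s.io/v1
kind: Event
metadata:
  name: strimzi-event-xyz789                        # illustrative name
  namespace: kafka
action: StrimziInitiatedPodRestart
reportingController: strimzi.io/cluster-operator
reportingInstance: strimzi-cluster-operator-abc12   # operator pod name
regarding:
  apiVersion: v1
  kind: Pod
  name: my-cluster-kafka-0                          # the restarted pod
  namespace: kafka
reason: PodHasOldRevision
type: Normal
eventTime: "2025-01-30T10:15:00.000000Z"
note: Pod needs to be restarted, because it has an old revision.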

Given the fields above, it's possible to filter events. For example, I was able to use the command:

kubectl get events -n kafka --field-selector reportingComponent=strimzi.io/cluster-operator

I am unsure why the field comes back as reportingComponent rather than reportingController (presumably because kubectl serves events through the core/v1 API, where the events.k8s.io/v1 reportingController field appears as reportingComponent), but regardless it is possible for users to view only events from Strimzi, even via the command line. We also state this in the docs.

Although changing the regarding field might help users who are unfamiliar with events to find the Strimzi events, given the range of fields Kubernetes already provides I don't think it makes sense to move away from what we have today. Instead, we should encourage users to use a field selector to filter for the events that are relevant to them. The events have to be collected anyway, since they are only accessible live for an hour, so it seems like only a small step to then use field selectors on the resulting events.
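
For instance, the selector can be narrowed further to a single Pod (my-cluster-kafka-0 is a hypothetical broker Pod name; the regarding reference is served as involvedObject on core/v1 Events, so that is the name used in the field selector):

kubectl get events -n kafka --field-selector reportingComponent=strimzi.io/cluster-operator,involvedObject.name=my-cluster-kafka-0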

@ppatierno
Member

If we do this, we need to consider how we represent the Pod that was restarted

I guess that it was proposed to use the related field for the pod and the regarding for the Kafka custom resource.
Reading the Kubernetes API doc, though, it gives an example like this about the ReplicaSet controller:

regarding contains the object this Event is about. In most cases it's an Object reporting controller implements, e.g. ReplicaSetController implements ReplicaSets and this event is emitted because it acts on some changes in a ReplicaSet object.

And putting Kafka in the regarding field would suggest a change in the Kafka custom resource, which is not actually true.

Maybe we could stick with regarding pointing at the Pod and use related for the Kafka (or any other) custom resource?
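
A minimal sketch of that shape, showing only the two reference fields (the names are illustrative):

regarding:                          # unchanged: the Pod that was restarted
  apiVersion: v1
  kind: Pod
  name: my-cluster-kafka-0
  namespace: kafka
related:                            # new: link back to the owning custom resource
  apiVersion: kafka.strimzi.io/v1beta2
  kind: Kafka
  name: my-cluster
  namespace: kafka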

@scholzj
Member Author

scholzj commented Feb 10, 2025

If we do this, we need to consider how we represent the Pod that was restarted - do we include it in the message or in one of the other fields? [...] Instead, we should encourage users to use a field selector to filter for the events that are relevant to them.

The Event API also has, for example, the related object where you can put the Pods. You could also include the identification in the message.

Indeed, not all of the fields are filterable. But for me this is mainly a question of where people are actually looking -> Are people really interested in the Pod level? Are they filtering for events of typically at least 6 different Pods (3 controllers + 3 brokers)? Or are they more interested in what is happening with the operand itself, which might include events for all Pods and more?
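
If the events were instead issued with the custom resource in regarding (option 1), a single filter along these lines could pull everything for one cluster (my-cluster is a hypothetical Kafka resource name; regarding is served as involvedObject on core/v1 Events):

kubectl get events -n kafka --field-selector involvedObject.kind=Kafka,involvedObject.name=my-cluster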


@fvaleri
Contributor

fvaleri commented Feb 11, 2025

Just to mention, with the UTO proposal we decided to get rid of all KafkaTopic events and no one complained. The reasoning was that most people use the resource status, operator logs and metrics.

@scholzj
Member Author

scholzj commented Feb 11, 2025

@fvaleri Yeah, I know. But for the broker pods, there are some people using them. Likely not many, but we get questions, complaints, and so on about them. So they are not completely unused.

That said, the main complaint is that most of the restart events are Pod revision changed, which is not a good user-facing restart reason, and of course the possible change I suggested here does absolutely nothing about that.
