in_kubernetes_events: Inefficient Defaults Lead To Kube API Spamming/Resource Drain & extra processing required in fluent-bit #8315

Closed
ryanohnemus opened this issue Dec 21, 2023 · 2 comments · Fixed by #8351

Comments

@ryanohnemus

Bug Report

Describe the bug
The current in_kubernetes_events plugin polls the kube API every 500ms by default (unless you explicitly set the interval_sec / interval_nsec config options). It retrieves the same data over and over and uses improper resourceVersion semantics (see bug #8314) to detect resource (event) changes.

resourceVersion / resourceVersionMatch are unset in the call to /api/v1/events, which means every request must be served by a quorum read before the kube API responds.

https://kubernetes.io/docs/reference/using-api/api-concepts/#semantics-for-get-and-list
Unless you have strong consistency requirements, using resourceVersionMatch=NotOlderThan and a known resourceVersion is preferable since it can achieve better performance and scalability of your cluster than leaving resourceVersion and resourceVersionMatch unset, which requires quorum read to be served.
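To make the difference concrete, here is a minimal sketch of the two request shapes. The helper name `events_list_url` and the base URL are hypothetical and only for illustration; the path and query parameter names are the ones from the Kubernetes docs quoted above. This is not the plugin's code.

```python
from urllib.parse import urlencode


def events_list_url(base, resource_version=None, namespace=None):
    """Build a list URL for events (hypothetical helper, illustration only).

    With resource_version unset, the request carries no resourceVersion /
    resourceVersionMatch and therefore needs a quorum read. With a known
    resource_version, NotOlderThan lets the API server answer from its cache.
    """
    if namespace:
        path = f"/api/v1/namespaces/{namespace}/events"
    else:
        path = "/api/v1/events"

    params = {}
    if resource_version is not None:
        params["resourceVersion"] = resource_version
        params["resourceVersionMatch"] = "NotOlderThan"

    query = urlencode(params)
    return f"{base}{path}?{query}" if query else f"{base}{path}"


# Shape of the request described in this report (nothing set):
unset_url = events_list_url("https://kube-api.example")
# Shape the Kubernetes docs recommend when a resourceVersion is known:
cached_url = events_list_url("https://kube-api.example", resource_version="12345")
```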

In large clusters, this combination, especially together with another default (requesting events for all namespaces unless kube_namespace limits the request to a single namespace), can lead to several MB of data being pulled from the kube API servers every 500ms.
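A rough back-of-the-envelope calculation shows the scale. The 3 MB response size below is an assumed figure standing in for "several MB"; it is not a measurement from any cluster.

```python
def polling_load(interval_sec, response_mb):
    """Return (requests per second, MB per second) for fixed-interval polling."""
    requests_per_sec = 1.0 / interval_sec
    return requests_per_sec, requests_per_sec * response_mb


# Default 500ms interval, assumed 3 MB per full event list.
rps, mb_per_sec = polling_load(0.5, 3.0)
requests_per_day = rps * 86400
```

Under those assumptions a single fluent-bit instance issues 2 list requests per second (172,800 per day) and transfers about 6 MB/s of mostly unchanged data.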

To Reproduce

  • Steps to reproduce the problem:

Use the following input with defaults:

[INPUT]
    name          kubernetes_events
    tag           k8s_events

Set debug logging (FLB_LOG_LEVEL=debug) and you can see that the plugin consistently polls the same data and skips over resourceVersion information it already has. If you attach this to a debugger (or just rebuild with extra flb_plg_debug lines within this do:

)

Expected behavior
Running a list against the k8s cluster should only be done at startup or after our last resourceVersion is considered too old by k8s (it returns a 410 when the requested version is too old). After that, we should follow efficient-detection-of-changes, which uses the resourceVersion of the EventList (not of the individual events) to create a chunked stream of updates.
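The flow described above can be sketched as a list-then-watch loop. This is an illustration of the pattern, not the plugin's implementation: `Gone`, `follow_events`, and the `list_events` / `watch_events` callables are hypothetical stand-ins for the HTTP calls (a list of /api/v1/events, then a watch started from the returned resourceVersion), and `max_watches` exists only so the sketch terminates.

```python
class Gone(Exception):
    """Stand-in for an HTTP 410 response: the resourceVersion is too old."""


def follow_events(list_events, watch_events, handle, max_watches=10):
    """List once, then watch from the list's resourceVersion; re-list on 410.

    list_events()    -> (items, resource_version_of_the_list)
    watch_events(rv) -> iterable of watch events {"type": ..., "object": {...}}
    handle(obj)      -> called for every event object seen
    """
    rv = None
    for _ in range(max_watches):
        if rv is None:
            # Full list: only at startup or after the server answered 410.
            items, rv = list_events()
            for item in items:
                handle(item)
        try:
            # Watch resumes from the last resourceVersion we have seen.
            for event in watch_events(rv):
                obj = event["object"]
                handle(obj)
                rv = obj["metadata"]["resourceVersion"]
        except Gone:
            # Our version is too old; drop it and start over with a list.
            rv = None
    return rv
```

The point of the design is that a normal watch timeout reconnects with the last known resourceVersion and transfers only new events, while the expensive full list happens only on startup and after a 410.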

Your Environment

  • Version used: 2.2.0

github-actions bot commented Apr 7, 2024

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions bot added the Stale label Apr 7, 2024

This issue was closed because it has been stalled for 5 days with no activity.

github-actions bot closed this as not planned Apr 13, 2024