Enhance compaction Job metrics to expose the reason for Job Failures #1037

Closed
anveshreddy18 opened this issue Mar 17, 2025 · 0 comments · Fixed by #1039
Labels
area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension status/closed Issue is closed (either delivered or triaged)

@anveshreddy18
Contributor

How to categorize this issue?

/area monitoring
/area control-plane
/kind enhancement

What would you like to be added:

At present, the metrics that druid exposes for compaction jobs are minimal: each job is categorised only as succeeded or failed. In practice the failed category is broad. A job can fail for many reasons, the most prominent being:

  1. Process Failure
  2. Deadline Exceeded
  3. Preemption
  4. Eviction

Eviction itself can have several sub-reasons, such as DeletionByTaintManager, EvictionByEvictionAPI and TerminationByKubelet.
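As a rough illustration of the categories above, here is a minimal sketch of how a failed job could be mapped to one of them. It assumes the reason is read from the Job's `Failed` condition and from the Pod's `DisruptionTarget` condition. The `FailureReason` type, its constant values and the `classifyFailure` helper are hypothetical names for this sketch and not existing druid code.

```go
package main

import "fmt"

// FailureReason is a hypothetical category for a failed compaction job.
type FailureReason string

const (
	ReasonProcessFailure   FailureReason = "processFailure"
	ReasonDeadlineExceeded FailureReason = "deadlineExceeded"
	ReasonPreemption       FailureReason = "preemption"
	ReasonEviction         FailureReason = "eviction"
)

// classifyFailure maps the reason of the Job's Failed condition and the
// reason of the Pod's DisruptionTarget condition (empty if the Pod has no
// such condition) onto one of the four categories from the list above.
// Assumption: anything that is neither a deadline nor a disruption is
// treated as a failure of the compaction process itself.
func classifyFailure(jobReason, podDisruptionReason string) FailureReason {
	if jobReason == "DeadlineExceeded" {
		return ReasonDeadlineExceeded
	}
	switch podDisruptionReason {
	case "PreemptionByScheduler":
		return ReasonPreemption
	case "DeletionByTaintManager", "EvictionByEvictionAPI", "TerminationByKubelet":
		return ReasonEviction
	}
	return ReasonProcessFailure
}

func main() {
	fmt.Println(classifyFailure("DeadlineExceeded", ""))
	fmt.Println(classifyFailure("BackoffLimitExceeded", "EvictionByEvictionAPI"))
	fmt.Println(classifyFailure("BackoffLimitExceeded", ""))
}
```

The deadline check comes first in this sketch so that a job which ran out of time is not reported as a disruption. Whether that ordering is right for druid is an open design choice.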

Knowing the failure reason matters because the Gardener project scrapes these compaction metrics from druid through Prometheus for monitoring and alerting. We have been seeing many jobs tagged as failed, but after monitoring them for some time we found that these are usually not actual process failures. Most often they are caused by external disruptions or by the job exceeding its deadline. It is therefore important to expose the actual reason for the failure instead of only reporting that the job failed. This makes it clear what is causing the failures and what actions can be taken to minimise them.
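To make the request concrete, here is a small self-contained sketch of a job counter that carries a failure-reason label next to the succeeded/failed flag. It uses only the Go standard library and mimics the Prometheus text format. The metric name `compaction_jobs_total`, the label name `failure_reason` and the `jobCounter` type are assumptions for illustration, not druid's actual metric definitions.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// jobCounter is a hypothetical stand-in for a labelled Prometheus counter.
// It counts jobs per (succeeded, failure_reason) label pair.
type jobCounter struct {
	name   string
	counts map[[2]string]int
}

func newJobCounter(name string) *jobCounter {
	return &jobCounter{name: name, counts: map[[2]string]int{}}
}

// inc records one finished job with the given label values.
func (c *jobCounter) inc(succeeded, failureReason string) {
	c.counts[[2]string{succeeded, failureReason}]++
}

// render returns one sample line per label pair, sorted for stable output.
func (c *jobCounter) render() string {
	lines := make([]string, 0, len(c.counts))
	for labels, n := range c.counts {
		lines = append(lines, fmt.Sprintf("%s{succeeded=%q,failure_reason=%q} %d",
			c.name, labels[0], labels[1], n))
	}
	sort.Strings(lines)
	return strings.Join(lines, "\n")
}

func main() {
	c := newJobCounter("compaction_jobs_total") // assumed metric name
	c.inc("true", "none")
	c.inc("false", "deadlineExceeded")
	c.inc("false", "deadlineExceeded")
	c.inc("false", "eviction")
	fmt.Println(c.render())
}
```

With a label like this, an alert could be limited to process failures, while deadline and disruption failures are tracked separately.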

This matters not only for the Gardener project but also for independent consumers of etcd-druid. If they have enabled the compaction job metrics, they can scrape them from etcd-druid through Prometheus to better monitor and understand the compaction job lifecycle.

Why is this needed:

To improve monitoring of compaction jobs and to better understand why jobs fail, so that targeted action can be taken to minimise such failures.

@anveshreddy18 anveshreddy18 self-assigned this Mar 17, 2025
@gardener-robot gardener-robot added area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension labels Mar 17, 2025
@anveshreddy18 anveshreddy18 added this to the v0.29.0 milestone Mar 17, 2025
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Mar 27, 2025