Enhance compaction Job metrics to expose the reason for Job Failures #1037

anveshreddy18 · 2025-03-17T13:52:02Z

How to categorize this issue?

/area monitoring
/area control-plane
/kind enhancement

What would you like to be added:

At present, the metrics that druid exposes for compaction jobs are minimal: jobs are categorised only as succeeded or failed. In reality the failed category is broad. A job can fail for a multitude of reasons, the most prominent being:

Process Failure
Deadline Exceeded
Preemption
Eviction

Eviction itself can have multiple sub-reasons, such as DeletionByTaintManager, EvictionByEvictionAPI, TerminationByKubelet, etc.
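As a rough illustration of the categorisation described above, here is a minimal Go sketch that maps a Job's failure condition reason and a pod's disruption reason to a single failure-reason label. The function name, the label values and the counter type are hypothetical and not taken from etcd-druid. It also assumes that the reasons arrive as plain strings and that `PreemptionByScheduler` and `BackoffLimitExceeded` are the reasons Kubernetes reports for preemption and for exhausted retries.

```go
package main

// Failure reason label values. These names are hypothetical and only
// illustrate the categories listed in this issue.
const (
	ReasonProcessFailure   = "processFailure"
	ReasonDeadlineExceeded = "deadlineExceeded"
	ReasonPreempted        = "preempted"
	ReasonEvicted          = "evicted"
	ReasonUnknown          = "unknown"
)

// classifyFailure maps the reason of the Job's Failed condition and the
// reason of the pod's disruption condition (empty if the pod was not
// disrupted) to one of the label values above.
func classifyFailure(jobFailedReason, podDisruptionReason string) string {
	// A Job that ran past its deadline is reported on the Job itself.
	if jobFailedReason == "DeadlineExceeded" {
		return ReasonDeadlineExceeded
	}
	// External disruptions are checked before process failure, because a
	// disrupted pod can also cause the Job to run out of retries.
	switch podDisruptionReason {
	case "PreemptionByScheduler":
		return ReasonPreempted
	case "DeletionByTaintManager", "EvictionByEvictionAPI", "TerminationByKubelet":
		return ReasonEvicted
	}
	// No disruption recorded and retries exhausted: treat it as a failure
	// of the compaction process itself.
	if jobFailedReason == "BackoffLimitExceeded" {
		return ReasonProcessFailure
	}
	return ReasonUnknown
}

// failureCounts is a stand-in for a labelled counter metric: it counts
// failed jobs per failure reason.
type failureCounts map[string]int

// record classifies one failed job and increments the matching bucket.
func (c failureCounts) record(jobFailedReason, podDisruptionReason string) {
	c[classifyFailure(jobFailedReason, podDisruptionReason)]++
}
```

In a real controller the per-reason counts would more likely be a Prometheus counter with a failure-reason label than a map; the map here only keeps the sketch free of external dependencies.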

Knowing the failure reason matters because the gardener project scrapes these compaction metrics from druid through prometheus for monitoring and alerting. We have been seeing a lot of jobs tagged as failed, but after monitoring them for some time we found that most of these were not caused by actual process failures. Most often they were caused by external disruptions or by the job exceeding its deadline. So it is important that we expose the actual reason for a failure instead of only reporting that the job failed. This makes it clear what is causing the failures and which actions can be taken to minimise them.

This is important not just for the gardener project but also for external, independent consumers of etcd-druid. If they have enabled the compaction job metrics, they can scrape the prometheus metrics exposed by etcd-druid to better monitor and understand the compaction job lifecycle.

Why is this needed:

To improve monitoring of compaction jobs and to better understand why jobs fail, so that targeted action can be taken to minimise such failures.

anveshreddy18 self-assigned this Mar 17, 2025

gardener-robot added area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension labels Mar 17, 2025

anveshreddy18 added this to the v0.29.0 milestone Mar 17, 2025

anveshreddy18 mentioned this issue Mar 17, 2025

Enhance compaction metrics by segregating them into various categories #1039

Merged

anveshreddy18 closed this as completed in #1039 Mar 27, 2025

gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Mar 27, 2025
