Enhance compaction Job metrics to expose the reason for Job Failures #1037
Labels
area/monitoring
Monitoring (including availability monitoring and alerting) related
kind/enhancement
Enhancement, improvement, extension
status/closed
Issue is closed (either delivered or triaged)
How to categorize this issue?
/area monitoring
/area control-plane
/kind enhancement
What would you like to be added:
At present, the metrics exposed by druid for compaction jobs are minimal: they categorise jobs only as `succeeded` or `failed`. In reality the `failed` category is broad, and a job can fail for a multitude of reasons. There can also be multiple sub-reasons for `Eviction`, such as `DeletionByTaintManager`, `EvictionByEvictionAPI`, `TerminationByKubelet`, etc.

Knowing the failure reason matters because the gardener project scrapes these compaction metrics from druid through prometheus for monitoring and alerting. We have been seeing a lot of jobs tagged as `failed`, but after monitoring for some time we found that most of these are not caused by actual process failures; they are most often caused by external disruptions or by the job exceeding its deadline. So it is important to expose the actual reason for a failure instead of only reporting that the job failed. That makes it clear what is causing the failure and which actions can be taken to minimise such failures.

This is important not just for the gardener project but also for external, independent consumers of `etcd-druid`, who can scrape the prometheus metrics that `etcd-druid` exposes for compaction jobs (if they have enabled them) to better monitor and understand the compaction job lifecycle.

Why is this needed:
To improve monitoring of the compaction jobs to better understand the reasons for job failures so that necessary targeted action can be taken to minimise such failures.
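As a rough illustration of the categorisation requested above, the sketch below maps a failed job to a reason and an optional sub-reason that could be attached to a metric as labels. It is not druid's implementation: the `Condition` struct is a simplified stand-in for the Kubernetes Job and Pod condition types, and the `failureReason` function, the category names, and the rule that any other `Failed` job counts as a process failure are all assumptions made for the example. A plain map stands in for a labelled prometheus counter.

```go
package main

import "fmt"

// Condition is a hypothetical, simplified stand-in for the Kubernetes
// Job and Pod condition types (only the fields used here).
type Condition struct {
	Type   string
	Status string
	Reason string
}

// Assumed top-level failure categories; the names are illustrative only.
const (
	ReasonDeadlineExceeded = "DeadlineExceeded"
	ReasonEviction         = "Eviction"
	ReasonProcessFailure   = "ProcessFailure"
	ReasonUnknown          = "Unknown"
)

// failureReason derives a (reason, subReason) pair for a failed compaction job
// from the job's conditions and the conditions of its pod.
func failureReason(jobConds, podConds []Condition) (reason, subReason string) {
	// An external disruption shows up on the pod as a DisruptionTarget
	// condition whose Reason names the cause, e.g. EvictionByEvictionAPI.
	for _, c := range podConds {
		if c.Type == "DisruptionTarget" && c.Status == "True" {
			return ReasonEviction, c.Reason
		}
	}
	for _, c := range jobConds {
		if c.Type == "Failed" && c.Status == "True" {
			if c.Reason == "DeadlineExceeded" {
				return ReasonDeadlineExceeded, ""
			}
			// Assumption: every other Failed reason is a process failure.
			return ReasonProcessFailure, ""
		}
	}
	return ReasonUnknown, ""
}

func main() {
	// Stand-in for a counter metric with "reason" and "sub_reason" labels.
	failures := map[[2]string]int{}

	reason, sub := failureReason(
		[]Condition{{Type: "Failed", Status: "True", Reason: "BackoffLimitExceeded"}},
		[]Condition{{Type: "DisruptionTarget", Status: "True", Reason: "DeletionByTaintManager"}},
	)
	failures[[2]string{reason, sub}]++

	reason, sub = failureReason(
		[]Condition{{Type: "Failed", Status: "True", Reason: "DeadlineExceeded"}},
		nil,
	)
	failures[[2]string{reason, sub}]++

	for labels, n := range failures {
		fmt.Printf("reason=%q sub_reason=%q count=%d\n", labels[0], labels[1], n)
	}
}
```

Checking the pod's disruption condition before the job's own condition is a design choice of this sketch, so that an evicted job is not reported as a process failure; the real precedence would need to be decided in druid.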