[Core] Stale ray_cluster_<state>_nodes metrics #50735

Open
jleben opened this issue Feb 19, 2025 · 1 comment
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
good-first-issue: Great starter issue for someone just starting to contribute to Ray
observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
P1: Issue that should be fixed within a few weeks

Comments

jleben commented Feb 19, 2025

What happened + What you expected to happen

I am observing stale values for ray_cluster_active_nodes and ray_cluster_pending_nodes metrics.

Example:

Dashboard shows this (accurate):

Image

However, http://10.212.14.97:8080/metrics shows this (inaccurate):

# HELP ray_cluster_active_nodes Active nodes on the cluster
# TYPE ray_cluster_active_nodes gauge
ray_cluster_active_nodes{SessionName="session_2025-02-12_19-51-59_037643_1",Version="2.41.0",node_type="headgroup"} 1.0
ray_cluster_active_nodes{SessionName="session_2025-02-12_19-51-59_037643_1",Version="2.41.0",node_type="worker2"} 1.0
ray_cluster_active_nodes{SessionName="session_2025-02-12_19-51-59_037643_1",Version="2.41.0",node_type="worker4"} 1.0
ray_cluster_active_nodes{SessionName="session_2025-02-12_19-51-59_037643_1",Version="2.41.0",node_type="worker8"} 6.0
ray_cluster_active_nodes{SessionName="session_2025-02-12_19-51-59_037643_1",Version="2.41.0",node_type="worker16"} 3.0
# HELP ray_cluster_pending_nodes Pending nodes on the cluster
# TYPE ray_cluster_pending_nodes gauge
ray_cluster_pending_nodes{SessionName="session_2025-02-12_19-51-59_037643_1",Version="2.41.0",node_type="worker2"} 1.0
ray_cluster_pending_nodes{SessionName="session_2025-02-12_19-51-59_037643_1",Version="2.41.0",node_type="worker4"} 1.0
ray_cluster_pending_nodes{SessionName="session_2025-02-12_19-51-59_037643_1",Version="2.41.0",node_type="worker8"} 3.0
ray_cluster_pending_nodes{SessionName="session_2025-02-12_19-51-59_037643_1",Version="2.41.0",node_type="worker16"} 1.0

Versions / Dependencies

Ray version 2.41.0
KubeRay version 1.2.2

Reproduction script

I've reproduced this multiple times in the context of KubeRay:

  • I restart the head pod, and the metrics for worker nodes correctly reset to 0.
  • I run some jobs, allowing the cluster to scale up and then back down to zero worker nodes.
  • The metrics for worker nodes are now inaccurate (non-zero).
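
A minimal sketch of how the last step could be checked automatically, assuming the metrics endpoint serves the plain-text format shown above. The helper names (`parse_node_gauges`, `nonzero_worker_gauges`, `fetch_metrics`) are hypothetical and not part of Ray; the head node type `headgroup` is taken from the output above and may differ in other clusters.

```python
import re
import urllib.request

# Matches sample lines such as:
#   ray_cluster_active_nodes{...,node_type="worker8"} 6.0
# An optional trailing integer timestamp is tolerated.
_SAMPLE = re.compile(
    r'^(ray_cluster_(?:active|pending)_nodes)\{(.*)\}\s+(\S+)(?:\s+\d+)?\s*$'
)
_NODE_TYPE = re.compile(r'node_type="([^"]*)"')


def parse_node_gauges(text):
    """Return {(metric_name, node_type): value} for the two node gauges."""
    gauges = {}
    for line in text.splitlines():
        match = _SAMPLE.match(line.strip())
        if match is None:
            continue  # "# HELP" / "# TYPE" comments and unrelated metrics
        name, labels, value = match.groups()
        node_type = _NODE_TYPE.search(labels)
        if node_type is None:
            continue
        gauges[(name, node_type.group(1))] = float(value)
    return gauges


def nonzero_worker_gauges(gauges, head_type="headgroup"):
    """Keep only non-head entries whose value is not zero."""
    return {
        key: value
        for key, value in gauges.items()
        if key[1] != head_type and value != 0.0
    }


def fetch_metrics(url, timeout=5.0):
    """Download the metrics page as text (plain HTTP GET)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

After the cluster has scaled back down to zero workers, `nonzero_worker_gauges(parse_node_gauges(fetch_metrics("http://10.212.14.97:8080/metrics")))` would be expected to return an empty dict; any remaining entries are the stale series described here.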

Issue Severity

Low: It annoys or frustrates me.

jleben added the bug and triage (Needs triage, e.g. priority, bug/not-bug, and owning component) labels on Feb 19, 2025
jcotant1 added the core label on Feb 19, 2025
jjyao added the P1, observability, and good-first-issue labels and removed the triage label on Feb 20, 2025
jjyao (Collaborator) commented Feb 20, 2025

for node_type, pending_node_count in pending_nodes_dict.items():
    records_reported.append(
        Record(
            gauge=METRICS_GAUGES["cluster_pending_nodes"],
            value=pending_node_count,
            tags={"node_type": node_type},
        )
    )

I think the issue is that when pending_node_count for a node_type becomes 0, it's removed from the pending_nodes_dict so we don't have a chance to emit a gauge with value 0.
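
One possible direction, sketched under the assumption that the diagnosis above is right: remember every node_type that has ever been reported and emit an explicit 0 for those missing from the current dict. This is not Ray's actual code or an agreed fix; `Record` here is a simplified stand-in, and `PendingNodesReporter` is a hypothetical name. The active-nodes gauge presumably needs the same treatment, though that is not confirmed in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Simplified stand-in for the Record type used in the snippet above."""
    gauge: str
    value: float
    tags: dict = field(default_factory=dict)


class PendingNodesReporter:
    """Builds gauge records, zero-filling node types that dropped out."""

    def __init__(self):
        # Every node_type reported at least once since startup.
        self._seen_node_types = set()

    def build_records(self, pending_nodes_dict):
        # Remember any newly observed node types.
        self._seen_node_types.update(pending_nodes_dict)
        records = []
        for node_type in sorted(self._seen_node_types):
            records.append(
                Record(
                    gauge="cluster_pending_nodes",
                    # Types absent from the dict are reported as 0, so the
                    # exported gauge is overwritten instead of left stale.
                    value=pending_nodes_dict.get(node_type, 0),
                    tags={"node_type": node_type},
                )
            )
        return records
```

The trade-off in this sketch is that a node type, once seen, keeps being exported with value 0 until the process restarts, which matches the reset-to-0 behavior reported after a head pod restart.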
