
[service] Internal counter metrics now have a _total suffix in Prometheus #12458

Closed
jade-guiton-dd opened this issue Feb 21, 2025 · 8 comments · Fixed by #12500
Labels
area:service · bug (Something isn't working) · collector-telemetry (healthchecker and other telemetry collection issues)

Comments

@jade-guiton-dd
Contributor

jade-guiton-dd commented Feb 21, 2025

Component(s)

service

What happened?

Describe the bug

Between v0.118.0 and v0.120.0, the name of internal counter metrics exposed through the Prometheus endpoint gained a _total suffix, breaking the stability of internal metrics.

Steps to reproduce

Exercise the OTLP receiver (sending a span for example) with an explicitly configured Prometheus reader.
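
For reference, a minimal sketch of a config that reproduces this (the OTLP receiver and debug exporter wiring here is illustrative; the relevant part is the explicitly configured Prometheus reader under service::telemetry):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  debug:

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
  telemetry:
    metrics:
      readers:
        # explicit reader list; this replaces the default telemetry config
        - pull:
            exporter:
              prometheus:
                host: '0.0.0.0'
                port: 8888

Send a span through the OTLP receiver, then scrape http://localhost:8888/metrics.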

What did you expect to see?

Old output on the Prometheus endpoint:

# HELP otelcol_receiver_accepted_spans Number of spans successfully pushed into the pipeline. [alpha]
# TYPE otelcol_receiver_accepted_spans counter
otelcol_receiver_accepted_spans{receiver="otlp/1",service_instance_id="dfcfad0f-6d19-4340-9d76-1e0df4883526",service_name="otelcol",service_version="0.118.0",transport="grpc"} 123

What did you see instead?

New output on the Prometheus endpoint:

# HELP otelcol_receiver_accepted_spans_total Number of spans successfully pushed into the pipeline. [alpha]
# TYPE otelcol_receiver_accepted_spans_total counter
otelcol_receiver_accepted_spans_total{otel_scope_name="go.opentelemetry.io/collector/receiver/receiverhelper",otel_scope_version="",receiver="otlp/1",transport="grpc"} 123

Collector version

v0.120.0

OpenTelemetry Collector configuration

service:
  telemetry:
    metrics:
      readers:
        - pull:
            exporter:
              prometheus:
                host: '0.0.0.0'
                port: 8888

Additional context

This was reported by Jon Bates on the CNCF Slack.

I believe this was introduced by #11611. The previous code for instantiating the Prometheus exporter explicitly added the WithoutCounterSuffixes option, but the new code now relies on the go-contrib config module to instantiate it based on the user's config.

While the default telemetry config has a Prometheus reader with the corresponding without_type_suffix: true option set, I believe that setting your own list of readers without specifying without_type_suffix will make it default to false, adding the _total suffix to counter metrics.

A workaround is to add without_type_suffix: true to the prometheus: section.
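
For example, a sketch based on the configuration above, with only the without_type_suffix line added:

service:
  telemetry:
    metrics:
      readers:
        - pull:
            exporter:
              prometheus:
                host: '0.0.0.0'
                port: 8888
                without_type_suffix: true  # drop the _total suffix on counter metrics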

@jade-guiton-dd added the area:service, bug, and collector-telemetry labels on Feb 21, 2025
@jade-guiton-dd changed the title from "[service] Internal counter metrics now have a _total suffix" to "[service] Internal counter metrics now have a _total suffix in Prometheus" on Feb 21, 2025
@mx-psi
Member

mx-psi commented Feb 21, 2025

cc @codeboten

@spadger

spadger commented Feb 21, 2025

I had a play: it was introduced in v0.119.0, and it's still present in v0.120.0.

@bogdandrutu
Member

> Between v0.118.0 and v0.120.0, the name of internal counter metrics exposed through the Prometheus endpoint gained a _total suffix, breaking the stability of internal metrics.

Not sure about that, since in Prometheus the _total suffix is expected for counters and both representations are valid. There are other problems, though, such as the unit being appended to the metric name when one is set. That can also be disabled with a config option.
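
If I'm reading the go-contrib config schema right, the reader config exposes a similar toggle for the unit suffix; the without_units field name below is an assumption on my part, so double-check it against the schema:

# under service::telemetry::metrics:
readers:
  - pull:
      exporter:
        prometheus:
          host: '0.0.0.0'
          port: 8888
          without_type_suffix: true  # drop the _total suffix on counters
          without_units: true        # assumed field name; keeps unit suffixes out of metric names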

@jade-guiton-dd
Contributor Author

jade-guiton-dd commented Feb 24, 2025

@bogdandrutu I see. Does that mean you would argue that dashboards breaking because of the change is the backend's responsibility, for not properly parsing out this suffix?

@jade-guiton-dd
Contributor Author

The "internal telemetry" page in the docs states that "A stable metric without a deprecated signature will not be deleted or renamed.". For this to not be breaking stability, we would need to argue that the "_total" suffix in the Prometheus exporter output is not part of the name of the metric, but a Prometheus-only indicator of the type of the metric.

@codeboten
Contributor

The suffix being appended is a result of end user configuration being passed in directly to the config package.

A user that is using the default Prometheus configuration would not see this as a breaking change, as the without_type_suffix value is set to true. I agree that if I were a user who had been using the readers configuration and my metrics suddenly started reporting with _total, it would be confusing. Could this be documented as a required configuration field, to avoid end users having to change their dashboards?

We could possibly set without_type_suffix to true when end users do not configure this setting; however, that may end up being more confusing for end users.

@spadger

spadger commented Feb 24, 2025

As the person who found this issue (for what it's worth), I felt that having to add a single-line config update to remove the suffix was a perfectly reasonable fix. This is a technical product for technical people, and the changelog gives clear instructions for breaking changes.

It's true that the Prometheus spec only says counters SHOULD end with _total, while the OpenMetrics spec mandates it. Whatever you decide, it'd be handy to retroactively add the current change of behaviour to the v0.119 and v0.120 change notes (if that's possible).

@bogdandrutu
Member

@codeboten, let's at least document in both changelogs (0.119/0.120) what users should do / expect. I am OK to move forward with this "breaking" change.

codeboten added a commit to codeboten/opentelemetry-collector that referenced this issue Feb 26, 2025
Fixes open-telemetry#12458

github-merge-queue bot pushed a commit that referenced this issue Feb 28, 2025

Fixes #12458