-
Notifications
You must be signed in to change notification settings - Fork 908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add error.hint semantic convention #822
Conversation
How do you see the relation of this PR to #784's |
Btw, @Oberon00 are you aware that there is a meeting of Errors WG every week? As it seems that you have great interest in this topic, care to join us there for real-time discussion? |
I copied the meeting into my calendar, thanks for the reminder 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requesting changes because I think this should be separate from exceptions.
Based on #706 (comment), I'm marking this after-ga for now. |
If we don't have this (or something similar) for GA, we cannot remove span status. At the very least, Jaeger exporters depend on this information as I linked in the description (repeated below):
|
@mwear Is this "error" tag in Jaeger somehow specified? Or is it just a de facto convention? |
It's a semantic convention from OpenTracing that Jaeger has built functionality around: https://github.com/opentracing/specification/blob/master/semantic_conventions.md#span-tags-table, |
From the SIG mtg: We have Jaeger in our GA requirements and Jaeger more or less requires the ability to mark errored spans, so marking this as required for GA. |
@Oberon00 we have required to support exporting Jaeger which does not require to support a way to record error in the API. You can argue that we have OpenTracing -> OpenTelemetry bridge to support which require a solution for "error = true". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My original complaint is resolved (#822 (comment)). Thank you!
But I'm still marking this as "Request changes" until we have resolved how to deal with ever-growing parameter count for RecordException, see my review comment in api.md #822 (comment).
## Error Attributes | ||
|
||
A span or event can be annotated with an `error.hint` attribute to indicate that | ||
an error condition was observed. If an [exception event](exceptions) is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please split the exception special case in a separate paragraph and ensure that it is clear that this attribute can be used without exceptions. With the way it's currently worded, it seems like this can only be used with exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On similar lines, do we want to define a recommended mapping of http status code to error hint? Is that a use case for this attribute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this would be a good idea. If you do that, you don't know if the error hint comes from an "intelligent" guess or is just a redundantly mapped value. Or maybe we should make error.hint an enum / add an additional attribute to indicate "this is based on another (main status) attribute" vs "this is a more intelligent guess" (e.g. a 404 is unexpected here).
I have no strong opinion on this yet though.
@carlosalberto @bogdandrutu If you think this is after-GA, please relabel. I only relabelled this because I thought this was our decision in the SIG meeting. |
I would prefer to resolve this before GA. Especially given that there is an intent to remove Span Status. We will loose normalized ability to indicate errors in the spans, and many products need some hint to show erroneous spans in the UI. |
Agreed. For any other semantic convention, I'd be fine to postpone it, but this is a special case (also, the Jaeger exporter needs a way to report errors). Edit: No need to re-label, as it actually stands as "required-for-ga" already ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mwear Please add a changelog entry.
Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com>
@open-telemetry/specs-approvers @open-telemetry/technical-committee Accepting and merging this PR is one of the ways to speed up open-telemetry/oteps#134. Is it possible to expedite this? |
as for the `AddEvent` method). If attributes with the same name would be | ||
generated by the method already, the additional attributes take precedence. | ||
|
||
- A boolean error hint, as defined by the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a question I'd like to have an answer to before merging this: Why is this additional parameter in the RecordException spec instead of the AddEvent spec. Do we expect more intelligent guesses (beyond exception.escaped / #784) to be more likely for exceptions than any other event? Personally, I don't see a reason for that to be true, but there may be one.
I would thus propose to move this parameter to AddEvent. The "Note: RecordException
may be seen as a variant of AddEvent
with additional exception-specific parameters and all other parameters being optional" then implicitly already allows languages to add error.hint (as well as event name and timestamp) as additional optional parameters, though I would be fine with making these more explicit or even SHOULD requirements here (I'm also OK with a specific MUST requirement for error.hint for RecordException that references the actual spec wording in AddEvent.
The rationale being, that I'd like to see error.hint not unnecessarily coupled with exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would thus propose to move this parameter to AddEvent.
Why? I imagine two main cases for this semantic convention:
- There was some exception and you want to mark it as an actual irrecoverable error, thus you call
RecordException(error_hint=true)
- There was something wrong with some parameters, a validation failed, or some timeout happened, and you want to set the convention directly (no need for an event to happen), without an actual event, so you call
SetAttribute()
for it.
If users actually want to have a error_hint
parameter for AddEvent()
we will know for sure, and we can add it in the future. It's easier to add it than to remove it later on IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- There was some exception and you want to mark it as an actual irrecoverable error, thus you call
RecordException(error_hint=true)
First, I don't think that it is usual to know whether an exception is irrecoverable or not, usually you only know if it escaped. Second, I would be OK with leaving the parameter a MUST for RecordException, I just don't want the primary definition to be associated with exceptions more than required. I fear that error.hint could become a more widely used but less consistent synonym for exception.escaped, and am happy about every bit of distance it has from exceptions.
- There was something wrong with some parameters, a validation failed, or some timeout happened, and you want to set the convention directly (no need for an event to happen), without an actual event, so you call
SetAttribute()
for it.
It's true that with the current semantic conventions, exceptions are special in that they are the only AFAIK semantic convention for events yet (besides gRPC stream messages). Yet no one is arguing to remove AddEvent. I just think that error.hint is more logical there but I agree that currently there is no use case for having it in AddEvent (other than manually building an exception event).
It's easier to add it than to remove it later on IMHO.
As far as that goes, since #874 it is possible to have any additional attributes in RecordException, so the errorHint
parameter could be replaced by calling RecordException with that additional attribute and then calling SetAttribute("error.hint", true)
(except for the "only set to false if not set yet" part, which would have to be substituted by "only call SetAttribute if true"). The main thing we accomplish by having this parameter, is guiding users to use the semantic attribute. And personally I am not sure if having an "intelligent guess" that goes beyond exception.escaped is common enough to have it baked into the API. To that end, I would be fine with removing the API changes without replacement (instead of moving/copying the parameter to AddEvent) too, maybe it would be even better after all to move the API changes to a separate PR.
@open-telemetry/specs-approvers Please review/approve this PR. As @iNikem said, this is blocking an OTEP, so your help is needed (and I don't want to play the dictator by approving/merging this based on my own judgment ;) ) |
Looks like the are still disagreements about how the API should look like. In order to make progress I suggest that we split this PR into 2 parts:
|
@Oberon00 can you please attend the Error WG meeting this Thursday 8am PT? We want to make the final call on this PR and on the error.hint (provided that is not resolved offline in this thread before that). |
I appreciate the willingness to split this up to facilitate moving this and related PRs forward, but am conflicted in that I think it's essential that we have an API to help manage the interplay between spans and events when setting the |
I don't think that's all that concerning. We also don't have an API for any other semantic convention. OpenTracing also did not have an API for setting the error attribute AFAIK. EDIT: You still can set error.hint using the additional attributes argument to RecordException plus SetAttribute on the span.
Why is that? |
I had two main objectives when opening this PR:
As a result of the discussions on the PR, we realized that if you also add the an To answer your questions @Oberon00: Why is the API important?
Why might we want to keep span status without the API component of this PR? Revisiting the two objectives for this PR mentioned above, one was having a span level indication of whether or not an error (likely) was observed, and the other was an API to encourage users to think about, and set that attribute when recording an exception. Without the API we are left with the span level indicator of (likely) error. A non-ok span status can be used as a proxy for that situation. One benefit is that it is more nuanced than a boolean, another benefit is that is has a first class API method on the span to help and encourage users to set it. One downside, is that the API method doesn't encourage a user to to consider setting it when recording an exception, as the current proposal does, but given the choice between an |
Regardless of whether it is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that could convince me that the API is useful would be concrete examples of instrumentations that would set error.hint different from exception.escaped.
Easy: servlet auto-instrumentation which sets |
That makes sense, but would it call RecordException for that? |
Oi, sorry, of course not :) I forgot for a second that we are talking about Regarding |
It's exception.escaped, not error.escaped, but I think you described the meaning correctly. #761 has an example and a definition of "escaped". |
Yes, I think so. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As pointed in another comment #784 (comment) we should make a clear recommendation when something is API concept vs Semantic convention. We should not allow some implementation to have an API concept that others have as semantic convention because there may be different requirements for backwards compatibility between the two for example.
@bogdandrutu Probably it is too late for you to see this now, but maybe you can jump into today's (8 AM Pacific) error WG meeting? https://www.google.com/calendar/event?eid=MXM5Y3QydWhhdGltZjY3cDNiOHNxN2szaWJfMjAyMDA5MDNUMTUwMDAwWiBnb29nbGUuY29tX2I3OWUzZTkwajdiYnNhMm4ycDVhbjVsZjYwQGc&ctz=America/Los_Angeles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the error WG meeting just now (see Notes on Google Docs), I think we should not merge this PR, not even just the error.hint convention without API change, currently. The reason is that it is unclear when you are supposed to set error.hint:
- Is it a "weak" hint that is to be used by backends as a fallback when it cannot determine whether this was an error
otherwise? I.e. is this a helper hint for backends which do not interpret http.status_code and I should set it on an HTTP span whenever I have a status >=400? - Is it a "strong" hint that should override the backends' decision on whether this is an error? For example I might have an "delete_user_account" or "force_abort_operation" operation that I want to treat as an error even when it has a status of 200, or I might have a low priority "is this ready" API where a 503 is not considered an error.
- There is the third case, that @mwear pointed out, that you are tracing some operation for which there are no other semantic conventions to tell whether it is an error or not (for example I want to trace some internal function in my code). In that case, it would not matter if you have a strong or a weak hint.
It seems that jaeger/OpenTelemetry error tag was used mostly as a "weak" hint and was expected to be set whenever there was an indication of an error. But maybe we want both things. I'd say it would be best if we did not put the burden of providing a weak hint on instrumentations. And a strong hint can usually not be supplied by generic instrumentations but only by application's manual instrumentation (if at all).
One more thought: We may even think about a convention error.hint.<priority-integer/enum>
. Or maybe we should record an indicator on whether this is a generic or application instrumentation in the InstrumentationLibraryInfo. Lot's of things to be considered...
I agree. This is not ready for merging, still too much disagreement. I believe @tedsuo is going to come up with an alternate proposal. |
Do I sense that this is gravitating towards my earlier proposal to use SeverityNumber? :-) |
We would have a severity linked with a success value. So you could have a success with a high "severity". More a decision weight / probability of being right. But this is just an idea I threw out, probably too complex to be useful anyway. |
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
Closing this in favor of: #965. |
Changes
This PR adds an
error.hint
semantic convention, and updates the span.RecordException API method to optionally set it.The purpose of this attribute is to convey an instrumentation author's best guess as to whether an exception represents an error condition. This will be a flag that tracing backends can use to identify candidate spans where an error may have been observed.
Tracing backends can use this information:
It serves a similar purpose to the
error: true
tag from OpenTracing. It's callederror.hint
rather thanerror
in order to impart that this is not a definitive decision and a tracing backend may override it.If we remove the
Span.status
API as proposed in in #706 a similar semantic convention will be needed as Jaeger exporters from a number of projects use it as a proxy for theerror = true
OpenTracing semantic convention.See:
Related issues #599 #706