-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zipkin Receiver: default timestamp #1068
Conversation
@@ -123,7 +123,7 @@ func (ze *zipkinExporter) PushTraceData(_ context.Context, td consumerdata.Trace | |||
|
|||
body, err := ze.serializer.Serialize(tbatch) | |||
if err != nil { | |||
return len(td.Spans), err | |||
return len(td.Spans), consumererror.Permanent(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the serialization failure to a permanent error because it had caused an infinite loop in the retry processor otherwise. Is there any reason we'd want to retry on a serialization error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of any reason to retry them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the value of spans with no timestamps? Is there a particular reason to have them in zipkin?
Btw we will need to remove the OC
part from the zipkin -> OC -> internal
translation later on. So we need to think how to represent such spans in internal data structure if cannot discard them.
@@ -123,7 +123,7 @@ func (ze *zipkinExporter) PushTraceData(_ context.Context, td consumerdata.Trace | |||
|
|||
body, err := ze.serializer.Serialize(tbatch) | |||
if err != nil { | |||
return len(td.Spans), err | |||
return len(td.Spans), consumererror.Permanent(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of any reason to retry them
now := internal.TimeToTimestamp(time.Now()) | ||
span.StartTime = now | ||
span.EndTime = now | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this but I don't know what the right thing to do is. Addin/modifying data like this just to make serialize happy doesn't sound that great to me. This mean backends will show misleading information to users.
Alternatively, we could just drop such spans as a validation step in the zipkin receiver and log messages but we can't do that as zipkin spec allows for the possibility of not having timestamps. Is there a way we can serialize it differently without adding fake timestamps?
If there is no other viable solution, we should at least add an attribute to the span that represents that the timestamps are fake and shouldn't be trusted. Also, can we may be use more obvious fake value for timestamp like unix epoch or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Zipkin spec on this:
Epoch microseconds of the start of this span, possibly absent if
incomplete.
For example, 1502787600000000 corresponds to 2017-08-15 09:00 UTC
This value should be set directly by instrumentation, using the most
precise value possible. For example, gettimeofday or multiplying epoch
millis by 1000.
There are three known edge-cases where this could be reported absent.
A span was allocated but never started (ex not yet received a timestamp)
The span’s start event was lost
Data about a completed span (ex tags) were sent after the fact
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A span was allocated but never started (ex not yet received a timestamp)
If this was the only edge-case, we could just drop the span as it didn't actually represent a real operation but we have two more scenarios that can result in spans without timestamps so we can't just drop.
It looks like setting a static timestamp that is obviously fake along with a well-known attribute that says the timestamps were injected by Otel collector as a workaround is probably our best bet. Does Jaeger proto allow having empty timestamps? If so, an OTLP>Jaeger/Zipkin translator should probably set the timestamp to nil again when it encounters this tag on a span. One concern with setting a very old timestamp is that it might affect some backends in an unexpected way. For example, some backend might drop the span very soon if it stores only recent data and by span operation timestamps instead of ingestion timestamp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with Jaeger, will need someone else to weigh in on that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like setting a static timestamp that is obviously fake along with a well-known attribute that says the timestamps were injected by Otel collector as a workaround is probably our best bet. Does Jaeger proto allow having empty timestamps? If so, an OTLP>Jaeger/Zipkin translator should probably set the timestamp to nil again when it encounters this tag on a span. One concern with setting a very old timestamp is that it might affect some backends in an unexpected way. For example, some backend might drop the span very soon if it stores only recent data and by span operation timestamps instead of ingestion timestamp.
I agree with this approach. We can set a fake current timestamp along with an attribute identifying this workaround and use it in Jaeger/zipkin exporters to reconstruct null timestamps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like a zipkin convention. We should probably have an otel convention about such tags. May be otel.zipkin.missingTimestamp
or otel.zipkin.missingFields
with value being comma separated list of field names? May be we should have otel collector specific conventions for behavior like this. Perhaps prefix with otelcol
or something?
We can set a fake current timestamp
Why current timestamp? Can we do zero time? Like 0 seconds since unix epoch? I'm afraid if we use current time and the span get exported using something other than zipkin, it might be very misleading to users when seeing the span in a tracing backend. "zero time" will at least raise some eye brows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@owais , 0 timestamps look cleaner, but if I understood your concern right it doesn't look like the best way:
One concern with setting a very old timestamp is that it might affect some backends in an unexpected way. For example, some backend might drop the span very soon if it stores only recent data and by span operation timestamps instead of ingestion timestamp.
IMO still dropping spans with 0 timestamp looks like the best way to handle it in otel-collector. @chris-smith-zocdoc could you please let us know if the spans with no timestamps are actually visible in the backend that you use or they needed for aggregation purposes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO still dropping spans with 0 timestamp looks like the best way to handle it in otel-collector
I disagree. Throwing away data seems suboptimal.
Through coincidence they actually end up as a zero timestamp right now, but after the translation from OC -> internal -> OC for the zipkin export, the timestamp is becoming invalid and causing the serialization issue.
could you please let us know if the spans with no timestamps are actually visible in the backend that you use or they needed for aggregation purposes?
Yes they're visible. We have two backends, one is zipkin the other is honeycomb.
The data in these spans are just events and annotations, sent after the original span. This appears to be a common behavior in the zipkin libraries we're using for javascript and scala.
[
{
"traceId": "404059d49fac24fc",
"id": "404059d49fac24ff",
"annotations": [
{
"timestamp": 1589292979495000,
"value": "zipkin-js.flush"
}
],
"tags": {
"response_size_bytes": "9622"
}
}
]
In zipkin the data is merged into the original span.
In honeycomb these end up as a second event in their system. I've considered building a buffer/merge processor (similar to tail sampling) to do the merge in OTel before sending it to them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that throwing away the spans does not sound correct for people who might be using zipkin end to end (or even some parts of it) since it is a valid zipkin use case. I still have concerns about zero'ed timestamps causing backends to drop spans too soon. May be backends can use the tag we inject to detect this edge-case but not sure we can expect all backends to do this especially since it is not part of the otel spec. I guess adding current timestamp for both start and end is the safest choice here even though it's not very optimal for backends other than zipkin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes there is value in keeping them, @owais covered the cases where this can occur. In my environment I believe I was running into case 3 "Data about a completed span (ex tags) were sent after the fact" |
Please fix linter issues |
e033b2a
to
0da8861
Compare
Codecov Report
@@ Coverage Diff @@
## master #1068 +/- ##
==========================================
- Coverage 86.34% 86.33% -0.01%
==========================================
Files 198 198
Lines 14159 14187 +28
==========================================
+ Hits 12226 12249 +23
- Misses 1477 1481 +4
- Partials 456 457 +1
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Fixes open-telemetry#954 open-telemetry#954 is being caused from the conversion of OC -> Internal -> OC span types. OC technically allows nil times because its a pointer, but [docs state](https://github.com/census-instrumentation/opencensus-proto/blob/master/src/opencensus/proto/trace/v1/trace.proto#L125-L135) ```go // This field is semantically required. When not set on receive - // receiver should set it to the value of end_time field if it was // set. Or to the current time if neither was set. It is important to // keep end_time > start_time for consistency. // // This field is required. StartTime *timestamp.Timestamp ``` So this pr changes the zipkin receiver to set the start/end timestamps if they're not set on receive
Fixes #954
#954 is being caused from the conversion of OC -> Internal -> OC span types.
OC technically allows nil times because its a pointer, but docs state
So this pr changes the zipkin receiver to set the start/end timestamps if they're not set on receive