@@ -5,33 +5,38 @@ sampling rates and probability calculation algorithms.

  ## TL;DR

- **Score** is a floating point number associated with
- the trace. It's calculated when trace starts and flows in the `tracestate`,
- it's used by samplers to make consistent sampling decisions.
+ **Score** is a floating point number associated with the trace.
+ It's calculated when trace starts and flows in the `tracestate`.
+
+ *Score* is independent of sampling *probability* (aka *rate*) which represents
+ sampler's configuration, not specific to trace.
+
+ Sampler can compare the *score* with the configured *probability* to make
+ sampling decisions.

  Service that starts the trace calculates the score and adds it to the
  `tracestate` so downstream services can re-use it to make their sampling
  decisions *instead of* re-calculating score as a function of trace-id
- (or trace-flags).
-
- *Score* is not related to sampling *rate* (aka *probability* which represents
- sampler's configuration not specific to trace).
+ (or trace-flags). This allows configuring the sampling algorithm on the first
+ service and avoids coordination of algorithms when multiple tracing tools are
+ involved.

  ## Motivation

  The goal is to enable a mechanism for consistent (best effort) sampling
  between services with different sampling rates and different probability
  calculation algorithms (for interoperability with existing tracing tools).

- Consistent sampling decision made in each app of a distributed trace is
- important for better user experience of trace analysis. Consistency is achieved
- by following means:
+ Today consistency across multiple services is achieved by the following means:

- 1. Same hashing algorithms used across all apps in a trace.
- Coordination of sampling algorithms across multiple apps not always possible:
- for example existing components in a system use vendor-specific
- tracing tool (pre-OpenTelemetry and update is hard to justify) while there
- is a desire to use OpenTelemetry for new components.
+ 1. Same hashing algorithms on trace-id applied on each span.
+ Problems:
+ - **same sampling algorithm must be used across multiple apps**: it is
+ not always possible e.g. when existing components in a system use
+ vendor-specific tracing tool (pre-OpenTelemetry and major upgrade is hard to
+ justify) while new components are instrumented with OpenTelemetry.
+ - **trace-id uniform distribution is not guaranteed** therefore sampling
+ decisions could be biased

  2. Sampling flag propagated from the head component/app is used by downstream
  apps to sample in a given trace.
@@ -40,16 +45,22 @@ by following means:

  ## Explanation

- Sampling propabaility is generated by the first service to make sampling
- decision. It's a random float number in [0, 1] range.
+ Sampling score is generated by the first service to make sampling
+ decision. It's a random float (6-9 digits precision) number in [0, 1] range.
  Score is stamped on the span and also propagated further within `tracestate`.

  Next service reads score from `tracestate` (instead of calculating it from
  trace-id) and compares it with its sampling rate to make sampling decision.

- Score is also exposed through span attributes. Vendors can leverage it
+ Score is exposed through span attributes. Vendors can leverage it
  to sort traces based on their completeness: the lower the value of score is,
- the higher the chance it was sampled it by each component.
+ the higher the chance it was sampled in by each component.
+
+ Vendors can enable interoperability (in terms of sampling) between legacy
+ tools and OpenTelemetry: legacy libraries can be updated in a non-breaking way to
+ support external score sampling. Updating current vendor-specific library
+ version on the existing service in a backward-compatible way is much easier
+ than upgrading to OpenTelemetry.
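
The score-versus-rate comparison described in this section can be sketched in Python. This is an illustration only, not part of the proposal: the function names, the rounding to 9 digits, the example rates, and the strict `score < rate` comparison are all assumptions. The comparison direction is inferred from the statement that lower scores are sampled in by more components; boundary handling is left open.

```python
import random


def generate_score(rng: random.Random) -> float:
    """Generate a trace score in [0, 1] with up to 9 digits of precision."""
    return round(rng.random(), 9)


def is_sampled(score: float, rate: float) -> bool:
    """Assumed rule: a trace is sampled in when its score is below the rate."""
    return score < rate


def completeness(score: float, rates: list[float]) -> int:
    """Count how many services (each with its own rate) sample this trace in."""
    return sum(is_sampled(score, rate) for rate in rates)


# Three services with different configured sampling rates (hypothetical values).
rates = [0.5, 0.1, 0.25]

# The first service generates the score once; downstream services reuse it.
score = generate_score(random.Random(42))

# A low score is sampled in everywhere, a high score only where rates are high.
print(completeness(0.05, rates))  # 3
print(completeness(0.30, rates))  # 1
print(completeness(0.90, rates))  # 0
```

Because every service compares the same score against its own rate, a trace sampled in by a low-rate service is also sampled in by every service with a higher rate, which is what makes sorting by score a proxy for completeness.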

  ### Example

@@ -91,7 +102,8 @@ Vendors can pick the most complete traces sorting them by score.
  - Service that starts a trace makes sampling decision. It's configured to use
  `ExternalScoreSampler` (name TBD) is configured by user. Within `ShouldSample`
  callback sampler
- - generates random float score (6-9 digits) in [0, 1] interval
+ - generates score in [0, 1] interval using `SamplingScoreGenerator` that can run
+ random or deterministic `hash(trace-id)` algorithm.
  - makes sampling decision by comparing generated score to configured rate
  - if decision is `RECORD` (or `RECORD_AND_SAMPLED`), sampler adds
  `sampling.score` attribute to attributes collection of to-be-created span
@@ -105,17 +117,25 @@ callback sampler
  sampling rate
  - if span will be recorded: sampler adds `sampling.score` attribute to
  attributes collection of to-be-created span
+ - If downstream service does not find a score in the tracestate, it falls back
+ to the configured score generation algorithm and updates tracestate and
+ attributes

  Here is a [proof of concept](https://github.com/lmolkova/opentelemetry-dotnet/pull/1)
  in .NET.
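
The `ShouldSample` flow listed above (reuse the propagated score, fall back to the configured generator, update tracestate and attributes) could look roughly like the following Python sketch. It is independent of the linked .NET proof of concept. The tracestate key `score`, the dictionary representation of tracestate, the `SamplingResult` shape, and the `score < rate` comparison are assumptions made for illustration; the proposal does not fix them here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical tracestate key; the proposal does not name one here.
SCORE_KEY = "score"


@dataclass
class SamplingResult:
    sampled: bool
    tracestate: Dict[str, str]
    attributes: Dict[str, float] = field(default_factory=dict)


def should_sample(
    trace_id: int,
    tracestate: Dict[str, str],
    rate: float,
    generate_score: Callable[[int], float],
) -> SamplingResult:
    """Reuse the upstream score if present, otherwise generate and propagate one."""
    new_state = dict(tracestate)  # do not mutate the caller's tracestate
    raw = tracestate.get(SCORE_KEY)
    if raw is not None:
        score = float(raw)  # a malformed value raises ValueError in this sketch
    else:
        # Fallback: no score was propagated (first service, or tracestate trimmed).
        score = generate_score(trace_id)
        new_state[SCORE_KEY] = repr(score)

    sampled = score < rate  # assumed comparison direction
    # The score attribute is attached only when the span will be recorded.
    attributes = {"sampling.score": score} if sampled else {}
    return SamplingResult(sampled, new_state, attributes)


# Head service (rate 0.5) generates the score; the child (rate 0.1) reuses it.
head = should_sample(0x1234, {}, rate=0.5, generate_score=lambda _tid: 0.2)
child = should_sample(0x1234, head.tracestate, rate=0.1, generate_score=lambda _tid: 0.9)
print(head.sampled, child.sampled)  # True False
```

In this sketch the score is written to the tracestate even when the span is not sampled, so that downstream services with higher rates can still make a consistent decision; whether that is desired is a design choice the text leaves open.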

  ### Specification Delta

- 1. Add `SamplingResult.Tracestate` field: sampler should be able to assign a
- new tracestate for to-be-created span
- 2. Add convention for `sampling.score` attribute on span (TBC). Check out
+ 1. Add `SamplingResult.Tracestate` field: sampler should be able to [assign a
+ new tracestate for to-be-created span](https://github.com/open-telemetry/opentelemetry-specification/issues/856)
+ 2. Add convention for `sampling.score` attribute on span (TBD). Check out
  [open questions](open-questions) regarding attribute vs special field.
- 3. Add `ExternalScoreSampler` implementation of `Sampler`
+ 3. Add notion of `SamplingScoreGenerator` that has `TraceIdRatioGenerator`,
+ `RandomGenerator`, etc. implementations.
+ - Change `TraceIdRatioBased` sampler to use corresponding generator and serve
+ as generic probability sampler with configurable score generation approach.
+ 4. Add `ExternalScoreSampler` implementation of `Sampler`. It's created with
+ probability value and implementation of `SamplingScoreGenerator`.
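
To make the generator abstraction in items 3 and 4 concrete, here is a minimal Python sketch. The names `SamplingScoreGenerator`, `RandomGenerator`, and `TraceIdRatioGenerator` come from the text; the `generate(trace_id)` signature, the use of the low 64 bits of the trace-id as the deterministic score, and the `ProbabilitySampler` helper are assumptions for illustration, not the specified API.

```python
import random
from typing import Optional, Protocol


class SamplingScoreGenerator(Protocol):
    """Assumed interface: map a trace-id to a score in [0, 1]."""

    def generate(self, trace_id: int) -> float: ...


class RandomGenerator:
    """Ignores the trace-id and draws a uniformly distributed random score."""

    def __init__(self, rng: Optional[random.Random] = None) -> None:
        self._rng = rng or random.Random()

    def generate(self, trace_id: int) -> float:
        return self._rng.random()


class TraceIdRatioGenerator:
    """Deterministic score from the low 64 bits of the trace-id (an assumption).

    The result is only uniform if trace-ids themselves are uniformly distributed.
    """

    def generate(self, trace_id: int) -> float:
        return (trace_id & ((1 << 64) - 1)) / (1 << 64)


class ProbabilitySampler:
    """Hypothetical sampler created with a probability and a score generator."""

    def __init__(self, probability: float, generator: SamplingScoreGenerator) -> None:
        self._probability = probability
        self._generator = generator

    def is_sampled(self, trace_id: int) -> bool:
        return self._generator.generate(trace_id) < self._probability


# The same sampler logic works with either generation strategy.
deterministic = ProbabilitySampler(0.5, TraceIdRatioGenerator())
randomized = ProbabilitySampler(1.0, RandomGenerator(random.Random(0)))
print(deterministic.is_sampled(1 << 62))  # True: the score is 0.25
print(deterministic.is_sampled(1 << 63))  # False: the score is 0.5
```

Keeping score generation behind one interface is what lets different samplers share generation algorithms while a rate-based sampler stays a plain comparison.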

  ### Trade-offs and mitigations

@@ -147,37 +167,15 @@ as an implementation-specific hint for sampler to prioritize recording a span.
  [OpenTelemetry collector](https://github.com/open-telemetry/opentelemetry-collector/blob/60b03d0d2d503351501291b30836d2126487a741/processor/samplingprocessor/probabilisticsamplerprocessor/testdata/config.yaml#L10)
  uses `sampling.priority` to hint collector's sampler decision

- To avoid conflicts with existing implementations we fo not reuse priority term.
+ To avoid conflicts with existing implementations we do not reuse priority term.

  ## Open questions

- ### Score calculation: can we use ProbabilitySampler?
-
- This spec suggests to generate score randomly to achieve uniform
- distribution.
-
- Assuming trace-ids are uniformly distributed, `ProbabilitySampler` can generate
- score, so the flow could look like this:
-
- `ExternalScoreSampler.ShouldSample`:
-
- - checks if `sampling.score` is available in the tracestate
- - if it's not there, invokes `ProbabilitySampler`, which calculates score
- and populates it on the attributes
- - updates tracestate
-
- #### Pros
-
- Fallback to `ProbabilitySampler` improves the case when `tracestate` is trimmed
- so there is a chance sampling could be consistent if same probability
- calculation algorithm was used.
-
- #### Cons
+ ### Should we separate sampling from score generation?

- - There is no requirement for trace-ids to be uniformly distributed
- - No clear boundary between `ProbabilitySampler` and `ExternalScoreSampler`.
- `ProbabilitySampler` needs to set score in attribute even if there is no
- `ExternalScoreSampler`.
+ Rate-based sampling in this spec is separated from score generation. Sampler can
+ be configured to use any algorithm on sampling parameters. Different samplers
+ may reuse generation algorithms.

  ### Attribute vs field on the span to-be-created

@@ -189,8 +187,8 @@ Creating a new float field on `SamplingDecision` could be an alternative.
  It'd also require adding similar property on Span/SpanData.

  There are other scenarios when sampling information is useful for
- exporter: e.g. sampling rate (or inverse value: count of spans
- this span represent). Exporters can use it to estimate metrics.
+ exporter: e.g. sampling rate (or its inverse value: count of spans
+ this span represents); exporters can use it to estimate metrics.

  Populating all sampling information on all spans may be inefficient in terms of
  event payload size and storage while being useful for a subset of vendors.