This repository was archived by the owner on Dec 6, 2024. It is now read-only.

Commit 234ddf0

author
Liudmila
committed
separate score generation from sampling and other review comments
1 parent ed978c7 commit 234ddf0

1 file changed

+51
-53
lines changed

text/trace/0107-sampling-score.md

+51-53
Original file line number | Diff line number | Diff line change
@@ -5,33 +5,38 @@ sampling rates and probability calculation algorithms.
55

66
## TL;DR
77

8-
**Score** is a floating point number associated with
9-
the trace. It's calculated when trace starts and flows in the `tracestate`,
10-
it's used by samplers to make consistent sampling decisions.
8+
**Score** is a floating point number associated with the trace.
9+
It's calculated when trace starts and flows in the `tracestate`.
10+
11+
*Score* is independent of sampling *probability* (aka *rate*) which represents
12+
sampler's configuration, not specific to trace.
13+
14+
Sampler can compare the *score* with the configured *probability* to make
15+
sampling decisions.
1116

1217
Service that starts the trace calculates the score and adds it to the
1318
`tracestate` so downstream services can re-use it to make their sampling
1419
decisions *instead of* re-calculating score as a function of trace-id
15-
(or trace-flags).
16-
17-
*Score* is not related to sampling *rate* (aka *probability* which represents
18-
sampler's configuration not specific to trace).
20+
(or trace-flags). This allows configuring the sampling algorithm on the first
21+
service and avoids coordinating algorithms when multiple tracing tools are
22+
involved.
1923

2024
## Motivation
2125

2226
The goal is to enable a mechanism for consistent (best effort) sampling
2327
between services with different sampling rates and different probability
2428
calculation algorithms (for interoperability with existing tracing tools).
2529

26-
Consistent sampling decision made in each app of a distributed trace is
27-
important for better user experience of trace analysis. Consistency is achieved
28-
by following means:
30+
Today consistency across multiple services is achieved by the following means:
2931

30-
1. Same hashing algorithms used across all apps in a trace.
31-
Coordination of sampling algorithms across multiple apps not always possible:
32-
for example existing components in a system use vendor-specific
33-
tracing tool (pre-OpenTelemetry and update is hard to justify) while there
34-
is a desire to use OpenTelemetry for new components.
32+
1. Same hashing algorithms on trace-id applied on each span.
33+
Problems:
34+
- **same sampling algorithm must be used across multiple apps**: it is
35+
not always possible e.g. when existing components in a system use
36+
vendor-specific tracing tool (pre-OpenTelemetry and major upgrade is hard to
37+
justify) while new components are instrumented with OpenTelemetry.
38+
- **trace-id uniform distribution is not guaranteed** therefore sampling
39+
decisions could be biased
3540

3641
2. Sampling flag propagated from the head component/app is used by downstream
3742
apps to sample in a given trace.
@@ -40,16 +45,22 @@ by following means:
4045

4146
## Explanation
4247

43-
Sampling propabaility is generated by the first service to make sampling
44-
decision. It's a random float number in [0, 1] range.
48+
Sampling score is generated by the first service to make sampling
49+
decision. It's a random float (6-9 digits precision) number in [0, 1] range.
4550
Score is stamped on the span and also propagated further within `tracestate`.
4651

4752
Next service reads score from `tracestate` (instead of calculating it from
4853
trace-id) and compares it with its sampling rate to make sampling decision.
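
The downstream step described above can be sketched as follows. This is a minimal illustration, not part of the proposal: the tracestate key `sampling_score` and the helper names `read_score` and `should_sample` are hypothetical, and treating a score strictly below the rate as "sampled" is a choice made for this sketch.

```python
from typing import Optional

SCORE_KEY = "sampling_score"  # hypothetical tracestate key, not defined by the proposal


def read_score(tracestate: str) -> Optional[float]:
    """Return the score carried in a tracestate header, or None if absent or invalid."""
    for member in tracestate.split(","):
        key, sep, value = member.strip().partition("=")
        if sep and key == SCORE_KEY:
            try:
                score = float(value)
            except ValueError:
                return None
            # Scores outside [0, 1] (including NaN) are treated as missing.
            return score if 0.0 <= score <= 1.0 else None
    return None


def should_sample(score: float, rate: float) -> bool:
    """Sample when the propagated score falls below the locally configured rate."""
    return score < rate
```

With a propagated score of 0.25, a service configured with rate 0.5 samples the trace and a service configured with rate 0.1 does not. Any trace sampled at the lower rate is also sampled at every higher rate, which is the consistency property the score is meant to provide.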
4954

50-
Score is also exposed through span attributes. Vendors can leverage it
55+
Score is exposed through span attributes. Vendors can leverage it
5156
to sort traces based on their completeness: the lower the value of score is,
52-
the higher the chance it was sampled it by each component.
57+
the higher the chance it was sampled in by each component.
58+
59+
Vendors can enable interoperability (in terms of sampling) between legacy
60+
tools and OpenTelemetry: legacy libraries can be updated in non-breaking way to
61+
support external score sampling. Updating current vendor-specific library
62+
version on the existing service in a backward-compatible way is much easier
63+
than upgrading to OpenTelemetry.
5364

5465
### Example
5566

@@ -91,7 +102,8 @@ Vendors can pick the most complete traces sorting them by score.
91102
- Service that starts a trace makes sampling decision. It's configured to use
92103
`ExternalScoreSampler`(name TBD) is configured by user. Within `ShouldSample`
93104
callback sampler
94-
- generates random float score (6-9 digits) in [0, 1] interval
105+
- generates score in [0, 1] interval using `SamplingScoreGenerator` that can run
106+
random or deterministic `hash(trace-id)` algorithm.
95107
- makes sampling decision by comparing generated score to configured rate
96108
- if decision is `RECORD` (or `RECORD_AND_SAMPLED`), sampler adds
97109
`sampling.score` attribute to attributes collection of to-be-created span
@@ -105,17 +117,25 @@ callback sampler
105117
sampling rate
106118
- if span will be recorded: sampler adds `sampling.score` attribute to
107119
attributes collection of to-be-created span
120+
- If downstream service does not find a score in the tracestate, it falls back
121+
to the configured score generation algorithm and updates tracestate and
122+
attributes
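
The flow in the list above, including the fallback when no score is found, might look roughly like this. It is a sketch under assumptions: the tracestate is modeled as a plain dict, the key `sampling_score` is hypothetical, the decision is reduced to a boolean, and `SamplingResult` here is a stand-in type rather than the SDK's.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict

SCORE_KEY = "sampling_score"  # hypothetical tracestate key


@dataclass
class SamplingResult:
    sampled: bool
    attributes: Dict[str, float] = field(default_factory=dict)
    tracestate: Dict[str, str] = field(default_factory=dict)


class ExternalScoreSampler:
    def __init__(self, rate: float,
                 generate_score: Callable[[bytes], float] = lambda trace_id: random.random()):
        self.rate = rate
        self.generate_score = generate_score

    def should_sample(self, trace_id: bytes, tracestate: Dict[str, str]) -> SamplingResult:
        new_state = dict(tracestate)
        raw = tracestate.get(SCORE_KEY)
        try:
            score = float(raw) if raw is not None else None
        except ValueError:
            score = None
        if score is None:
            # No usable score upstream: fall back to the configured generator
            # and record the result so downstream services can reuse it.
            score = self.generate_score(trace_id)
            new_state[SCORE_KEY] = repr(score)  # full precision kept for simplicity
        sampled = score < self.rate
        # The attribute is added only when the span will be recorded.
        attributes = {"sampling.score": score} if sampled else {}
        return SamplingResult(sampled, attributes, new_state)
```

A head service with rate 0.5 that generates score 0.3 samples the trace and writes the score into the tracestate. A downstream service with rate 0.1 reads 0.3 from the tracestate, does not sample, and leaves the tracestate unchanged.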
108123

109124
Here is a [proof of concept](https://github.com/lmolkova/opentelemetry-dotnet/pull/1)
110125
in .NET.
111126

112127
### Specification Delta
113128

114-
1. Add `SamplingResult.Tracestate` field: sampler should be able to assign a
115-
new tracestate for to-be-created span
116-
2. Add convention for `sampling.score` attribute on span (TBC). Check out
129+
1. Add `SamplingResult.Tracestate` field: sampler should be able to [assign a
130+
new tracestate for to-be-created span](https://github.com/open-telemetry/opentelemetry-specification/issues/856)
131+
2. Add convention for `sampling.score` attribute on span (TBD). Check out
117132
[open questions](open-questions) regarding attribute vs special field.
118-
3. Add `ExternalScoreSampler` implementation of `Sampler`
133+
3. Add notion of `SamplingScoreGenerator` that has `TraceIdRatioGenerator`,
134+
`RandomGenerator`, etc implementations.
135+
- Change `TraceIdRatioBased` sampler to use corresponding generator and serve
136+
as generic probability sampler with configurable score generation approach.
137+
4. Add `ExternalScoreSampler` implementation of `Sampler`. It's created with
138+
probability value and implementation of `SamplingScoreGenerator`.
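
One possible shape for the generator abstraction in item 3 is sketched below. The class names come from the list above, but the `generate` method, its signature, and the use of SHA-256 as the `hash(trace-id)` step are assumptions of this sketch; the proposal does not fix a hashing algorithm.

```python
import hashlib
import random
from abc import ABC, abstractmethod
from typing import Optional


class SamplingScoreGenerator(ABC):
    @abstractmethod
    def generate(self, trace_id: bytes) -> float:
        """Return a score in [0, 1] for the given trace."""


class RandomGenerator(SamplingScoreGenerator):
    """Ignores the trace-id and draws a uniformly distributed score."""

    def __init__(self, rng: Optional[random.Random] = None):
        self._rng = rng or random.Random()

    def generate(self, trace_id: bytes) -> float:
        return self._rng.random()


class TraceIdRatioGenerator(SamplingScoreGenerator):
    """Derives the score deterministically from the trace-id."""

    def generate(self, trace_id: bytes) -> float:
        # Assumption: SHA-256 stands in for the unspecified hash(trace-id).
        digest = hashlib.sha256(trace_id).digest()
        return int.from_bytes(digest[:8], "big") / 2**64
```

The deterministic generator returns the same score for the same trace-id on every service, so it still yields consistent decisions when the tracestate is trimmed, provided all services use the same hash. The random generator does not depend on how trace-ids are distributed.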
119139

120140
### Trade-offs and mitigations
121141

@@ -147,37 +167,15 @@ as an implementation-specific hint for sampler to prioritize recording a span.
147167
[OpenTelemetry collector](https://github.com/open-telemetry/opentelemetry-collector/blob/60b03d0d2d503351501291b30836d2126487a741/processor/samplingprocessor/probabilisticsamplerprocessor/testdata/config.yaml#L10)
148168
uses `sampling.priority` to hint collector's sampler decision
149169

150-
To avoid conflicts with existing implementations we fo not reuse priority term.
170+
To avoid conflicts with existing implementations we do not reuse priority term.
151171

152172
## Open questions
153173

154-
### Score calculation: can we use ProbabilitySampler?
155-
156-
This spec suggests to generate score randomly to achieve uniform
157-
distribution.
158-
159-
Assuming trace-ids are uniformly distributed, `ProbabilitySampler` can generate
160-
score, so the flow could look like this:
161-
162-
`ExternalScoreSampler.ShouldSample`:
163-
164-
- checks if `sampling.score` is available in the tracestate
165-
- if it's not there, invokes `ProbabilitySampler`, which calculates score
166-
and populates it on the attributes
167-
- updates tracestate
168-
169-
#### Pros
170-
171-
Fallback to `ProbabilitySampler` improves the case when `tracestate` is trimmed
172-
so there is a chance sampling could be consistent if same probability
173-
calculation algorithm was used.
174-
175-
#### Cons
174+
### Should we separate sampling from score generation?
176175

177-
- There is no requirement for trace-ids to be uniformly distributed
178-
- No clear boundary between `ProbabilitySampler` and `ExternalScoreSampler`.
179-
`ProbabilitySampler` needs to set score in attribute even if there is no
180-
`ExternalScoreSampler`.
176+
Rate-based sampling in this spec is separated from score generation. Sampler can
177+
be configured to use any algorithm on sampling parameters. Different samplers
178+
may reuse generation algorithms.
181179

182180
### Attribute vs field on the span to-be-created
183181

@@ -189,8 +187,8 @@ Creating a new float field on `SamplingDecision` could be an alternative.
189187
It'd also require adding similar property on Span/SpanData.
190188

191189
There are other scenarios when sampling information is useful for
192-
exporter: e.g. sampling rate (or inverse value: count of spans
193-
this span represent). Exporters can use it to estimate metrics.
190+
exporter: e.g. sampling rate (or its inverse value: count of spans
191+
this span represents), exporters can use it to estimate metrics.
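
As a small illustration of the estimate mentioned above: a span sampled at rate p can be taken to represent 1/p spans, so an exporter could scale observed counts accordingly. The function name `estimated_span_count` is hypothetical.

```python
def estimated_span_count(sampled_spans: int, sampling_rate: float) -> float:
    """Estimate the total number of spans from the sampled count and the rate."""
    if not 0.0 < sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be in (0, 1]")
    # Each sampled span stands for 1 / sampling_rate spans.
    return sampled_spans / sampling_rate
```

For example, 25 spans recorded at a sampling rate of 0.25 correspond to an estimated 100 spans in total.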
194192

195193
Populating all sampling information on all spans may be inefficient in terms of
196194
event payload size and storage while being useful for a subset of vendors.
