This repository has been archived by the owner on Apr 2, 2024. It is now read-only.

Promscale as the Jaeger certified storage #1621

Closed · 5 tasks done
arajkumar opened this issue Sep 6, 2022 · 17 comments · Fixed by timescale/promscale_extension#520

arajkumar commented Sep 6, 2022

Add integration tests for Promscale in upstream Jaeger so that we can claim Promscale meets all of the expectations.

EDIT

Summary

Storage integration test failures can be broadly classified into the following categories:

Steps to run the integration tests

  1. Run Promscale
  2. Apply the patch https://github.com/jaegertracing/jaeger/compare/main...arajkumar:jaeger:promscale-storage-int-tests.diff to Jaeger
  3. Run the following from the Jaeger repo:
go clean -testcache && (cd examples/memstore-plugin/ && go build .) && STORAGE=grpc-plugin go test ./plugin/storage/integration -tags grpc_storage_integration -run="TestExternalGRPCStorage" -v
arajkumar self-assigned this Sep 6, 2022
arajkumar (Member Author) commented:

With a little bit of tinkering, I could run the upstream Jaeger integration tests:

t.Run("GetServices", s.testGetServices)
t.Run("GetOperations", s.testGetOperations)
t.Run("GetTrace", s.testGetTrace)
t.Run("GetLargeSpans", s.testGetLargeSpan)
t.Run("FindTraces", s.testFindTraces)
t.Run("GetDependencies", s.testGetDependencies)

In the above list, the first 3 pass (with a little modification), but the other tests will need some changes in Promscale.

arajkumar (Member Author) commented:

t.Run("GetLargeSpans", s.testGetLargeSpan)

The above test ingests 10K spans under a single trace ID and fetches them back in a single GetTraces operation.

arajkumar (Member Author) commented:

Here is a test result without GetLargeSpans

tests.log

arajkumar (Member Author) commented:

Test execution summary:

--- FAIL: TestExternalGRPCStorage (212.15s)
    --- PASS: TestExternalGRPCStorage/GetServices (0.21s)
    --- PASS: TestExternalGRPCStorage/GetOperations (0.00s)
    --- PASS: TestExternalGRPCStorage/GetTrace (0.01s)
        --- PASS: TestExternalGRPCStorage/GetTrace/NotFound_error (0.00s)
    --- FAIL: TestExternalGRPCStorage/FindTraces (211.93s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Tags_in_one_spot_-_Tags (10.72s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Tags_in_one_spot_-_Logs (10.70s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Tags_in_one_spot_-_Process (10.64s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Tags_in_different_spots (10.64s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Trace_spans_over_multiple_indices (10.39s)
        --- PASS: TestExternalGRPCStorage/FindTraces/Operation_name (0.01s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Operation_name_+_max_Duration (10.43s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Operation_name_+_Duration_range (10.37s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Duration_range (10.37s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/max_Duration (10.40s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/default (10.40s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Tags_+_Operation_name (10.69s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Tags_+_Operation_name_+_max_Duration (10.74s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Tags_+_Operation_name_+_Duration_range (0.02s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Tags_+_Duration_range (10.81s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Tags_+_max_Duration (10.69s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Multi-spot_Tags_+_Operation_name (10.81s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Multi-spot_Tags_+_Operation_name_+_max_Duration (10.76s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Multi-spot_Tags_+_Operation_name_+_Duration_range (10.70s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Multi-spot_Tags_+_Duration_range (10.70s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Multi-spot_Tags_+_max_Duration (10.56s)
        --- FAIL: TestExternalGRPCStorage/FindTraces/Multiple_Traces (10.39s)
    --- SKIP: TestExternalGRPCStorage/GetDependencies (0.00s)


arajkumar commented Sep 7, 2022

FAIL: TestExternalGRPCStorage/FindTraces/Tags_in_one_spot_-_Tags (10.72s)

The current find-traces query does an INNER JOIN between span & event, which fails to find a trace whose spans don't have any events. We must use a LEFT JOIN.
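
A minimal sketch of the shape of the change (table and column names taken from the _ps_trace schema shown later in this thread; the real find-traces query is much more involved):

-- INNER JOIN: spans without any event disappear from the result.
SELECT s.trace_id, s.span_id, e.name AS event_name
FROM _ps_trace.span s
INNER JOIN _ps_trace.event e
  ON e.trace_id = s.trace_id AND e.span_id = s.span_id;

-- LEFT JOIN: event-less spans are kept, with NULL event columns.
SELECT s.trace_id, s.span_id, e.name AS event_name
FROM _ps_trace.span s
LEFT JOIN _ps_trace.event e
  ON e.trace_id = s.trace_id AND e.span_id = s.span_id;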

EDIT:

Even after the LEFT JOIN fix, this test still fails because of a diff in the JSON output.

[screenshot: diff between the expected and actual JSON span output]

The span has an attribute of type "BINARY", which gets translated into a base64-encoded string [1] while being stored in the DB; when Jaeger queries it back, it is returned as a plain string.

[1] https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/84ec43271330c6e1bfaf5c1483042de04798fd10/pkg/translator/jaeger/jaegerproto_to_traces.go#L263

arajkumar (Member Author) commented:

When I use a LEFT JOIN to merge the event and span tables, the following tests additionally pass, but the problem listed above remains for the other tests:

TestExternalGRPCStorage/FindTraces/Tags_in_different_spots
TestExternalGRPCStorage/FindTraces/Operation_name_+_Duration_range
TestExternalGRPCStorage/FindTraces/Duration_range
TestExternalGRPCStorage/FindTraces/max_Duration


arajkumar commented Sep 7, 2022

A few more are passing after adding a one-second delay in the grpc_test Refresh function. Looks like a bug somewhere in the gRPC plugin implementation (note: tracing.async-acks is not enabled).

--- PASS: TestExternalGRPCStorage/FindTraces/Tags_in_different_spots (0.00s)
--- PASS: TestExternalGRPCStorage/FindTraces/Trace_spans_over_multiple_indices (0.00s)
--- PASS: TestExternalGRPCStorage/FindTraces/Operation_name (0.00s)
--- PASS: TestExternalGRPCStorage/FindTraces/Operation_name_+_max_Duration (0.00s)
--- PASS: TestExternalGRPCStorage/FindTraces/Operation_name_+_Duration_range (0.11s)
--- PASS: TestExternalGRPCStorage/FindTraces/Duration_range (0.11s)
--- PASS: TestExternalGRPCStorage/FindTraces/max_Duration (0.11s)

EDIT:

It uses the streaming span writer, which is async.


arajkumar commented Sep 7, 2022

    --- FAIL: TestExternalGRPCStorage/FindTraces/default (10.40s)

This one fails when the operationName field of a span is empty.

EDIT:

The following error is thrown when operationName is empty:

ERROR: new row for relation "operation" violates check constraint "operation_span_name_check" (SQLSTATE 23514)

EDIT:

The operation table has a check constraint on the span_name column which prevents inserting empty values.

EDIT:

I can think of 2 ways to fix the problem:

  1. OTLP doesn't really mandate a non-empty span name; it is only semantically required, and an empty value is equivalent to an unknown span name. Hence we can drop the constraint on the DB side (see the sketch after this list).
  2. In Promscale, substitute a default value when the span name is empty and strip it again from the response.
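
A rough sketch of what option 1 could look like on the DB side (the schema, table and constraint definition are inferred from the error message above and may differ from what the Promscale extension actually ships):

-- Option 1 (sketch): drop the empty-name check so OTLP semantics apply,
-- or replace it with a weaker not-null check.
-- Schema/table/constraint names assumed from the SQLSTATE 23514 error above.
ALTER TABLE _ps_trace.operation
  DROP CONSTRAINT operation_span_name_check;

ALTER TABLE _ps_trace.operation
  ADD CONSTRAINT operation_span_name_check CHECK (span_name IS NOT NULL);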

cevian mentioned this issue Sep 7, 2022

arajkumar commented Sep 8, 2022

--- FAIL: TestExternalGRPCStorage/FindTraces/Multi-spot_Tags_+_Operation_name (10.81s)
--- FAIL: TestExternalGRPCStorage/FindTraces/Multi-spot_Tags_+_Operation_name_+_max_Duration (10.76s)
--- FAIL: TestExternalGRPCStorage/FindTraces/Multi-spot_Tags_+_Operation_name_+_Duration_range (10.70s)
--- FAIL: TestExternalGRPCStorage/FindTraces/Multi-spot_Tags_+_Duration_range (10.70s)
--- FAIL: TestExternalGRPCStorage/FindTraces/Multi-spot_Tags_+_max_Duration (10.56s)

[attached: trace fixture and the generated query]

This test issues a tag query across span, resource and event. It fails when a span has 2 events with 2 distinct tags and both of them are queried at the same time.

e.g.

trace_id  span_id  event_name  tag
0001      0007     event1      foo=foo
0001      0007     event2      bar=bar

When you query foo=foo bar=bar, Promscale generates the following WHERE clause, which returns an empty result because no single event row can satisfy both conditions at once:

e.tags @> ANY($3::jsonb[]) AND e.tags @> ANY($5::jsonb[])
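
One possible shape of a fix is to evaluate each tag predicate against the span's events independently, e.g. with one EXISTS subquery per tag, so that different events of the same span can satisfy different tags. This is only a sketch, not the actual Promscale query; the real tags columns store key/value ids rather than the literal jsonb used here for readability:

-- Each tag may be satisfied by a *different* event of the same span.
SELECT s.trace_id, s.span_id
FROM _ps_trace.span s
WHERE EXISTS (SELECT 1 FROM _ps_trace.event e
              WHERE e.trace_id = s.trace_id AND e.span_id = s.span_id
                AND e.tags @> '{"foo": "foo"}'::jsonb)   -- satisfied by event1
  AND EXISTS (SELECT 1 FROM _ps_trace.event e
              WHERE e.trace_id = s.trace_id AND e.span_id = s.span_id
                AND e.tags @> '{"bar": "bar"}'::jsonb);  -- satisfied by event2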

It also depends on fix from #1621 (comment)


alejandrodnm commented Sep 13, 2022

The problem with GetLargeSpans is with the array_agg.

The same span is sent 10008 times in the test.

What this means is that we end up with 10008 entries in _ps_trace.span that are identical:

promscale=# select * from _ps_trace.span;
─[ RECORD 1 ]───────────────┬─────────────────────────────────────
trace_id                    │ 00000000-0000-0000-0000-000000000011
span_id                     │ 3
parent_span_id              │ ∅
operation_id                │ 1
start_time                  │ 2022-09-12 19:03:11.639875+02
end_time                    │ 2022-09-12 19:03:11.639975+02
duration_ms                 │ 0.1
instrumentation_lib_id      │ ∅
resource_schema_url_id      │ ∅
event_time                  │ empty
dropped_tags_count          │ 0
dropped_events_count        │ 0
dropped_link_count          │ 0
resource_dropped_tags_count │ 0
status_code                 │ unset
trace_state                 │ ∅
span_tags                   │ {}
status_message              │
resource_tags               │ {"1": 1}
═[ RECORD 2 ]═══════════════╪═════════════════════════════════════
trace_id                    │ 00000000-0000-0000-0000-000000000011
span_id                     │ 3
parent_span_id              │ ∅
operation_id                │ 1
start_time                  │ 2022-09-12 19:03:22.639875+02
end_time                    │ 2022-09-12 19:03:22.639975+02
duration_ms                 │ 0.1
instrumentation_lib_id      │ ∅
resource_schema_url_id      │ ∅
event_time                  │ empty
dropped_tags_count          │ 0
dropped_events_count        │ 0
dropped_link_count          │ 0
resource_dropped_tags_count │ 0
status_code                 │ unset
trace_state                 │ ∅
span_tags                   │ {}
status_message              │
resource_tags               │ {"1": 1}

If you look at the span fixture, it has 2 entries in logs that are empty except for the timestamp; the translation between Jaeger and OTEL stores these log entries as events. Looking at the _ps_trace.event table, we get the same 2 events duplicated 10008 times each (the same number of spans that were sent).

promscale=# select count(*) from _ps_trace.event;
 count
═══════
 20016

promscale=# select * from _ps_trace.event limit 4;
             time              │               trace_id               │ span_id │ event_nbr │ dropped_tags_count │ name │ tags
═══════════════════════════════╪══════════════════════════════════════╪═════════╪═══════════╪════════════════════╪══════╪══════
 2022-09-12 18:46:31.639875+02 │ 00000000-0000-0000-0000-000000000011 │       3 │         0 │                  0 │      │ {}
 2022-09-12 18:46:31.639875+02 │ 00000000-0000-0000-0000-000000000011 │       3 │         1 │                  0 │      │ {}
 2022-09-12 18:46:31.639875+02 │ 00000000-0000-0000-0000-000000000011 │       3 │         0 │                  0 │      │ {}
 2022-09-12 18:46:31.639875+02 │ 00000000-0000-0000-0000-000000000011 │       3 │         1 │                  0 │      │ {}

When we try to get the trace, we use array_agg to collect the events. Since all the span rows share the same (trace_id, span_id), the query aggregates all 20016 events onto every returned row. A simplified sketch of the offending aggregation follows the record below.

─[ RECORD 1 ]────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
trace_id                 │ 00000000-0000-0000-0000-000000000011
span_id                  │ 3
parent_span_id           │ ∅
start_times              │ 2022-09-12 18:46:31.639875+02
end_times                │ 2022-09-12 18:46:31.639975+02
span_kind                │ unspecified
dropped_tags_counts      │ 0
dropped_events_counts    │ 0
dropped_link_counts      │ 0
status_code_string       │ unset
status_message           │
trace_states             │ ∅
schema_urls              │ ∅
span_names               │ example-operation-1
resource_tags            │ {"service.name": "example-service-1"}
span_tags                │ ∅
event_names              │ {"","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","",""
event_times              │ {"2022-09-12 18:46:31.639875+02","2022-09-12 18:46:31.639875+02","2022-09-12 18:46:31.639875+02","2022-09-12 18:46:31.639875+02","2022-09-12 18:46:31.639875+02","2022-09-12 18:46:31.639875+02","2022-09-12 18:46:31
event_dropped_tags_count │ {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
event_tags               │ {NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NU
library_name             │ ∅
library_version          │ ∅
library_schema_url       │ ∅
links_linked_trace_ids   │ ∅
links_linked_span_ids    │ ∅
links_trace_states       │ ∅
links_dropped_tags_count │ ∅
links_tags               │ ∅
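
The following is only a sketch of the shape of the problem, not the actual Promscale query (the column list is trimmed and the trace id is taken from the fixture above): every duplicate span row joins against every event row that shares its (trace_id, span_id), so the aggregate collects events from all duplicates instead of just the span's own 2.

SELECT s.trace_id, s.span_id, s.start_time,
       array_agg(e.name ORDER BY e.event_nbr) AS event_names
FROM _ps_trace.span s
LEFT JOIN _ps_trace.event e
  ON e.trace_id = s.trace_id AND e.span_id = s.span_id
WHERE s.trace_id = '00000000-0000-0000-0000-000000000011'
GROUP BY s.trace_id, s.span_id, s.start_time;
-- Each returned row aggregates events belonging to every duplicate span,
-- not just the 2 events of "its own" span.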

alejandrodnm (Contributor) commented:

Related to GetLargeSpans

If I give each span in the test a unique ID, the test passes. Maybe we can contribute this change to the test. We can use the following arguments to argue that the current test is not very representative of a real scenario:

The OpenTelemetry spec defines the SpanID:

SpanId is the identifier for a span. It is globally unique with practically sufficient probability by being made as 8 randomly generated bytes. When passed to a child Span this identifier becomes the parent span id for the child Span

The Jaeger Thrift model states:

  3:  required i64           spanId       # unique span id (only unique within a given trace)

Also, Jaeger has a deduper for non-unique span IDs, because, as the documentation says:

Jaeger UI expects all spans to have unique IDs.

I guess the other option is handling duplicates ourselves.
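
If we did handle duplicates ourselves, a read-time dedup could look roughly like this (a sketch only: the column list is abbreviated and the dedup key is an assumption):

-- Keep a single representative row per logical span, similar in spirit to
-- Jaeger's dedupe of non-unique span IDs.
SELECT DISTINCT ON (s.trace_id, s.span_id, s.start_time)
       s.trace_id, s.span_id, s.start_time, s.end_time, s.duration_ms
FROM _ps_trace.span s
ORDER BY s.trace_id, s.span_id, s.start_time;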

alejandrodnm (Contributor) commented:

GetLargeSpans fixed by jaegertracing/jaeger#3912


alejandrodnm commented Sep 20, 2022

The problems with TestExternalGRPCStorage/FindTraces/Tags_+_Operation_name are:

Spans in the fixture have different trace_ids:

This test sends a span (a) that has 3 references to 2 other spans (b and c):

  • a CHILD_OF b
  • a CHILD_OF c
  • a FOLLOWS_FROM c

https://github.com/jaegertracing/jaeger/blob/cac21f82950545b613f45d3984d96e1498bfd15d/plugin/storage/integration/fixtures/traces/tags_opname_trace.json#L4-L23

The problem is the child relationship combined with the fact that each referenced span has a different traceId. This breaks the parent relationship: when the translator tries to resolve the parent span it cannot find a match, since all of the spans belong to different traces, and the ParentSpanID method makes the following assertion: ref.TraceID == s.TraceID && ref.RefType == ChildOf:

https://github.com/jaegertracing/jaeger/blob/cac21f82950545b613f45d3984d96e1498bfd15d/model/span.go#L99-L109

I think for this point we can propose a fix to the test, like we did for GetLargeSpans: propose a more realistic scenario.

OTEL doesn't have a concept of refType

The OTEL model doesn't define a span reference type (in OTEL's case it would be a link). When translating from OTEL to Jaeger, the parent span reference is assigned the CHILD_OF refType, and all other links are assigned FOLLOWS_FROM:

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/7b5eb5280c15a5cc195d68757ba4c24c2c7f8eba/pkg/translator/jaeger/traces_to_jaegerproto.go#L260-L281

They explicitly say:

Since Jaeger RefType is not captured in internal data, use SpanRefType_FOLLOWS_FROM by default.
SpanRefType_CHILD_OF supposed to be set only from parentSpanID.

This is a problem for the test because the translator sets only one CHILD_OF, while the test expects 2 in the result.

The expected result based on the fixture is:

                "references": [
                    {
                        "trace_id": "AAAAAAAAAAAAAAAAAAAAEg==",
                        "span_id": "AAAAAAAAAP8="
                    },
                    {
                        "trace_id": "AAAAAAAAAAAAAAAAAAAAEg==",
                        "span_id": "AAAAAAAAAAI="
                    },
                    {
                        "trace_id": "AAAAAAAAAAAAAAAAAAAAAQ==",
                        "span_id": "AAAAAAAAAAI=",
                        "ref_type": 1
                    }
                ],

The first 2 are ref_type=0 which is CHILD_OF.

The actual result we get, because the translator sets only one CHILD_OF, is:

                "references": [
                    {
                        "trace_id": "AAAAAAAAAAAAAAAAAAAAEg==",
                        "span_id": "AAAAAAAAAP8="
                    },
                    {
                        "trace_id": "AAAAAAAAAAAAAAAAAAAAEg==",
                        "span_id": "AAAAAAAAAAI=",
                        "ref_type": 1
                    },
                    {
                        "trace_id": "AAAAAAAAAAAAAAAAAAAAAQ==",
                        "span_id": "AAAAAAAAAAI=",
                        "ref_type": 1
                    }
                ],

As you can see, since the translator takes only the first CHILD_OF as the span's parent ID, all of the other relationships become FOLLOWS_FROM.

Again, we can ask for a more realistic test which sets only one parent for the span.

Aggregate for links is not returning data

The third problem is that the query doesn't seem to be returning all the links properly. The aggregate function's FILTER clause doesn't seem to be working correctly.

We are doing FILTER(WHERE lk IS NOT NULL), but that seems to be filtering out everything. In Postgres, IS NOT NULL applied to a whole row value is true only when every field of the row is non-null, so any link row with at least one NULL column gets filtered out; the filter should reference an explicit column instead (see the small demo after the query output below).

SELECT
    s.trace_id,
    s.span_id,
    array_agg(lk.linked_trace_id ORDER BY lk.link_nbr) FILTER(WHERE lk.trace_id IS NOT NULL)  links_lkTraceId_not_null,
    array_agg(lk.linked_trace_id ORDER BY lk.link_nbr) FILTER(WHERE lk IS NOT NULL) links_lk_not_null
  FROM
    _ps_trace.span s
  LEFT JOIN
    _ps_trace.link lk ON lk.trace_id = s.trace_id AND lk.span_id = s.span_id
  GROUP BY
    s.trace_id,
    s.span_id;

               trace_id               │ span_id │        links_lktraceid_not_null        │ links_lk_not_null
══════════════════════════════════════╪═════════╪════════════════════════════════════════╪═══════════════════
 00000000-0000-0000-0000-000000000001 │       4 │ {00000000-0000-0000-0000-000000000001} │ ∅
(1 row)

Time: 101.343 ms
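
For reference, Postgres evaluates IS NOT NULL on a row value field-by-field: the test is true only when every column is non-null, so any link row with at least one NULL column (for example a NULL trace_state) is dropped by FILTER(WHERE lk IS NOT NULL). A trivial demonstration:

-- Row-level NULL tests require every field to match.
SELECT (ROW(1, NULL) IS NOT NULL) AS all_fields_non_null, -- false
       (ROW(1, NULL) IS NULL)     AS all_fields_null,     -- false
       (ROW(1, 2)    IS NOT NULL) AS fully_populated;     -- true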

arajkumar (Member Author) commented:

GetLargeSpans fixed by jaegertracing/jaeger#3912

The GetLargeSpans test still fails when it is run together with all the other tests, specifically GetTrace. Here is a test summary:

/plugin/storage/integration/integration_test.go:133
                Error:          Not equal:
                                expected: 10008
                                actual  : 10012
                Test:           TestExternalGRPCStorage/GetLargeSpans

The actual problem is that both GetLargeSpans and GetTrace ingest test traces with the same trace ID and span ID, so we end up getting more than the expected number of results.
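
A quick sanity check against the integration database makes the overlap visible (a sketch; run after both tests have ingested their fixtures):

-- Rows stored per trace id; the trace id shared by GetTrace and GetLargeSpans
-- accumulates more rows than either test ingested on its own.
SELECT trace_id, count(*) AS spans
FROM _ps_trace.span
GROUP BY trace_id
ORDER BY spans DESC;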


arajkumar (Member Author) commented:

All tests are passing now.
