
Add OpenTelemetry performance benchmark spec #748

Merged Nov 10, 2020 (19 commits)
Changes from 1 commit
Commit: Add OpenTelemetry perf bench draft
ThomsonTan committed Jul 28, 2020

Verified: this commit was created on GitHub.com and signed with GitHub's verified signature. The key has expired.
commit 4dc4c41111f3e268a5e6e8c8456b00af55334a6a
29 changes: 29 additions & 0 deletions specification/performance-benchmark.md
@@ -0,0 +1,29 @@
# Performance Benchmark of OpenTelemetry API

This document describes common performance benchmark requirements for OpenTelemetry API implementations in language libraries.

**Status:** Draft

## Events Throughput

### Number of events that can be produced per second

An application using the C/C++/Java library should be able to produce 10K events per second without dropping any event.
Member: Why the special mention of C/C++/Java?

Member: This is a bit vague. Events could be dropped by the exporter based on that exporter's specific needs/design. Is this about all exporters, or only the OTLP exporter?

Member: Also, please clarify what constitutes one event. Is it a span? A span with 10 attributes? 50 attributes? 0 attributes? Performance can vary depending on the implementation, and unless it is explicitly documented here, everyone will end up with their own choice of event.
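As a rough illustration of how such a throughput number might be measured, here is a minimal sketch. The function names and the attribute-count parameter are hypothetical (addressing the reviewer's point that "one event" must be pinned down); `emit_event` is a stand-in and a real benchmark would create a span through the language's OpenTelemetry API instead.

```python
import time

def emit_event(attribute_count: int) -> dict:
    # Stand-in for producing one tracing event. In a real benchmark this
    # would start and end a span with `attribute_count` attributes.
    return {"name": "benchmark-span",
            "attributes": {f"attr{i}": i for i in range(attribute_count)}}

def measure_throughput(duration_s: float = 1.0, attribute_count: int = 10) -> float:
    # Count how many events can be produced in `duration_s` seconds of
    # wall-clock time and report the rate per second.
    produced = 0
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        emit_event(attribute_count)
        produced += 1
    return produced / duration_s
```

Parameterizing the attribute count is one way to make the "what is one event" question explicit in the benchmark itself, so results with 0, 10, or 50 attributes can be reported separately.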


### Number of outstanding events in local cache

If the remote endpoint is unavailable to the OpenTelemetry exporter, the library should cache up to 1M events before dropping them.
Member: Does this need to be a hard-coded number? I would expect this to be a configuration option with some reasonable default.

Member: This does not seem like a performance benchmark, but a requirement of the SDK?

Contributor Author: @tigrannajaryan I'll remove the hard-coded numbers, which could be hard to apply across languages and to sustain over time.

@dyladan this spec is supposed to recommend that SDKs implement some common performance benchmarks (like the basic metrics listed here) whose results users can easily obtain by running the benchmark program locally.

Member: There is no mention of a local cache in any specs so far. It's up to individual exporters to deal with local storage, if any. Or is this specifically about the OTLP exporter?
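To make the caching requirement concrete, here is a minimal sketch of a bounded event cache. The class and its behavior are hypothetical: the spec leaves the actual buffering strategy to each SDK or exporter, and the capacity is a constructor parameter rather than a hard-coded 1M, in line with the review feedback above.

```python
from collections import deque

class BoundedEventCache:
    # Illustrative bounded cache for events that cannot be exported yet.
    # When the cache is full, new events are dropped and counted.

    def __init__(self, capacity: int = 1_000_000):
        self._events = deque()
        self._capacity = capacity
        self.dropped = 0

    def add(self, event) -> bool:
        # Returns True if the event was cached, False if it was dropped.
        if len(self._events) >= self._capacity:
            self.dropped += 1
            return False
        self._events.append(event)
        return True

    def drain(self):
        # Hand all cached events back (e.g. to the exporter) once the
        # remote endpoint becomes reachable again.
        while self._events:
            yield self._events.popleft()
```

Whether to drop the newest event (as above) or the oldest is itself a design choice an SDK would need to document.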


## Instrumentation Cost

### CPU time

Under extreme workload, the library should not take more than 5% of total CPU time for event tracing, nor more than one full logical core on a multi-core system.
Member: To guarantee this may require the library to monitor its own CPU usage and begin throttling, sampling, or otherwise limiting its own resource consumption. Otherwise it is pretty easy to write an application that does nothing but make OpenTelemetry calls, and the library will consume close to 100% of CPU.
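One way such a CPU budget could be measured, sketched below with stdlib timers only. The harness is hypothetical; a real benchmark would compare a traced run against an untraced baseline to isolate the tracing overhead, rather than measuring the workload alone as this sketch does.

```python
import time

def cpu_fraction(workload, duration_s: float = 0.5) -> float:
    # Estimate the fraction of wall-clock time this process spends on
    # CPU while running `workload` repeatedly. process_time() counts
    # CPU time across all threads of the process.
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    while time.perf_counter() - wall_start < duration_s:
        workload()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return cpu / wall
```

A single-threaded busy loop will report a fraction near 1.0; the spec's 5% target would correspond to the traced-minus-baseline difference staying below 0.05.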


### Memory consumption

Under extreme workload, the peak working set increase caused by tracing should not exceed 100MB.
Member: Similarly to the number of items, I would expect this to be configurable if it is a limit honored by the library. However, it is not clear whether we want to put a limitation on total memory usage. Unlike the number of events, total memory usage by tracing is much more difficult to calculate accurately.

It is also not clear how the limit on the number of events and the limit on memory usage interact. Should the library begin dropping when either of these limits is hit? If that is the intent, it would be useful to call it out explicitly. This likely belongs in a functional specification of the libraries, not necessarily in the performance spec this PR describes.

I think it would be better to have a separate document for functional requirements for libraries in the specification and refer to it as necessary from the performance specification document.

Contributor: Agree that providing numbers does not work as a cross-language requirement; different languages have far too different performance characteristics. How about providing guidelines on which metrics should be looked at, without targets? We could also add tips and tricks for high-performance tracing. For example (this is my opinion, but if others agree): observability is an area that deserves possibly gross micro-optimizations, since users don't want to pay a cost for observability; they want their money to go into serving their users. This file seems like a good avenue for providing observability best practices.
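For the memory side, here is a minimal sketch of measuring peak allocation during a workload with the stdlib `tracemalloc` module. Note the reviewer's caveat applies: this captures Python heap allocations only, not the full working set the spec's 100MB figure refers to; a real benchmark would also sample the process RSS before and after enabling tracing.

```python
import tracemalloc

def peak_allocation_mb(workload) -> float:
    # Measure peak Python heap allocation (in MB) while `workload` runs.
    # tracemalloc tracks allocations made through Python's allocator only.
    tracemalloc.start()
    try:
        workload()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / (1024 * 1024)
```

Running the traced and untraced variants of the same workload through this function and subtracting the peaks gives one rough estimate of the memory cost attributable to tracing.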


## Benchmark Report

Each implementation language library should publish the typical results of the above benchmarks on its release page.
Member (@cijothomas, Jul 30, 2020): If this can describe the type of application, along with the spans expected from it, then every language can implement the same type of application. Otherwise there will be no consistency.
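If consistency across languages is the goal, a machine-readable report could help. The schema below is purely illustrative (the spec mandates no format), and the values are simply the targets named in this draft, not measured results.

```python
import json

# Hypothetical benchmark-report schema; field names are illustrative only.
# Values echo the draft spec's targets, not real measurements.
report = {
    "library": "opentelemetry-example",   # placeholder library name
    "version": "0.0.0",                   # placeholder version
    "benchmarks": {
        "events_per_second": 10_000,
        "cached_events_before_drop": 1_000_000,
        "cpu_fraction_under_load": 0.05,
        "peak_tracing_memory_mb": 100,
    },
}

print(json.dumps(report, indent=2))
```

Publishing such a file alongside each release would let results be compared mechanically across versions and languages.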
