Enhance AWS SDK Instrumentation with Detailed HTTP Error Information #9448

pxaws · 2023-09-12T23:15:10Z

Reason:

The existing AWS SDK instrumentation generates one span per SDK call. That gives a general overview, but some scenarios need more detail about the individual HTTP requests behind a call, particularly their error codes and messages. Adding this information to the span helps with debugging and monitoring. Consider this scenario: we identify a specific SDK span with an unusually long duration and want to understand the cause. At present, the SDK span lacks the information needed for this analysis. However, by logging the error messages for each HTTP retry, we can deduce that the extended duration might be due to multiple retries, which are a result of backend throttling.

Implementation:

To achieve this, the pull request introduces the following changes:

The error code and error message of each failed HTTP request are added as events within the SDK span.
The afterTransmission hook, part of the AWS SDK's ExecutionInterceptor, is leveraged to achieve this enhancement.
After receiving an HTTP response, the afterTransmission hook is triggered. Within this hook, the HTTP response is inspected, and the error code and message (if any) are extracted.
These details are then packaged into an event, timestamped with the current time, and added to the SDK span.
We've introduced an experimental flag named experimental-record-individual-http-error, which defaults to false, to limit any potential impact. The described behaviors are only activated when this flag is set to true.
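
The flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: `HttpErrorEventRecorder` and `SpanEvent` are hypothetical stand-ins for the interceptor and for OpenTelemetry span events, the event name is made up, and treating any non-2xx status as a failure is an assumption about what `isSuccessful()` means.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the interceptor logic. The real instrumentation
// works with the AWS SDK's SdkHttpResponse and an OpenTelemetry Span instead.
final class HttpErrorEventRecorder {

  /** One recorded event: a name, a timestamp, and the two error attributes. */
  static final class SpanEvent {
    final String name;
    final Instant timestamp;
    final int statusCode;
    final String errorMessage;

    SpanEvent(String name, Instant timestamp, int statusCode, String errorMessage) {
      this.name = name;
      this.timestamp = timestamp;
      this.statusCode = statusCode;
      this.errorMessage = errorMessage;
    }
  }

  // Mirrors the experimental flag from the description; nothing is recorded unless it is true.
  private final boolean recordIndividualHttpError;
  private final List<SpanEvent> events = new ArrayList<>();

  HttpErrorEventRecorder(boolean recordIndividualHttpError) {
    this.recordIndividualHttpError = recordIndividualHttpError;
  }

  /** Called once per HTTP attempt, in the position of the afterTransmission hook. */
  void afterTransmission(int statusCode, String responseBody) {
    if (!recordIndividualHttpError) {
      return; // flag off: behave exactly as before
    }
    boolean successful = statusCode >= 200 && statusCode < 300; // assumed meaning of "successful"
    if (successful) {
      return;
    }
    String message = responseBody == null ? "" : responseBody;
    // "http.request.failure" is an illustrative event name, not the one used in the PR.
    events.add(new SpanEvent("http.request.failure", Instant.now(), statusCode, message));
  }

  List<SpanEvent> events() {
    return Collections.unmodifiableList(events);
  }

  public static void main(String[] args) {
    HttpErrorEventRecorder recorder = new HttpErrorEventRecorder(true);
    recorder.afterTransmission(429, "Rate exceeded"); // first attempt, throttled
    recorder.afterTransmission(429, "Rate exceeded"); // retry, throttled again
    recorder.afterTransmission(200, "{}");            // final attempt succeeds
    for (SpanEvent event : recorder.events()) {
      System.out.println(event.timestamp + " " + event.statusCode + " " + event.errorMessage);
    }
  }
}
```

With the flag enabled, each failed attempt leaves one timestamped event on the span, so a long SDK span with several such events points at retries rather than a single slow request.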

trask · 2023-09-20T18:59:11Z

cc @rapphil @wangzlei @srprash

.../src/main/java/io/opentelemetry/instrumentation/awssdk/v2_2/TracingExecutionInterceptor.java

.../groovy/io/opentelemetry/instrumentation/awssdk/v2_2/Aws2ClientNotRecordHttpErrorTest.groovy

rapphil · 2023-09-20T22:12:16Z

.../src/main/java/io/opentelemetry/instrumentation/awssdk/v2_2/TracingExecutionInterceptor.java

@@ -49,6 +57,11 @@ final class TracingExecutionInterceptor implements ExecutionInterceptor {
  private final Instrumenter<ExecutionAttributes, SdkHttpResponse> requestInstrumenter;
  private final Instrumenter<ExecutionAttributes, SdkHttpResponse> consumerInstrumenter;
  private final boolean captureExperimentalSpanAttributes;
+  private static final Logger logger = Logger.getLogger(PluginImplUtil.class.getName());
+
+  static final AttributeKey<String> HTTP_ERROR_MSG =


Any plans to make this attribute part of the semantic conventions for the AWS SDK?

https://github.com/open-telemetry/semantic-conventions/blob/203691d99612452df0c951640b04521e34969628/docs/cloud-providers/aws-sdk.md?plain=1#L2

Yes, it is in the long-term plan. Actually, I am not quite sure which part of the semantic conventions it belongs in, because this is an event attribute, not a span attribute (the link you gave seems to cover the conventions for span attributes). Any suggestions?

rapphil · 2023-09-20T22:23:33Z

by logging the error messages for each HTTP retry, we can deduce that the extended duration might be due to multiple retries, which are a result of backend throttling.

is the http status code not enough for that purpose? 429 is the status code for throttling.

Are there other examples where the error message returned by the backend might be useful?

pxaws · 2023-09-21T23:38:49Z

is the http status code not enough for that purpose? 429 is the status code for throttling.

Recording only the error code is not ideal, because users then need to look up what the code means. I can't assume that everyone knows 429 means throttling :) In addition, a single error code can correspond to multiple kinds of error.

Are there other examples where the error message returned by the backend might be useful?

Yes. I would say all error messages from the backend are useful, because they give more detail about the backend and sometimes indicate a client-side issue. For example, the following one from the DynamoDB public documentation: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html#Programming.Errors.MessagesAndCodes

LimitExceededException
Message: Too many operations for a given subscriber.

There are too many concurrent control plane operations. The cumulative number of tables and indexes in the CREATING, DELETING, or UPDATING state cannot exceed 500.

OK to retry? Yes

rapphil · 2023-09-26T16:48:16Z

@trask this LGTM and is ready for maintainers to review/merge.

kenfinnigan · 2023-09-26T18:12:49Z

.../src/main/java/io/opentelemetry/instrumentation/awssdk/v2_2/TracingExecutionInterceptor.java

+      Span span = Span.fromContext(otelContext);
+      SdkHttpResponse response = context.httpResponse();
+
+      if (!response.isSuccessful()) {


Can response ever be null on the context? Should a null check be added?

Makes sense. I will add a null check on the response.

kenfinnigan · 2023-09-26T18:15:08Z

.../src/main/java/io/opentelemetry/instrumentation/awssdk/v2_2/TracingExecutionInterceptor.java

+        if (responseBody.isPresent()) {
+          String errorMsg =
+              new BufferedReader(
+                      new InputStreamReader(responseBody.get(), Charset.defaultCharset()))


Is the underlying InputStream still available for use by the SDK after calling responseBody.get()?

Thank you for catching this!

I think there is no guarantee that the InputStream supports mark()/reset() or seeking. So if the same InputStream is read a second time, it will not return anything.

To verify, I did a simple experiment. Based on the pipeline stage in the AWS Java SDK: https://github.com/aws/aws-sdk-java-v2/blob/d020d37138eee9d4d74e814086143d26d923fee0/core/sdk-core/src/main/java/software/amazon/awssdk/core/internal/http/pipeline/stages/AfterTransmissionExecutionInterceptorsStage.java#L43-L46, it seems that the modifyHttpResponse call comes after the afterTransmission call. So I read the same InputStream in the modifyHttpResponse hook of an ExecutionInterceptor and confirmed that it is already empty (after being read in afterTransmission).

It's a valid concern that the InputStream for the response body becomes useless after we read it in the afterTransmission hook. Fortunately, there is another hook, Optional<InputStream> modifyHttpResponseContent, which allows us to modify the content of a response. We can copy the content out and generate a new InputStream from it. A similar approach has already been implemented in the AWS Java SDK: https://github.com/aws/aws-sdk-java-v2/blob/d020d37138eee9d4d74e814086143d26d923fee0/core/sdk-core/src/main/java/software/amazon/awssdk/core/internal/interceptor/HttpChecksumValidationInterceptor.java#L55-L73

I will make the change in next commit.
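
A minimal sketch of the copy-and-replace idea, using only the JDK so it can be run on its own. `ResponseBodyCopy` and `Copied` are hypothetical names, and decoding as UTF-8 is an assumption (the snippet under review used `Charset.defaultCharset()`). In an interceptor, the `replacement` stream is what a `modifyHttpResponseContent` implementation would hand back wrapped in an `Optional`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a response body once and hands back a fresh stream
// over the same bytes so that later readers still see the full content.
final class ResponseBodyCopy {

  /** The decoded error message plus a replacement stream for downstream readers. */
  static final class Copied {
    final String errorMessage;
    final InputStream replacement;

    Copied(String errorMessage, InputStream replacement) {
      this.errorMessage = errorMessage;
      this.replacement = replacement;
    }
  }

  /** Drains the stream into a byte array; after this call the original stream is exhausted. */
  static byte[] readAll(InputStream in) {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buffer = new byte[4096];
      int n;
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
      }
      return out.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads the body for inspection and builds a new stream from the copied bytes. */
  static Copied copyForInspection(InputStream original) {
    byte[] bytes = readAll(original);
    String message = new String(bytes, StandardCharsets.UTF_8); // charset is an assumption
    return new Copied(message, new ByteArrayInputStream(bytes));
  }

  public static void main(String[] args) {
    byte[] body = "Too many operations for a given subscriber.".getBytes(StandardCharsets.UTF_8);
    ByteArrayInputStream original = new ByteArrayInputStream(body);

    Copied copied = copyForInspection(original);
    System.out.println("recorded message: " + copied.errorMessage);
    System.out.println("original exhausted: " + (original.read() == -1));
    System.out.println("downstream reads: "
        + new String(readAll(copied.replacement), StandardCharsets.UTF_8));
  }
}
```

The trade-off is that the whole error body is buffered in memory, which seems acceptable for error responses but would be worth bounding if bodies can be large.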

Thanks @pxaws

Is there a way to verify in the test that the InputStream is available to the client and has the expected value?

Yes, I modified the test to do that. I defined another execution interceptor, registered with the AWS SDK after the TracingExecutionInterceptor used for SDK instrumentation, and verify from there that the response body contains the expected content. Please see the next commit. Thank you!
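
The shape of that check can be sketched without the SDK. Everything here is a stand-in: `InterceptorOrderCheck`, `runChain`, and `readingHook` are hypothetical, and the chain assumes that hooks run in registration order with each one receiving the stream returned by the previous one.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical model of two interceptors registered one after the other.
final class InterceptorOrderCheck {

  /** Reads the stream to the end and returns its bytes. */
  static byte[] drain(InputStream in) {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buffer = new byte[1024];
      int n;
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
      }
      return out.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Applies the hooks in registration order, feeding each one the previous hook's result. */
  static InputStream runChain(List<UnaryOperator<InputStream>> hooks, InputStream body) {
    InputStream current = body;
    for (UnaryOperator<InputStream> hook : hooks) {
      current = hook.apply(current);
    }
    return current;
  }

  /** A hook that reads the body, stores what it saw in sink, and returns a fresh copy. */
  static UnaryOperator<InputStream> readingHook(List<String> sink) {
    return in -> {
      byte[] bytes = drain(in);
      sink.add(new String(bytes, StandardCharsets.UTF_8));
      return new ByteArrayInputStream(bytes);
    };
  }

  public static void main(String[] args) {
    List<String> recordedByTracing = new ArrayList<>();
    List<String> seenByVerifier = new ArrayList<>();

    List<UnaryOperator<InputStream>> hooks = new ArrayList<>();
    hooks.add(readingHook(recordedByTracing)); // plays the tracing interceptor
    hooks.add(readingHook(seenByVerifier));    // plays the interceptor registered afterwards

    byte[] body = "Too many operations".getBytes(StandardCharsets.UTF_8);
    runChain(hooks, new ByteArrayInputStream(body));

    System.out.println("tracing recorded: " + recordedByTracing);
    System.out.println("verifier saw:     " + seenByVerifier);
  }
}
```

If the first hook returned its already-drained input instead of a fresh copy, the second hook would record an empty string, which is the regression the test guards against.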

pxaws · 2023-09-27T17:41:16Z

@kenfinnigan Could you review it again? Thank you!

kenfinnigan

Thanks for the changes @pxaws, looks good

trask

Thanks @kenfinnigan @rapphil for the reviews!

pxaws requested a review from a team September 12, 2023 23:15

pxaws marked this pull request as draft September 12, 2023 23:32

pxaws force-pushed the add-event-to-aws-sdk branch 3 times, most recently from b6c69d8 to 0a47a4f Compare September 19, 2023 18:18

github-actions bot requested a review from theletterf September 19, 2023 18:19

pxaws force-pushed the add-event-to-aws-sdk branch 7 times, most recently from 7ee8a0d to 36460a6 Compare September 19, 2023 23:00

add events to AWS SDK span

34a7171

pxaws force-pushed the add-event-to-aws-sdk branch from 36460a6 to 34a7171 Compare September 19, 2023 23:51

pxaws marked this pull request as ready for review September 20, 2023 00:23

rapphil reviewed Sep 20, 2023

.../src/main/java/io/opentelemetry/instrumentation/awssdk/v2_2/TracingExecutionInterceptor.java

rapphil reviewed Sep 20, 2023

.../src/main/java/io/opentelemetry/instrumentation/awssdk/v2_2/TracingExecutionInterceptor.java

rapphil reviewed Sep 20, 2023

.../groovy/io/opentelemetry/instrumentation/awssdk/v2_2/Aws2ClientNotRecordHttpErrorTest.groovy

rapphil reviewed Sep 20, 2023

remove dependency on 'commons-io:commons-io' and refactor

91a5f2c

write new tests in java

b8dc2b8

pxaws force-pushed the add-event-to-aws-sdk branch from f9ac488 to b8dc2b8 Compare September 21, 2023 23:51

rapphil approved these changes Sep 26, 2023

wangzlei approved these changes Sep 26, 2023

kenfinnigan reviewed Sep 26, 2023

use modifyHttpResponseContent hook to maintain the content in the response body

1a68d51

modify the test to verify the input stream of response body is still usable

3a347a5

kenfinnigan approved these changes Sep 29, 2023

trask approved these changes Sep 29, 2023

trask merged commit 2724a87 into open-telemetry:main Sep 29, 2023
