-
Notifications
You must be signed in to change notification settings - Fork 911
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance AWS SDK Instrumentation with Detailed HTTP Error Information #9448
Conversation
b6c69d8
to
0a47a4f
Compare
7ee8a0d
to
36460a6
Compare
36460a6
to
34a7171
Compare
.../src/main/java/io/opentelemetry/instrumentation/awssdk/v2_2/TracingExecutionInterceptor.java
Outdated
Show resolved
Hide resolved
.../src/main/java/io/opentelemetry/instrumentation/awssdk/v2_2/TracingExecutionInterceptor.java
Outdated
Show resolved
Hide resolved
.../groovy/io/opentelemetry/instrumentation/awssdk/v2_2/Aws2ClientNotRecordHttpErrorTest.groovy
Outdated
Show resolved
Hide resolved
@@ -49,6 +57,11 @@ final class TracingExecutionInterceptor implements ExecutionInterceptor { | |||
private final Instrumenter<ExecutionAttributes, SdkHttpResponse> requestInstrumenter; | |||
private final Instrumenter<ExecutionAttributes, SdkHttpResponse> consumerInstrumenter; | |||
private final boolean captureExperimentalSpanAttributes; | |||
private static final Logger logger = Logger.getLogger(PluginImplUtil.class.getName()); | |||
|
|||
static final AttributeKey<String> HTTP_ERROR_MSG = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any plans to make this attribute part of the semantic conventions for the aws sdk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is in the long term plan. Actually I am not quite sure which part of semantic conventions that I should put into because this is the event attribute not the span attribute (the previous link that you give seems to be the conventions for span attributes). Any suggestions?
is the http status code not enough for that purpose? 429 is the status code for throttling. Are there other examples where the error message returned by the backend might be useful? |
Recording only the error code is not ideal because users need to go and figure out the meaning for the code. I can't assume that everyone knows 429 means throttling :) In addition, an error code can correspond to multiple types of the error cases.
Yes. I would say all the error messages from backend is kind of useful because it can tell you more details about backend and sometimes indicate some client-side issue. For example, the following one from the dynamodb public doc: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html#Programming.Errors.MessagesAndCodes
|
f9ac488
to
b8dc2b8
Compare
@trask this LGTM and is ready for maintainers to review/merge. |
Span span = Span.fromContext(otelContext); | ||
SdkHttpResponse response = context.httpResponse(); | ||
|
||
if (!response.isSuccessful()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can response ever be null on the context? Should a null check be added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense. I will add a null check on response.
if (responseBody.isPresent()) { | ||
String errorMsg = | ||
new BufferedReader( | ||
new InputStreamReader(responseBody.get(), Charset.defaultCharset())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the underlying InputStream
still available for use by the SDK after calling responseBody.get()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for catching this!
I think there is no guarantee that the InputStream will support mark()/reset()
operations or seeking. So the next time if the same InputStream is read, it will not give you anything.
To verify, I did an simple experiment. Based on the pipeline stage in aws java sdk: https://github.com/aws/aws-sdk-java-v2/blob/d020d37138eee9d4d74e814086143d26d923fee0/core/sdk-core/src/main/java/software/amazon/awssdk/core/internal/http/pipeline/stages/AfterTransmissionExecutionInterceptorsStage.java#L43-L46, it seems like modifyHttpResponse
call is after afterTransmission
call. So I tested the same InputStream in modifyHttpResponse
hooks in ExecutionInterceptor and confirmed it's already empty (after it's read out in afterTransmission
).
It's a valid concern that the inputStream for the response body will become useless after we read it out in afterTransmission
hook. Fortunately there is another hook Optional<InputStream> modifyHttpResponseContent
which allow us to modify the content of a response. We can copy the content out and generate a new InputStream from it. A similar approach has already been implemented in the aws java adk: https://github.com/aws/aws-sdk-java-v2/blob/d020d37138eee9d4d74e814086143d26d923fee0/core/sdk-core/src/main/java/software/amazon/awssdk/core/internal/interceptor/HttpChecksumValidationInterceptor.java#L55-L73
I will make the change in next commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pxaws
Is there a way to verify in the test that the InputStream
is available to the client and has the expected value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I modified the test to do that. I basically defined another execution interceptor which is registered with AWS SDK after the TracingExecutionInterceptor used for SDK instrumentation and then verify from there to make sure the response body contains the expected content. Please see the next commit. Thank you!
@kenfinnigan Could you review it again? Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes @pxaws, looks good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kenfinnigan @rapphil for the reviews!
Reason:
The existing AWS SDK instrumentation generates a span for every SDK call. While this provides a general overview, there are scenarios where a more detailed insight into the SDK call is desired. Specifically, there is an interest in understanding individual HTTP requests, particularly their error codes and messages. Enhancing the span with this detailed information can be invaluable for debugging and monitoring purposes. Consider this scenario: We identify a specific SDK span with an unusually long duration and seek to understand the cause. At present, the SDK span lacks the necessary information for this analysis. However, by logging the error messages for each HTTP retry, we can deduce that the extended duration might be due to multiple retries, which are a result of backend throttling.
Implementation:
To achieve this, the pull request introduces the following changes:
afterTransmission
hook, part of the AWS SDK's ExecutionInterceptor, is leveraged to achieve this enhancement.afterTransmission
hook is triggered. Within this hook, the HTTP response is inspected, and the error code and message (if any) are extracted.experimental-record-individual-http-error
, which defaults tofalse
, to limit any potential impact. The described behaviors are only activated when this flag is set totrue
.