Communication with New Relic

All communication with New Relic MUST take place via the public telemetry ingest APIs. These APIs all share a common JSON format to provide a consistent experience across data types. SDK implementations MUST adhere to this common format when sending data to New Relic.

Request format

The SDK MUST use the telemetry ingest APIs to send data to New Relic. The SDK sends all telemetry of a given type to the ingest endpoint for that type.

  • SDKs MUST compress the JSON payload with gzip encoding by default.
  • Send API keys only as headers, never as query parameters.
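
The two requirements above can be sketched as follows. This is a minimal illustration, not the SDK's actual implementation: the `build_request` helper is hypothetical, and the `Api-Key` header name is an assumption that should be checked against the ingest API documentation.

```python
import gzip
import json


def build_request(payload, api_key):
    """Return (headers, body) for a gzip-compressed JSON POST."""
    # Serialize to JSON first, then gzip the UTF-8 bytes.
    body = gzip.compress(json.dumps(payload).encode("utf-8"))
    headers = {
        "Content-Type": "application/json",
        "Content-Encoding": "gzip",
        # Assumed header name; the key is never appended to the URL.
        "Api-Key": api_key,
    }
    return headers, body


headers, body = build_request([{"spans": []}], "example-key")
```

The returned headers and body can be handed to any HTTP client; the key point is that the key travels in a header and the body is compressed before sending.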

Request ID header

When communicating with data ingest services, there are 3 possible outcomes of the HTTP call:

  1. OK status (200 <= status_code < 300), indicating data has been received and persisted.
  2. non-OK status, indicating data has not been persisted
  3. disconnect (either client or server). In this case, the connection is closed prior to receiving a status indication.

In case (3) above, the client should only retry the request if the request is idempotent since data may or may not have been persisted (and thus the data may get recorded twice, resulting in the data aggregates being inaccurate).

To prevent data loss while allowing clients to retransmit in the case of transient failures, the ingest service must be able to identify duplicate requests; therefore, all SDKs MUST send the following HTTP header with the request:

| Header Name | Header Value | Code Example |
| --- | --- | --- |
| x-request-id | A version 4 UUID string | str(uuid.uuid4()) |

NOTE: The request ID should be generated before the first attempt to send the request and kept unchanged across any retries that transmit the same payload. If the SDK partitions the payload in response to a 413 status code, each partition should be sent with its own unique request ID.
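
A minimal sketch of this rule, assuming a hypothetical `send` callable that takes `(headers, payload)` and either returns an HTTP status code or raises `ConnectionError` on a disconnect. The function name and the `max_attempts` parameter are illustrative only.

```python
import uuid


def send_with_retries(send, payload, max_attempts=3):
    """Send payload, reusing one x-request-id across all attempts."""
    # The ID is generated once, before the first attempt.
    headers = {"x-request-id": str(uuid.uuid4())}
    for _ in range(max_attempts):
        try:
            return send(headers, payload)
        except ConnectionError:
            # Outcome unknown: retry with the same ID so the ingest
            # service can recognize a duplicate request.
            continue
    # Every attempt ended in a disconnect.
    return None
```

Because the same header dictionary is reused on every attempt, a request that was persisted before the disconnect can be identified as a duplicate when it is retransmitted.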

User Agent

The User-Agent header field is used to perform analytics on requests received by New Relic. In order to enable these analytics, all SDKs MUST include a User-Agent header in requests they make to New Relic. In addition to conforming to the specification defined in RFC 7231, the User-Agent header MUST include an SDK product identifier as its first entry.

User-Agent  = sdk-id *( RWS ( product / comment ) )
sdk-id      = sdk-name "/" sdk-version
sdk-name    = "NewRelic-" language "-TelemetrySDK"
sdk-version = token

The language portion of the sdk-name is the programming language the SDK is written for, and the sdk-version is the version of the SDK. The remaining syntax elements (RWS, product, comment, and token) have the meanings defined in RFC 7231 and RFC 7230.
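
As an illustration of the grammar, the following sketch checks that a User-Agent value starts with a well-formed sdk-id. It assumes that `language` is itself an RFC 7230 token, which the grammar above does not state; `parse_sdk_id` is a hypothetical helper.

```python
import re

# RFC 7230 token: one or more "tchar" characters.
TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
# sdk-id = "NewRelic-" language "-TelemetrySDK" "/" sdk-version,
# followed by whitespace (RWS) or the end of the header value.
SDK_ID = re.compile(
    r"NewRelic-(" + TOKEN + r")-TelemetrySDK/(" + TOKEN + r")(?=[ \t]|$)"
)


def parse_sdk_id(user_agent):
    """Return (language, sdk_version) from the first entry, or None."""
    match = SDK_ID.match(user_agent)
    return match.groups() if match else None


print(parse_sdk_id("NewRelic-Python-TelemetrySDK/0.1.0"))
```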

Extending User Agent with Exporter Product

Knowing which exporter was used to export data is another important dimension for analytics. Exporters that use the SDK need to be able to append a product identifier of their own to the User-Agent header. Therefore, all SDKs MUST provide a method to extend the User-Agent header field-value. This method SHOULD accept the exporter-determined product identifier as an argument. The exact form and the validity of this product identifier SHOULD be left to the exporter to determine.

An example of this User-Agent mutation functionality might look like the following.

class SDK(object):
    _user_agent = "NewRelic-Python-TelemetrySDK/0.1.0"

    def add_user_agent(self, product, product_version=None):
        """Add product to the User-Agent header field"""
        if product_version:
            product += "/{}".format(product_version)

        self._user_agent += " {}".format(product)
    ...

Then, when this SDK is used to build a NewRelic-Python-OpenCensus/0.2.1 exporter, the User-Agent header sent in a request would look like the following.

User-Agent: NewRelic-Python-TelemetrySDK/0.1.0 NewRelic-Python-OpenCensus/0.2.1

Payload

Payloads of different telemetry types cannot be combined.

All JSON payloads sent to New Relic MUST use the New Relic common format. This is an example of the common format:

[
  {
    "common": {
      <intrinsic attributes>
      "attributes" : {
          <custom attributes>
        }
    },
    "<spans|logs|metrics|events>" : [
      {
        <intrinsic attributes>,
        "timestamp": 1522434601409,
        "attributes" : {
          <custom attributes>
        }
      },
      {
        <intrinsic attributes>,
        "timestamp": 1522434601409,
        "attributes" : {
          <custom attributes>
        }
      } ]
  }
]

SDK implementations SHOULD use the top-level common block to reduce the size of repeated attributes in payloads when applicable.
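
One possible way to use the common block is to hoist attributes that every item shares with the same value. The sketch below is an assumption about how an SDK might do this, not a required algorithm; `build_payload` is a hypothetical helper and only handles custom attributes.

```python
_MISSING = object()


def build_payload(data_type, items):
    """Wrap items in the common format, hoisting attributes shared by all."""
    if not items:
        return [{data_type: []}]
    first = items[0].get("attributes", {})
    # An attribute is shared if every item has the same key and value.
    shared = {
        key: value
        for key, value in first.items()
        if all(
            item.get("attributes", {}).get(key, _MISSING) == value
            for item in items
        )
    }
    trimmed = []
    for item in items:
        attrs = item.get("attributes", {})
        remaining = {k: v for k, v in attrs.items() if k not in shared}
        trimmed.append({**item, "attributes": remaining})
    return [{"common": {"attributes": shared}, data_type: trimmed}]


metrics = [
    {"name": "a", "timestamp": 1, "attributes": {"host": "h1", "k": 1}},
    {"name": "b", "timestamp": 2, "attributes": {"host": "h1", "k": 2}},
]
payload = build_payload("metrics", metrics)
```

Here `host` appears once in the common block instead of once per metric, while the differing `k` values stay on the individual items.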

Response codes

The telemetry ingest API validates the basic shape of the request without looking at the POST body. Its responses are documented here.

SDK implementations must perform response code error handling in the Telemetry API as documented below. The Telemetry API should provide a mechanism for the consumer of this API to be notified of (or to react to) any error conditions that occur, rather than hiding all errors from the user.

| Response code | Description | Log error | Retry behavior | Drop data | Other |
| --- | --- | --- | --- | --- | --- |
| 200 - 299 | Successful request | | | | |
| 400 | Generally invalid request | once | no | yes | See: dropping data. |
| 401 | Unauthorized | once | no | yes | See: dropping data. |
| 403 | Authentication failure | once | no | yes | See: dropping data. |
| 404 | Incorrect path | once | no | yes | See: dropping data. |
| 405 | Incorrect HTTP method (POST required) | once | no | yes | See: dropping data. Should never occur in the Telemetry SDK but should still be handled |
| 408 | Request timeout | each failure | yes | not yet | |
| 409 | Conflict | once | no | yes | See: dropping data. |
| 410 | Gone | once | no | yes | See: dropping data. |
| 411 | Missing Content-Length header | once | no | yes | See: dropping data. Should never occur in the Telemetry SDK but should still be handled |
| 413 | Payload too large (1 MB limit) | each failure | split data and retry | no | See: splitting data |
| 429 | Too many requests | each failure | Retry based on Retry-After response header | no | Retry-After (integer) gives how long to wait, in seconds, until the next retry |
| Anything else | Unknown | each failure | Retry with backoff | not yet | See graceful degradation. |
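
The table can be expressed as a small dispatch function. This is a sketch under stated assumptions: the action names are invented for illustration, 408 is assumed to use the same backoff as unlisted codes, a missing Retry-After header is assumed to mean 0 seconds, and `headers` is assumed to be a plain dictionary with the exact key `Retry-After`.

```python
# Codes for which the table says: log once, do not retry, drop data.
DROP_CODES = {400, 401, 403, 404, 405, 409, 410, 411}


def classify_response(status, headers=None):
    """Map a status code to (action, retry_after_seconds)."""
    headers = headers or {}
    if 200 <= status < 300:
        return ("ok", None)
    if status in DROP_CODES:
        return ("drop", None)
    if status == 413:
        return ("split", None)
    if status == 429:
        # Retry-After is read as an integer number of seconds.
        return ("retry_after", int(headers.get("Retry-After", 0)))
    # 408 and anything unlisted: retry, here assumed to be with backoff.
    return ("retry_backoff", None)


print(classify_response(429, {"Retry-After": "30"}))
```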

Graceful degradation

The SDK may be unable to communicate with New Relic for a variety of reasons including network outages, misconfiguration or service outages. Telemetry SDKs must provide facilities to gracefully handle these failure cases or allow the consumer to handle them as they see fit. The SDKs must also provide functionality to make a request with no response handling or retrying.

The recommended handling of failed requests to the ingest API is to retry the request at increasing intervals and to eventually drop data if the request cannot be completed.

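The amount of time to wait before each retry can be computed using the logic below, where number_of_retries is the number of retries already attempted. As the sequences in this section show, the first retry (number_of_retries = 0) is sent without delay: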

MIN(backoff_max, backoff_factor * (2 ^ (number_of_retries - 1)))

For a backoff factor of 1 second, and a backoff max of 16 seconds, the retry delay interval should follow a pattern of [0, 1, 2, 4, 8, 16, 16, ...]. Subsequent retries should wait 16 seconds until the request has been retried the configured max retries number of times.
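
A minimal sketch of this computation. The function name is hypothetical, and the zero delay before the first retry is an assumption taken from the sequences shown in this section rather than from the formula itself.

```python
def backoff_delay(retries_so_far, backoff_factor, backoff_max):
    """Seconds to wait before the next retry."""
    if retries_so_far == 0:
        # Assumption: the first retry is sent immediately.
        return 0
    return min(backoff_max, backoff_factor * (2 ** (retries_so_far - 1)))


delays = [backoff_delay(n, 1, 16) for n in range(7)]
print(delays)  # [0, 1, 2, 4, 8, 16, 16]
```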

The total retry duration follows from the backoff factor, backoff max, and max retries. SDKs may therefore provide a function to configure retry behavior by specifying the total retry duration instead of max retries.

Backoff example:

  • Backoff factor = 5 seconds
  • Backoff max = 80 seconds
  • Max retries = 8
  • Backoff sequence = [0, 5, 10, 20, 40, 80, 80, 80]
  1. The telemetry SDK attempts to send a payload at t=13:00:00, and receives a 500 response.
  2. The telemetry SDK attempts to send again at
    • +0 : 13:00:00
    • +5 : 13:00:05
    • +10 : 13:00:15
    • +20 : 13:00:35
    • +40 : 13:01:15
    • +80 : 13:02:35
    • +80 : 13:03:55
    • +80 : 13:05:15
    • -- max retries exceeded. The data in this request should be dropped. See dropping data.
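
The timeline in this example can be reproduced with a short sketch. The `retry_schedule` helper and the calendar date are arbitrary choices for illustration; only the time of day matters.

```python
from datetime import datetime, timedelta


def retry_schedule(first_attempt, backoff_factor, backoff_max, max_retries):
    """Return the wall-clock time of each retry after a failed first attempt."""
    times = []
    current = first_attempt
    for n in range(max_retries):
        # First retry is immediate; later delays double up to backoff_max.
        delay = 0 if n == 0 else min(backoff_max, backoff_factor * 2 ** (n - 1))
        current += timedelta(seconds=delay)
        times.append(current)
    return times


start = datetime(2020, 1, 1, 13, 0, 0)
schedule = retry_schedule(start, backoff_factor=5, backoff_max=80, max_retries=8)
print([t.strftime("%H:%M:%S") for t in schedule])
```

The last retry lands 315 seconds after the first attempt, which is the total retry duration for this configuration.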

Dropping data

Whenever it drops data, the SDK must emit an error-level log statement indicating the number of data points dropped.

SDKs should not attempt to merge a failed payload with the rest of the data being stored by the SDK.

SDKs may provide functionality for users to provide their own handler for dropped data, so that a user of the SDK may merge unsent data back into their own data collector in the way that makes sense for their use case.
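
A sketch of how the logging requirement and an optional user-supplied handler might fit together. The logger name, the `drop_data` function, and the `on_drop` parameter are hypothetical.

```python
import logging

logger = logging.getLogger("telemetry_sdk_example")  # hypothetical name


def drop_data(items, on_drop=None):
    """Log how many data points are dropped, then call the optional handler."""
    logger.error("Dropping %d data points", len(items))
    if on_drop is not None:
        # The user decides whether and how to merge the unsent data
        # back into their own collector.
        on_drop(items)
    return len(items)


recovered = []
dropped = drop_data(["span-1", "span-2"], on_drop=recovered.extend)
```

The SDK itself never merges the failed payload back into its own buffers; it only passes the items to the handler.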

Splitting data

The New Relic ingest API may return an HTTP 413 (payload too large). The SDK must ensure that data which is, or would be, rejected because of payload size is still sent successfully to New Relic.

Some strategies include:

  • Preemptively splitting large payloads.
  • Splitting and retrying requests in response to an HTTP 413.

If a request results in an HTTP 413, and the payload of that request cannot be split, the SDK should drop the data. See dropping data.
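
The second strategy can be sketched as a recursive halving of the batch. This assumes a hypothetical `send` callable that takes `(request_id, items)` and returns an HTTP status code; handling of other error codes is left out for brevity.

```python
import uuid


def send_with_split(send, items):
    """Send items, halving the batch on each HTTP 413.

    Returns the items that could not be sent and should be dropped.
    """
    # Each partition is a new payload, so it gets its own request ID.
    status = send(str(uuid.uuid4()), items)
    if status != 413:
        return []
    if len(items) <= 1:
        # Cannot be split further: the caller should drop and log it.
        return list(items)
    middle = len(items) // 2
    return (
        send_with_split(send, items[:middle])
        + send_with_split(send, items[middle:])
    )
```

Halving is only one possible partitioning choice; an SDK could instead split preemptively based on the serialized size of each item.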