
Prometheus server running linkerd does not scrape all endpoints with EOF error #2067

Closed
gamer22026 opened this issue Jan 11, 2019 · 6 comments

Comments

@gamer22026

gamer22026 commented Jan 11, 2019

Bug Report

What is the issue?

When linkerd is running on a Prometheus server, not all endpoints are scraped successfully.

How can it be reproduced?

Unsure

Logs, error output, etc

2019-01-11T16:57:00.029203704Z DBUG proxy={server=out listen=127.0.0.1:4140 remote=100.96.66.194:49276} linkerd2_proxy::app::main outbound addr=Some(Socket(V4(10.101.224.11:61623)))
2019-01-11T16:57:00.029240734Z DBUG 10.101.224.11:61623 linkerd2_proxy::app::main outbound dst=Some(DstAddr { addr: Socket(V4(10.101.224.11:61623)), direction: Out })
2019-01-11T16:57:00.029250383Z DBUG 10.101.224.11:61623 linkerd2_proxy::proxy::http::client client request: method=GET uri=http://10.101.224.11:61623/metrics version=HTTP/1.1 headers={"host": "10.101.224.11:61623", "user-agent": "Prometheus/2.4.0", "accept": "text/plain;version=0.0.4;q=1,*/*;q=0.1", "accept-encoding": "gzip", "x-prometheus-scrape-timeout-seconds": "10.000000", "l5d-dst-canonical": "10.101.224.11:61623"}
2019-01-11T16:57:00.252847818Z WARN proxy={server=out listen=127.0.0.1:4140 remote=100.96.66.194:49276} hyper::proto::h1::role transfer-encoding and content-length both found, canceling

2019-01-11T16:57:01.642302959Z DBUG proxy={server=out listen=127.0.0.1:4140 remote=100.96.66.194:37732} linkerd2_proxy::app::main outbound addr=Some(Socket(V4(10.101.224.11:9100)))
2019-01-11T16:57:01.642335682Z DBUG 10.101.224.11:9100 linkerd2_proxy::app::main outbound dst=Some(DstAddr { addr: Socket(V4(10.101.224.11:9100)), direction: Out })
2019-01-11T16:57:01.6423418Z DBUG 10.101.224.11:9100 linkerd2_proxy::proxy::http::client client request: method=GET uri=http://10.101.224.11:9100/metrics version=HTTP/1.1 headers={"host": "10.101.224.11:9100", "user-agent": "Prometheus/2.4.0", "accept": "text/plain;version=0.0.4;q=1,*/*;q=0.1", "accept-encoding": "gzip", "x-prometheus-scrape-timeout-seconds": "10.000000", "l5d-dst-canonical": "10.101.224.11:9100"}

linkerd check output

kubernetes-api: can initialize the client..................................[ok]
kubernetes-api: can query the Kubernetes API...............................[ok]
kubernetes-api: is running the minimum Kubernetes API version..............[ok]
linkerd-existence: control plane namespace exists..........................[ok]
linkerd-existence: controller pod is running...............................[ok]
linkerd-existence: can initialize the client...............................[ok]
linkerd-existence: can query the control plane API.........................[ok]
linkerd-api: control plane pods are ready..................................[ok]
linkerd-api: can query the control plane API...............................[ok]
linkerd-api[kubernetes]: control plane can talk to Kubernetes..............[ok]
linkerd-api[prometheus]: control plane can talk to Prometheus..............[ok]
linkerd-api: no invalid service profiles...................................[ok]
linkerd-version: can determine the latest version..........................[ok]
linkerd-version: cli is up-to-date.........................................[ok]
linkerd-version: control plane is up-to-date...............................[ok]

Status check results are [ok]

Environment

  • Kubernetes Version: 1.10.11
  • Prometheus Version: 2.4.0
  • Cluster Environment: kops
  • Host OS: Debian GNU/Linux 8
  • Linkerd version: edge-19.1.1

Possible solution

Additional context

The endpoints I am scraping are not running linkerd.
While I am seeing this issue on many endpoints in my existing Prometheus server, I took one specific case to test with: a Cassandra server that has two scrapable endpoints, 9100 (node_exporter) and 61623 (Cassandra metrics).

(screenshot: Prometheus targets page showing the two endpoints, one up and one down)

As you can see, one works and one does not. If I remove the linkerd proxy from the prometheus-server, then all endpoints work as they should. From the logs, the only difference is the "transfer-encoding and content-length both found, canceling" warning from hyper::proto::h1::role that shows up on the scrape to 61623.

@gamer22026
Author

gamer22026 commented Jan 11, 2019

An interesting pattern emerged as I tested endpoints. Only Prometheus endpoints that are Java based are failing (Kafka, Cassandra, some of our Java-based internal apps). All other endpoints are OK (Node.js, Go). Not quite sure what that points to. Some unhandled difference in the Prometheus Java client code?
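One way to check which targets are affected, independent of the proxy, is to dump the raw response head of each endpoint and look for both framing headers at once. A minimal sketch, assuming nothing beyond what is in this thread (class name, default host, and port are illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the issue: print the raw status line and
// headers of a metrics endpoint so a Transfer-Encoding / Content-Length
// conflict is visible without any proxy in between.
public class DumpMetricsHead {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "100.96.49.5"; // illustrative default
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 61622;
        try (Socket s = new Socket(host, port)) {
            OutputStream out = s.getOutputStream();
            out.write(("GET /metrics HTTP/1.1\r\nHost: " + host + ":" + port
                    + "\r\nAccept: text/plain\r\nConnection: close\r\n\r\n")
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(s.getInputStream(), StandardCharsets.US_ASCII));
            String line;
            // Print only the response head; stop at the blank line before the body.
            while ((line = in.readLine()) != null && !line.isEmpty()) {
                System.out.println(line);
            }
        }
    }
}

The tcpdump captures below show the same thing at the packet level.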

@gamer22026
Author

tcpdump from the prometheus-server running linkerd to one of the failing Java endpoints:

.._.....GET /metrics HTTP/1.1
host: 100.96.49.5:61622
user-agent: Prometheus/2.4.0
accept: text/plain;version=0.0.4;q=1,*/*;q=0.1
accept-encoding: gzip
x-prometheus-scrape-timeout-seconds: 10.000000
l5d-dst-canonical: 100.96.49.5:61622


18:24:32.581196 IP (tos 0x0, ttl 64, id 49263, offset 0, flags [DF], proto TCP (6), length 52)
    100.96.49.5.61622 > 100.96.66.194.45308: Flags [.], cksum 0x3cae (incorrect -> 0x6ac2), ack 238, win 55, options [nop,nop,TS val 497081503 ecr 486432735], length 0
E..4.o@.@.=.d`1.d`B................7<......
......_.
18:24:33.375679 IP (tos 0x0, ttl 64, id 37360, offset 0, flags [DF], proto TCP (6), length 240)
    100.96.49.5.61622 > 100.96.54.7.42470: Flags [P.], cksum 0x30af (incorrect -> 0xa4ca), seq 1:189, ack 246, win 55, options [nop,nop,TS val 497081701 ecr 492873704], length 188
E.....@.@.xKd`1.d`6.....@1M........70......
...e.`..HTTP/1.1 200 OK
Content-encoding: gzip
Date: Fri, 11 Jan 2019 18:24:33 GMT
Transfer-encoding: chunked
Content-type: text/plain; version=0.0.4; charset=utf-8
Content-length: 381145
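The response above carries both Transfer-encoding: chunked and Content-length, exactly the combination hyper refuses in the proxy's "transfer-encoding and content-length both found, canceling" warning; RFC 7230 forbids sending both, since the two framings can disagree. A minimal sketch of a server that reproduces that response shape, assuming only what the capture shows (class name, port, and body are illustrative):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical repro, not taken from the issue: a bare-bones responder that
// sends BOTH Transfer-Encoding: chunked and Content-Length on every response.
public class ConflictingHeadersServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(61623)) { // illustrative port
            while (true) {
                try (Socket client = server.accept()) {
                    InputStream in = client.getInputStream();
                    in.skip(in.available()); // discard the request; only the response matters here

                    String chunkedBody = "5\r\nhello\r\n0\r\n\r\n"; // one 5-byte chunk, then terminator
                    String head = "HTTP/1.1 200 OK\r\n"
                            + "Content-Type: text/plain; version=0.0.4; charset=utf-8\r\n"
                            + "Transfer-Encoding: chunked\r\n"
                            + "Content-Length: 5\r\n" // conflicts with the chunked framing
                            + "\r\n";
                    OutputStream out = client.getOutputStream();
                    out.write((head + chunkedBody).getBytes(StandardCharsets.US_ASCII));
                    out.flush();
                }
            }
        }
    }
}

Scraping a target like this through the proxy should reproduce the canceled scrape, while a direct scrape succeeds, matching what the targets page shows.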

@gamer22026
Author

tcpdump from the prometheus-server without linkerd to the same Java endpoint:

.`.....qGET /metrics HTTP/1.1
Host: 100.96.49.5:61622
User-Agent: Prometheus/2.4.0
Accept: application/openmetrics-text; version=0.0.1,text/plain;version=0.0.4;q=0.5,*/*;q=0.1
Accept-Encoding: gzip
X-Prometheus-Scrape-Timeout-Seconds: 30.000000


18:24:32.400362 IP (tos 0x0, ttl 64, id 37359, offset 0, flags [DF], proto TCP (6), length 52)
    100.96.49.5.61622 > 100.96.54.7.42470: Flags [.], cksum 0x2ff3 (incorrect -> 0xce9d), ack 246, win 55, options [nop,nop,TS val 497081458 ecr 492873704], length 0
E..4..@.@.y.d`1.d`6.....@1M........7/......
...r.`..
18:24:32.580685 IP (tos 0x0, ttl 62, id 35373, offset 0, flags [DF], proto TCP (6), length 60)
    100.96.66.194.45308 > 100.96.49.5.61622: Flags [S], cksum 0xde55 (correct), seq 3004487154, win 26733, options [mss 8911,sackOK,TS val 486432735 ecr 0,nop,wscale 9], length 0
E..<.-@.>.v.d`B.d`1...............hm.U...."....
.._........
18:24:32.580711 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    100.96.49.5.61622 > 100.96.66.194.45308: Flags [S.], cksum 0x3cb6 (incorrect -> 0xb7b3), seq 260447471, ack 3004487155, win 26697, options [mss 8911,sackOK,TS val 497081503 ecr 486432735,nop,wscale 9], length 0
E..<..@.@..4d`1.d`B...............hI<....."....
......_....
18:24:32.580968 IP (tos 0x0, ttl 62, id 35374, offset 0, flags [DF], proto TCP (6), length 52)
    100.96.66.194.45308 > 100.96.49.5.61622: Flags [.], cksum 0x6bb1 (correct), ack 1, win 53, options [nop,nop,TS val 486432735 ecr 497081503], length 0
E..4..@.>.v.d`B.d`1................5k......
.._.....
18:24:32.581181 IP (tos 0x0, ttl 62, id 35375, offset 0, flags [DF], proto TCP (6), length 289)
    100.96.66.194.45308 > 100.96.49.5.61622: Flags [P.], cksum 0x2c15 (correct), seq 1:238, ack 1, win 53, options [nop,nop,TS val 486432735 ecr 497081503], length 237
E..!./@.>.u d`B.d`1................5,......

@siggy
Member

siggy commented Jan 14, 2019

@gamer22026 thanks for all the detail! is it possible for you to provide us with a docker image of something like your java app that reproduces the issue?

@gamer22026
Author

This was an issue with the Prometheus Java client:
prometheus/client_java#412
prometheus/client_java#413

Updating to the latest Prometheus Java client version (0.6.0) fixes the issue.
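For anyone hitting the same thing, the fix amounts to bumping the client dependency. A sketch assuming a Maven build and the io.prometheus simpleclient artifacts (the exact artifact list depends on how the exporter is wired, so treat these as illustrative):

<!-- pin the Prometheus Java client to 0.6.0 or later -->
<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient</artifactId>
  <version>0.6.0</version>
</dependency>
<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient_httpserver</artifactId>
  <version>0.6.0</version>
</dependency>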

@wmorgan
Member

wmorgan commented Jan 15, 2019

Great!

github-actions bot locked as resolved and limited conversation to collaborators on Jul 18, 2021