
Prometheus server running linkerd does not scrape all endpoints with EOF error #2067

Closed
gamer22026 opened this issue Jan 11, 2019 · 6 comments

Comments

@gamer22026

gamer22026 commented Jan 11, 2019

Bug Report

What is the issue?

When linkerd is running on a Prometheus server, not all endpoints are scraped successfully.

How can it be reproduced?

Unsure

Logs, error output, etc

2019-01-11T16:57:00.029203704Z DBUG proxy={server=out listen=127.0.0.1:4140 remote=100.96.66.194:49276} linkerd2_proxy::app::main outbound addr=Some(Socket(V4(10.101.224.11:61623)))
2019-01-11T16:57:00.029240734Z DBUG 10.101.224.11:61623 linkerd2_proxy::app::main outbound dst=Some(DstAddr { addr: Socket(V4(10.101.224.11:61623)), direction: Out })
2019-01-11T16:57:00.029250383Z DBUG 10.101.224.11:61623 linkerd2_proxy::proxy::http::client client request: method=GET uri=http://10.101.224.11:61623/metrics version=HTTP/1.1 headers={"host": "10.101.224.11:61623", "user-agent": "Prometheus/2.4.0", "accept": "text/plain;version=0.0.4;q=1,*/*;q=0.1", "accept-encoding": "gzip", "x-prometheus-scrape-timeout-seconds": "10.000000", "l5d-dst-canonical": "10.101.224.11:61623"}
2019-01-11T16:57:00.252847818Z WARN proxy={server=out listen=127.0.0.1:4140 remote=100.96.66.194:49276} hyper::proto::h1::role transfer-encoding and content-length both found, canceling

2019-01-11T16:57:01.642302959Z DBUG proxy={server=out listen=127.0.0.1:4140 remote=100.96.66.194:37732} linkerd2_proxy::app::main outbound addr=Some(Socket(V4(10.101.224.11:9100)))
2019-01-11T16:57:01.642335682Z DBUG 10.101.224.11:9100 linkerd2_proxy::app::main outbound dst=Some(DstAddr { addr: Socket(V4(10.101.224.11:9100)), direction: Out })
2019-01-11T16:57:01.6423418Z DBUG 10.101.224.11:9100 linkerd2_proxy::proxy::http::client client request: method=GET uri=http://10.101.224.11:9100/metrics version=HTTP/1.1 headers={"host": "10.101.224.11:9100", "user-agent": "Prometheus/2.4.0", "accept": "text/plain;version=0.0.4;q=1,*/*;q=0.1", "accept-encoding": "gzip", "x-prometheus-scrape-timeout-seconds": "10.000000", "l5d-dst-canonical": "10.101.224.11:9100"}

linkerd check output

kubernetes-api: can initialize the client..................................[ok]
kubernetes-api: can query the Kubernetes API...............................[ok]
kubernetes-api: is running the minimum Kubernetes API version..............[ok]
linkerd-existence: control plane namespace exists..........................[ok]
linkerd-existence: controller pod is running...............................[ok]
linkerd-existence: can initialize the client...............................[ok]
linkerd-existence: can query the control plane API.........................[ok]
linkerd-api: control plane pods are ready..................................[ok]
linkerd-api: can query the control plane API...............................[ok]
linkerd-api[kubernetes]: control plane can talk to Kubernetes..............[ok]
linkerd-api[prometheus]: control plane can talk to Prometheus..............[ok]
linkerd-api: no invalid service profiles...................................[ok]
linkerd-version: can determine the latest version..........................[ok]
linkerd-version: cli is up-to-date.........................................[ok]
linkerd-version: control plane is up-to-date...............................[ok]

Status check results are [ok]

Environment

  • Kubernetes Version: 1.10.11
  • Prometheus Version: 2.4.0
  • Cluster Environment: kops
  • Host OS: Debian GNU/Linux 8
  • Linkerd version: edge-19.1.1

Possible solution

Additional context

The endpoints I am scraping are not running linkerd.
While I am seeing this issue on many endpoints in my existing Prometheus server, I took one specific case to test with: a Cassandra server that has two scrapable endpoints, 9100 (node_exporter) and 61623 (Cassandra metrics).

(screenshot: Prometheus targets page showing the two endpoints, one up and one down)

As you can see, one works and one does not. If I remove the linkerd proxy from the prometheus-server, then all endpoints work as they should. From the logs, the only difference is the "transfer-encoding and content-length both found, canceling" warning from hyper::proto::h1::role that shows up on the scrape to 61623.

@gamer22026
Author

gamer22026 commented Jan 11, 2019

An interesting pattern emerged as I tested endpoints. Only Prometheus endpoints that are Java based are failing (Kafka, Cassandra, some of our Java-based internal apps). All other endpoints are OK (Node.js, Go). Not quite sure what that points to. Some unhandled difference in the Prometheus Java client code?
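One way to check which targets are affected, independent of the proxy, is to dump the raw response head of each endpoint and look for both framing headers at once. A minimal sketch, assuming nothing beyond what is in this thread (class name, default host, and port are illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the issue: print the raw status line and
// headers of a metrics endpoint so a Transfer-Encoding / Content-Length
// conflict is visible without any proxy in between.
public class DumpMetricsHead {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "100.96.49.5"; // illustrative default
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 61622;
        try (Socket s = new Socket(host, port)) {
            OutputStream out = s.getOutputStream();
            out.write(("GET /metrics HTTP/1.1\r\nHost: " + host + ":" + port
                    + "\r\nAccept: text/plain\r\nConnection: close\r\n\r\n")
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(s.getInputStream(), StandardCharsets.US_ASCII));
            String line;
            // Print only the response head; stop at the blank line before the body.
            while ((line = in.readLine()) != null && !line.isEmpty()) {
                System.out.println(line);
            }
        }
    }
}

The tcpdump captures below show the same thing at the packet level.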

@gamer22026
Author

tcpdump from the prometheus-server running linkerd to one of the failing Java endpoints:

.._.....GET /metrics HTTP/1.1
host: 100.96.49.5:61622
user-agent: Prometheus/2.4.0
accept: text/plain;version=0.0.4;q=1,*/*;q=0.1
accept-encoding: gzip
x-prometheus-scrape-timeout-seconds: 10.000000
l5d-dst-canonical: 100.96.49.5:61622


18:24:32.581196 IP (tos 0x0, ttl 64, id 49263, offset 0, flags [DF], proto TCP (6), length 52)
    100.96.49.5.61622 > 100.96.66.194.45308: Flags [.], cksum 0x3cae (incorrect -> 0x6ac2), ack 238, win 55, options [nop,nop,TS val 497081503 ecr 486432735], length 0
E..4.o@.@.=.d`1.d`B................7<......
......_.
18:24:33.375679 IP (tos 0x0, ttl 64, id 37360, offset 0, flags [DF], proto TCP (6), length 240)
    100.96.49.5.61622 > 100.96.54.7.42470: Flags [P.], cksum 0x30af (incorrect -> 0xa4ca), seq 1:189, ack 246, win 55, options [nop,nop,TS val 497081701 ecr 492873704], length 188
E.....@.@.xKd`1.d`6.....@1M........70......
...e.`..HTTP/1.1 200 OK
Content-encoding: gzip
Date: Fri, 11 Jan 2019 18:24:33 GMT
Transfer-encoding: chunked
Content-type: text/plain; version=0.0.4; charset=utf-8
Content-length: 381145
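The response above carries both Transfer-encoding: chunked and Content-length, exactly the combination hyper refuses in the proxy's "transfer-encoding and content-length both found, canceling" warning; RFC 7230 forbids sending both, since the two framings can disagree. A minimal sketch of a server that reproduces that response shape, assuming only what the capture shows (class name, port, and body are illustrative):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical repro, not taken from the issue: a bare-bones responder that
// sends BOTH Transfer-Encoding: chunked and Content-Length on every response.
public class ConflictingHeadersServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(61623)) { // illustrative port
            while (true) {
                try (Socket client = server.accept()) {
                    InputStream in = client.getInputStream();
                    in.skip(in.available()); // discard the request; only the response matters here

                    String chunkedBody = "5\r\nhello\r\n0\r\n\r\n"; // one 5-byte chunk, then terminator
                    String head = "HTTP/1.1 200 OK\r\n"
                            + "Content-Type: text/plain; version=0.0.4; charset=utf-8\r\n"
                            + "Transfer-Encoding: chunked\r\n"
                            + "Content-Length: 5\r\n" // conflicts with the chunked framing
                            + "\r\n";
                    OutputStream out = client.getOutputStream();
                    out.write((head + chunkedBody).getBytes(StandardCharsets.US_ASCII));
                    out.flush();
                }
            }
        }
    }
}

Scraping a target like this through the proxy should reproduce the canceled scrape, while a direct scrape succeeds, matching what the targets page shows.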

@gamer22026
Author

tcpdump from the prometheus-server without linkerd to the same Java endpoint:

.`.....qGET /metrics HTTP/1.1
Host: 100.96.49.5:61622
User-Agent: Prometheus/2.4.0
Accept: application/openmetrics-text; version=0.0.1,text/plain;version=0.0.4;q=0.5,*/*;q=0.1
Accept-Encoding: gzip
X-Prometheus-Scrape-Timeout-Seconds: 30.000000


18:24:32.400362 IP (tos 0x0, ttl 64, id 37359, offset 0, flags [DF], proto TCP (6), length 52)
    100.96.49.5.61622 > 100.96.54.7.42470: Flags [.], cksum 0x2ff3 (incorrect -> 0xce9d), ack 246, win 55, options [nop,nop,TS val 497081458 ecr 492873704], length 0
E..4..@.@.y.d`1.d`6.....@1M........7/......
...r.`..
18:24:32.580685 IP (tos 0x0, ttl 62, id 35373, offset 0, flags [DF], proto TCP (6), length 60)
    100.96.66.194.45308 > 100.96.49.5.61622: Flags [S], cksum 0xde55 (correct), seq 3004487154, win 26733, options [mss 8911,sackOK,TS val 486432735 ecr 0,nop,wscale 9], length 0
E..<.-@.>.v.d`B.d`1...............hm.U...."....
.._........
18:24:32.580711 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    100.96.49.5.61622 > 100.96.66.194.45308: Flags [S.], cksum 0x3cb6 (incorrect -> 0xb7b3), seq 260447471, ack 3004487155, win 26697, options [mss 8911,sackOK,TS val 497081503 ecr 486432735,nop,wscale 9], length 0
E..<..@.@..4d`1.d`B...............hI<....."....
......_....
18:24:32.580968 IP (tos 0x0, ttl 62, id 35374, offset 0, flags [DF], proto TCP (6), length 52)
    100.96.66.194.45308 > 100.96.49.5.61622: Flags [.], cksum 0x6bb1 (correct), ack 1, win 53, options [nop,nop,TS val 486432735 ecr 497081503], length 0
E..4..@.>.v.d`B.d`1................5k......
.._.....
18:24:32.581181 IP (tos 0x0, ttl 62, id 35375, offset 0, flags [DF], proto TCP (6), length 289)
    100.96.66.194.45308 > 100.96.49.5.61622: Flags [P.], cksum 0x2c15 (correct), seq 1:238, ack 1, win 53, options [nop,nop,TS val 486432735 ecr 497081503], length 237
E..!./@.>.u d`B.d`1................5,......

@siggy
Member

siggy commented Jan 14, 2019

@gamer22026 thanks for all the detail! is it possible for you to provide us with a docker image of something like your java app that reproduces the issue?

@gamer22026
Author

This was an issue with the Prometheus Java client:
prometheus/client_java#412
prometheus/client_java#413

Updating to the latest Prometheus Java client version (0.6.0) fixes the issue.
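For anyone hitting the same thing, the fix amounts to bumping the client dependency. A sketch assuming a Maven build and the io.prometheus simpleclient artifacts (the exact artifact list depends on how the exporter is wired, so treat these as illustrative):

<!-- pin the Prometheus Java client to 0.6.0 or later -->
<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient</artifactId>
  <version>0.6.0</version>
</dependency>
<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient_httpserver</artifactId>
  <version>0.6.0</version>
</dependency>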

@wmorgan
Member

wmorgan commented Jan 15, 2019

Great!

github-actions bot locked as resolved and limited conversation to collaborators on Jul 18, 2021