
Performance degradation with slow consumers [v2.9.X, v2.10.X] #5394

Open · nenych opened this issue May 7, 2024 · 11 comments · May be fixed by #6568

Labels: defect (Suspected defect such as a bug or regression)

@nenych commented May 7, 2024

Observed behavior

Performance degrades after a slow consumer connects.
As shown below, incoming message throughput drops by about 30% when the first slow consumer connects, and by about 50% after a second one connects.

[Screenshot: CleanShot 2024-05-07 at 16 58 44@2x]

Expected behavior

Stop sending messages to slow consumers until their buffers drain, without slowing down the rest of the server.

Server and client version

Server: 2.9.20
Python library: nats-py 2.7.2

Host environment

Local:
macOS 14.4.1, arm64, Docker 26.0.0
The same behavior occurs with the amd64 emulator (--platform=linux/amd64 flag).

GKE
Container-Optimized OS, amd64, containerd

Steps to reproduce

I have prepared the required configs and a docker-compose file that starts NATS, Prometheus, an exporter, and two consumers: https://github.com/nenych/nats-test.

Steps to run

  1. Clone the repository.
  2. Build the docker image:
     docker build -t test/nats:latest .
  3. Install the NATS CLI: https://docs.nats.io/using-nats/nats-tools/nats_cli
  4. Run docker-compose (starts NATS, Prometheus, and 2 consumers):
     docker-compose -f ./docker-compose.yaml up -d
  5. Start the NATS benchmark:
     nats bench updates --pub=4 --msgs 1000000000 --size=1000
  6. Wait a little and start the slow consumer (see the sketch after this list):
     docker run --rm -it --network=nats-test_default test/nats:latest python3 slow-consumer.py
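
For reference, the slow consumer is just a subscriber that does not keep up with the publish rate. Below is a minimal sketch of what such a script might look like with nats-py; the actual slow-consumer.py in the linked repo may differ, and the server URL and the `updates` subject are assumptions taken from the docker-compose setup and the benchmark command above.

```python
import asyncio
import time

import nats


async def main():
    # Connect to the NATS server started by docker-compose
    # (the hostname/URL is an assumption).
    nc = await nats.connect("nats://nats:4222")

    async def handler(msg):
        # Deliberately block the event loop so the client stops draining
        # its socket; the server's outbound buffer for this connection
        # fills up and the connection is treated as a slow consumer.
        time.sleep(1)

    # Subscribe to the same subject the benchmark publishes on.
    await nc.subscribe("updates", cb=handler)

    # Keep the program running.
    await asyncio.Event().wait()


if __name__ == "__main__":
    asyncio.run(main())
```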

Explore metrics

  1. Open Prometheus: http://localhost:9091/graph
  2. Insert the query:
     sum by (job) (rate(nats_varz_in_msgs[30s]))
nenych added the defect (Suspected defect such as a bug or regression) label on May 7, 2024
@ripienaar (Contributor) commented

Server 2.9.20 is now quite a while out of date; let us know how the latest 2.10 works for you.

@nenych (Author) commented May 7, 2024

Below you can see the same test with NATS 2.10.14; with this version the results are even worse:

[Screenshot: CleanShot 2024-05-07 at 18 36 04@2x]

@kam1kaze commented

Any updates here? We have the same issue on our cluster. Thanks

wallyqs changed the title from "Performance degradation" to "Performance degradation on v2.9.20" on Aug 28, 2024
wallyqs changed the title from "Performance degradation on v2.9.20" to "Performance degradation [v2.9.20, v2.10.14]" on Aug 28, 2024
@nenych (Author) commented Sep 30, 2024

Hello, we are still observing this issue: when at least one slow consumer is detected, we see up to 90% performance degradation.
In the screenshot below we had one slow pod.
Server version: 2.10.19-RC.3-alpine3.20
[Screenshot: CleanShot 2024-09-30 at 11 07 55@2x]

@derekcollison (Member) commented

@nenych are you a Synadia customer?

@nenych (Author) commented Oct 2, 2024

@derekcollison No, I am not.

@derekcollison (Member) commented

No worries, we will always do our best to help out the ecosystem. We do prioritize customers of course.

As a next step, I think we would need to do a video call with you to really understand what is going on.

@nenych (Author) commented Oct 3, 2024

@derekcollison Sure, we can have a video call. Right now we have some test infrastructure where we can show you the problem and our findings.

@derekcollison (Member) commented

Will see if @wallyqs has some time to jump on a call.

@wallyqs (Member) commented Oct 3, 2024

Hi @nenych, ping me at wally@nats.io when you are available and we can have a look.

wallyqs changed the title from "Performance degradation [v2.9.20, v2.10.14]" to "Performance degradation with slow consumers [v2.9.20, v2.10.14]" on Oct 4, 2024
@kozlovic (Member) commented

I think this is simply the server detecting consumer(s) that are falling behind and stalling the fast producers. Running the server in debug mode (-D) should show you messages similar to "Timed out of fast producer stall (100ms)". The stall affects inbound messages from producers on that subject (that is, the subject matching the slow consumer(s)), but it does not affect other producers that send to non-slow consumers (aside from the fact that there is a maximum the server can handle, so per-producer performance may decrease while overall inbound performance increases or is maintained).

That has always been the case (although we have tweaked the stalling approach over the years).
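
To correlate the throughput drop with slow-consumer detection, one option is to poll the server's HTTP monitoring endpoint, which exposes both the in_msgs and slow_consumers counters in /varz. This is a sketch, not part of the original reproduction, and it assumes the monitoring port (8222) is exposed by the docker-compose setup.

```python
import json
import time
import urllib.request

# Poll the NATS monitoring endpoint once per second and print the inbound
# message rate alongside the slow_consumers counter from /varz.
VARZ_URL = "http://localhost:8222/varz"  # assumes monitoring port is exposed

prev_in_msgs = None
while True:
    with urllib.request.urlopen(VARZ_URL) as resp:
        varz = json.load(resp)
    in_msgs = varz["in_msgs"]
    if prev_in_msgs is not None:
        print(f"in_msgs/s={in_msgs - prev_in_msgs:>10}  "
              f"slow_consumers={varz['slow_consumers']}")
    prev_in_msgs = in_msgs
    time.sleep(1)
```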

wallyqs changed the title from "Performance degradation with slow consumers [v2.9.20, v2.10.14]" to "Performance degradation with slow consumers [v2.9.20, v2.10.14, v2.10.24]" on Jan 23, 2025
wallyqs changed the title from "Performance degradation with slow consumers [v2.9.20, v2.10.14, v2.10.24]" to "Performance degradation with slow consumers [v2.9.20, v2.10.14]" on Jan 23, 2025
wallyqs changed the title from "Performance degradation with slow consumers [v2.9.20, v2.10.14]" to "Performance degradation with slow consumers [v2.9.X, v2.10.X]" on Feb 21, 2025