Consumer runner tries to start the fetch manager without delay #1384
@Nevon do we have an ETA for a fix for this one? This issue has a huge impact on our services, as it causes the entire service to hang.
No, that's not how free, open-source projects work. If this is very important to you, you are more than welcome to spend the resources to fix the issue. I'm in the middle of my vacation, and even once I'm back, no one other than me decides what I should spend my free time on.
Sorry @Nevon if my question offended you, and sorry to have interrupted your vacation. Maybe we have different views on how open-source projects work. If I were a maintainer of an open-source project, I would certainly feel some responsibility for the quality of my product, and if an issue had a wide impact, I would probably make it a priority (if I weren't on vacation and had free time, of course).
In the issue description I shared a 1-line fix that worked for me. I didn't open a PR with that fix because I thought someone with more knowledge of the project could implement a better solution. However, if you think that fix is acceptable, I'll be pleased to open a PR.
In our debugging session, we simply set isRunning to false to exit the loop, but a better solution might be to throw an error when nodeIds returns an empty array, or to stop the run loop before the cluster becomes empty.
I've hit this problem too. I've applied a workaround using logging (horrid, hacky, but temporary) for the time being. The following approach prevents the tight loop from hanging the service without needing a patched copy of kafkajs (it needs to be tweaked for whatever logging system you're using).
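The original snippet wasn't captured in this thread; a minimal sketch of this kind of logging workaround, assuming the repeating messages come from a fetch-manager log namespace and that crashing the process (so an orchestrator restarts it) is acceptable. The namespace filter, thresholds, and exit behavior are all assumptions, not the poster's code:

```js
const { Kafka, logLevel } = require('kafkajs')

// Hypothetical tight-loop detector: if watched log entries arrive more than
// `maxHits` times within `windowMs`, assume the consumer is spinning and exit.
const tightLoopGuard = ({ windowMs = 1000, maxHits = 100 } = {}) => {
  let windowStart = Date.now()
  let hits = 0
  return () => {
    const now = Date.now()
    if (now - windowStart > windowMs) {
      windowStart = now
      hits = 0
    }
    if (++hits > maxHits) {
      console.error('kafkajs fetch manager appears stuck in a tight loop; exiting')
      process.exit(1) // crash so the process manager restarts us instead of hanging
    }
  }
}

const guard = tightLoopGuard()

const kafka = new Kafka({
  clientId: 'my-app', // hypothetical
  brokers: ['localhost:9092'], // hypothetical
  logLevel: logLevel.DEBUG, // the repeating fetch-manager messages are debug-level
  logCreator: () => ({ namespace, log }) => {
    // 'FetchManager' is an assumed namespace; match whatever message actually
    // repeats in your logs.
    if (namespace === 'FetchManager') guard()
    // ...forward { namespace, log } to your real logging system here...
  },
})
```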
Nice find, but brutal! For people who don't want to patch, this is probably a workaround until there's an official fix.
patches/kafkajs+2.0.2.patch
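The patch body itself wasn't captured here. Going by the earlier description (exit the run loop when the cluster has no nodes left), it presumably amounts to a guard like the following sketch; the import path, helper shape, and error choice are assumptions, not the literal patch:

```js
// Sketch of the guard the patch presumably adds inside
// node_modules/kafkajs/src/consumer/fetchManager.js (identifiers assumed).
// KafkaJSConnectionError lives in kafkajs' internal errors module.
const { KafkaJSConnectionError } = require('kafkajs/src/errors')

function assertFetchableNodes(getNodeIds) {
  const nodeIds = getNodeIds()
  if (nodeIds.length === 0) {
    // No brokers left: fail loudly instead of recreating zero fetchers
    // forever, so the consumer can disconnect and retry as it did in 1.16.0.
    throw new KafkaJSConnectionError('No brokers available to fetch from')
  }
  return nodeIds
}

module.exports = { assertFetchableNodes }
```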
I discovered today that we have been facing the same issue you describe, after several weeks of intermittent high CPU load. I have now tried to implement the workaround that @adripc64 describes by adding the following line:
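The line itself wasn't captured here; presumably it is the 1 ms delay from the issue description below, roughly:

```js
// added right after the call to fetchManager.start() in the consumer runner
await new Promise(resolve => setTimeout(resolve, 1))
```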
When doing this, a new issue occurs. Every ~10 seconds it reports the following messages and reconnects:
It seems like this additional 1 ms wait causes the heartbeat to be "out of sync". @adripc64 or @Nevon, is there another workaround I could implement? I will try to investigate further to find one, but if you know of any obvious ones, please shout.
I have now also tried to apply the patch described by @tugtugtug, and the same issue occurs:
AFAIK this isn't what we encountered, so the fixes wouldn't apply in your case. You may want to create a separate issue for tracking.
@tugtugtug yes, this is exactly the same for us, but with the patch applied something strange happens that produces this message. Exactly the same happened with the first workaround described by @adripc64.
The fetch manager rebalancing mechanism caused an infinite loop when there were no brokers available, causing the consumer to never become aware of any connection issues. Fixes #1384
Opened up #1402 to fix this. The issue was that the rebalancing mechanism in the FetchManager would detect that the number of available nodes had changed, and trigger the recreation of the fetchers. Because there were no available nodes, no fetchers were created, and we entered the infinite tight loop. By bypassing the rebalancing mechanism in case there are no nodes available, the fetcher ends up attempting to fetch from the now unavailable node, and a KafkaJSConnectionError bubbles up as expected, causing the consumer to try to reconnect and eventually crash - same as in 1.16.0. It's out in 2.1.0-beta.5, and will be included in 2.0.3 later this week.
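A rough sketch of the guard described above, not the literal #1402 diff; the identifiers and the node-set comparison are assumptions (KafkaJSFetcherRebalanceError is the internal error class the fetch manager uses to trigger a rebalance):

```js
const { KafkaJSFetcherRebalanceError } = require('kafkajs/src/errors')

// Sketch (assumed identifiers) of the rebalance check after the fix.
function maybeRebalance(currentNodeIds, assignedNodeIds) {
  const nodesChanged =
    currentNodeIds.length !== assignedNodeIds.length ||
    !assignedNodeIds.every(id => currentNodeIds.includes(id))

  // Only trigger a fetcher rebalance when there are nodes to rebalance onto.
  // With zero nodes we skip it; the next fetch then hits the unavailable
  // broker and a KafkaJSConnectionError bubbles up, so the consumer
  // reconnects or crashes as it did in 1.16.0.
  if (currentNodeIds.length > 0 && nodesChanged) {
    throw new KafkaJSFetcherRebalanceError()
  }
}
```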
Still facing the same problem. Does anyone have the same issue even with version 2.1.0? Am I missing something? Do I need to set something in the configuration? I tried to resolve the problem with #1428.
Describe the bug
When the FetchManager detects that a Kafka node is lost, it tries to rebalance the fetchers. If there are no Kafka nodes available, no fetchers are created. That causes the consumer runner to call FetchManager.start() in an infinite loop without any delay.
This blocks any other processing, such as responding to HTTP requests. Also, the consumer is neither disconnected properly nor able to recover when Kafka becomes available again.
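In sketch form (assumed names, not the actual runner source), the failure mode looks like this:

```js
// FetchManager.start() resolves immediately when zero fetchers were created,
// so the runner re-enters it in a tight loop. Awaiting an already-resolved
// promise only yields to microtasks, never to timers or sockets, so HTTP
// handlers starve and the process appears to hang.
async function runLoop(fetchManager, isRunning) {
  while (isRunning()) {
    await fetchManager.start()
  }
}
```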
To Reproduce
Expected behavior
The consumer should be disconnected properly and start retrying the connection.
As a hint, we managed to solve this issue by adding a simple 1 ms delay after the call to FetchManager.start(). Example:
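The example block didn't survive here; based on the description, the change amounts to roughly this (assumed names again):

```js
async function runLoop(fetchManager, isRunning) {
  while (isRunning()) {
    await fetchManager.start()
    // 1 ms delay: setTimeout schedules a macrotask, which lets the event loop
    // service I/O and timers between iterations instead of spinning.
    await new Promise(resolve => setTimeout(resolve, 1))
  }
}
```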
Observed behavior
Environment: