"Specified group generation id is not valid" after broker maintenance, consumer stops receiving events #1466

hbazan-pp · 2022-10-17T16:16:49Z

Hi, we are having an issue similar to #1009, but it happens after broker maintenance.
We have consumers running in parallel on different machines, with a heartbeat check triggered in eachBatch.
We consume multiple topics, with a specific instance of our service per topic.
All of this works fine, but we have had issues (twice already) when brokers go down for maintenance.
Some of the instances (and thus some of the topics) stop consuming events, but they neither throw errors nor crash (if they crashed, we would respawn them and everything would be OK).
We do see the error message:
[Consumer] Crash: KafkaJSNonRetriableError: Specified group generation id is not valid
But it doesn't actually crash, and the instance goes stale: it won't consume any new messages or trigger the heartbeat. If we restart the instance, it consumes all pending traffic (provided the offset is still current).
The odd thing is that some of the topics keep working fine after the maintenance, so the overall system seems to be "up" unless we check each specific topic.
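
Because a stalled instance looks healthy from the outside, one way to surface it is a per-topic staleness check. The following is only a minimal sketch under stated assumptions: `recordActivity`, `findStaleTopics`, and `lastActivity` are hypothetical names, not part of KafkaJS, and the service is assumed to call `recordActivity` whenever it sees a heartbeat or a batch for a topic.

```javascript
// Hypothetical watchdog state: topic name -> timestamp (ms) of the last
// heartbeat or batch observed for that topic.
const lastActivity = new Map();

// Call this from wherever the service observes progress for a topic
// (for example inside eachBatch, or from a heartbeat listener).
function recordActivity(topic, now = Date.now()) {
  lastActivity.set(topic, now);
}

// Returns the topics whose last recorded activity is older than maxSilenceMs.
function findStaleTopics(maxSilenceMs, now = Date.now()) {
  const stale = [];
  for (const [topic, seenAt] of lastActivity) {
    if (now - seenAt > maxSilenceMs) stale.push(topic);
  }
  return stale;
}
```

A periodic timer could call `findStaleTopics` and exit the process (so that it gets respawned) when the list is non-empty. Recording only batches would flag quiet topics as stale, so a heartbeat-based signal is probably the safer input.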


IvanRogovskiy · 2022-10-20T07:17:06Z

I have pretty much the same thing. I have a connection to 11 topics, and when I start receiving messages I see the logs below:

{"level":"WARN","timestamp":"2022-10-05T08:27:56.258Z","logger":"kafkajs","message":"[ConsumerGroup] Topic has been updated, resync group"


{"level":"ERROR","timestamp":"2022-10-05T08:27:58.856Z","logger":"kafkajs","message":"[Connection] Response SyncGroup(key: 14, version: 3)","error":"Specified group generation id is not valid","correlationId":87,"size":14}

After that comes the message that the consumer has been stopped. Increasing the heartbeat interval and sessionTimeout didn't help.

alldayalone · 2022-11-02T07:49:14Z

Same thing for us

Nov 2, 2022 @ 09:31:39.581 [error]: [Consumer] Response Heartbeat(key: 12, version: 2) {"broker":"xxx","clientId":"xxx","error":"Specified group generation id is not valid","correlationId":14,"size":10,}
Nov 2, 2022 @ 09:31:44.532 [error]: [Consumer] Crash: KafkaJSNonRetriableError: Specified group generation id is not valid {"stack":"KafkaJSNonRetriableError: Specified group generation id is not valid\n    at ..."}
Nov 2, 2022 @ 09:31:44.538 [info]: [Consumer] Consumer has crashed {"type":"consumer.crash","payload":{"error":{"name":"KafkaJSNonRetriableError","retriable":false,"cause":{"name":"KafkaJSProtocolError","retriable":false,"type":"ILLEGAL_GENERATION","code":22}},"restart":false}}
Nov 2, 2022 @ 09:31:44.538 [info]: [Consumer] Consumer has disconnected {"type":"consumer.disconnect"}
Nov 2, 2022 @ 09:31:44.538 [info]: [Consumer] Consumer has stopped {"type":"consumer.stop"}

After that it just hangs until manually restarted.

This happened at the end of (or right after) the AWS Kafka "Heal cluster" maintenance.
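
The crash payload in the log above carries enough detail to tell this failure apart from other crashes. As a minimal sketch (the helper names `rootCause` and `isIllegalGenerationCrash` are hypothetical, and the payload shape is assumed to match the logged `consumer.crash` event), one could walk the `cause` chain and check the protocol error type:

```javascript
// Walks the `cause` chain of a crash error, as seen in the logged
// "consumer.crash" payload, and returns the innermost error.
function rootCause(error) {
  let current = error;
  while (current && current.cause) current = current.cause;
  return current;
}

// True when the crash was ultimately caused by ILLEGAL_GENERATION
// (error code 22 in the log above).
function isIllegalGenerationCrash(event) {
  const cause = rootCause(event.payload.error);
  return Boolean(cause) && (cause.type === "ILLEGAL_GENERATION" || cause.code === 22);
}
```

Such a check could be used in a crash listener to decide whether to alert, exit, or attempt a restart.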

jakewins · 2022-11-02T10:42:11Z

Ran into this as well, proposed fix: #1474

h0od · 2022-11-14T05:41:20Z

I've also encountered this. Rejoin should be correct in this case.

ErlendFax · 2022-11-22T09:41:06Z

We are seeing the same thing after a GKE update.

Does anyone know a workaround while we wait?

rpastore-wolt · 2022-11-28T12:16:04Z

> We are seeing the same thing after a GKE update.
>
> Does anyone know a workaround while we wait?

@ErlendFax have you found a workaround other than manually restarting the consumer?

> I've also encountered this. Rejoin should be correct in this case.

@h0od when you say rejoin, should the library handle it, or should it be done within the consumer code?

thanks 🙏

ErlendFax · 2022-11-28T12:21:04Z

We have not. Just hoping it won't fail again. I'm interested in a workaround/solution as well.

h0od · 2022-11-28T18:10:03Z

> @h0od when you say rejoin, should the library handle it, or should it be done within the consumer code?

The library should try to rejoin, exactly like it does when the group is rebalancing.
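
To illustrate the idea (this is a sketch, not the library's actual code): treat ILLEGAL_GENERATION like the other "group changed under us" errors and rejoin instead of crashing. `shouldRejoin`, `heartbeatWithRejoin`, and the `group` object with `heartbeat()` and `joinAndSync()` are hypothetical stand-ins, and the two other error type names are assumed to match KafkaJS's protocol error types.

```javascript
// Error types that, in this sketch, mean "rejoin the group".
// ILLEGAL_GENERATION is the addition discussed in this thread; the other
// two names are assumptions about how the library labels these errors.
const REJOIN_ERROR_TYPES = new Set([
  "REBALANCE_IN_PROGRESS",
  "NOT_COORDINATOR_FOR_GROUP",
  "ILLEGAL_GENERATION",
]);

function shouldRejoin(error) {
  return REJOIN_ERROR_TYPES.has(error.type);
}

// Hypothetical wrapper: `group` stands in for anything exposing
// heartbeat() and joinAndSync().
async function heartbeatWithRejoin(group) {
  try {
    await group.heartbeat();
  } catch (error) {
    if (!shouldRejoin(error)) throw error; // anything else still propagates
    await group.joinAndSync(); // rejoining yields a fresh generation id
  }
}
```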

vpriem · 2022-12-13T13:47:17Z

Same here as well: nodes are being rotated, and then the consumer just stops consuming:

[Connection]: Response Fetch(key: 1, version: 11): This server is not the leader for that topic-partition
[Connection]: Response SyncGroup(key: 14, version: 3): This is not the correct coordinator for this group
[Connection]: Response JoinGroup(key: 11, version: 5): The coordinator is loading and hence can't process requests for this group
[Connection]: Response Heartbeat(key: 12, version: 3): Specified group generation id is not valid
... retries
[Consumer]: Crash: KafkaJSNonRetriableError: Specified group generation id is not valid
[Consumer]: Stopped

I think ILLEGAL_GENERATION should be treated as retriable in KafkaJS, so that the consumer gets restarted via restartOnFailure.

guiestimoneon · 2022-12-23T18:02:46Z

Hello guys

I am having this issue when I scale my application horizontally. The pod is processing normally and out of nowhere I get this error:

I suspect a rebalance has occurred and the pod still tries to commit a message. I'm using the .NET library.

ErlendFax · 2023-01-09T10:53:56Z

As a workaround, one could try something like this:

kafkaClient.consumer.on("consumer.crash", (event) => {
  if (event.payload.error.name === "KafkaJSNonRetriableError") {
    process.exit(1); // will initiate a k8s restart

    // ... or do something else like reconnecting and starting `run` again ...
  }
});
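
The "reconnect and start `run` again" alternative could look roughly like the sketch below. It has not been tested against a live cluster. `restartOnUnrecoveredCrash`, `runConfig`, and `delayMs` are hypothetical names; the sketch assumes the crash payload carries the `restart` flag seen in the logs earlier in this thread, and that calling `connect()` and `run()` again on a crashed consumer is acceptable in your setup.

```javascript
// Hypothetical helper: `consumer` is a KafkaJS consumer (or anything with
// on/connect/run), and `runConfig` is whatever was originally passed to run().
function restartOnUnrecoveredCrash(consumer, runConfig, delayMs = 5000) {
  return consumer.on("consumer.crash", (event) => {
    // The payload's `restart` flag is true when the library restarts on its own.
    if (event.payload.restart) return;

    // Wait a little so the brokers have a chance to settle after maintenance.
    setTimeout(async () => {
      try {
        await consumer.connect();
        await consumer.run(runConfig);
      } catch (err) {
        // Fall back to the original workaround: exit and let the
        // supervisor (for example k8s) respawn the process.
        console.error("Manual consumer restart failed", err);
        process.exit(1);
      }
    }, delayMs);
  });
}
```

Compared with exiting the process right away, this keeps the instance alive across a maintenance window, but exiting remains the simpler and more predictable option when a supervisor is available.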

jakewins mentioned this issue Nov 2, 2022

Consider ILLEGAL_GENERATION error as rebalancing error #1474

Merged

jgoldsmith613 mentioned this issue Jan 25, 2023

Expose KafkaJS Event Listeners nestjs/nest#10950

Closed

emasab mentioned this issue Feb 27, 2023

"Specified group generation id is not valid" error should be retried #1534

Closed

Nevon closed this as completed in #1474 Feb 27, 2023

oleh-poberezhets mentioned this issue Aug 16, 2024

Crash: KafkaJSNonRetriableError: Specified group generation id is not valid #1712

Open
