-
Notifications
You must be signed in to change notification settings - Fork 208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The initial ACK deadline set for the messages received could be too low #413
Comments
@davidbraden Thanks for the research and the question. Is there an actual bug that you are experiencing or did you just see a possible issue while reading through the code? A streaming pull manager is instantiated when a new subscription is started. At that point the histogram is empty (no messages ACK-ed yet) and just returns the default (10 seconds). The same value would also be returned if the manager read the the ack_deadline property instead (the initial _ack_deadline is also set to 10). But regardless of how that value is obtained, it's just the initial value to use until enough new histogram data is available. The leaser thread itself always uses the manager.ack_deadline property, which is automatically adjustable. Does this answer your question? Let me know if I can help some more. |
Hi, I'm seeing an issue with messages retrying due to "Expired Ack Deadline". I'm not totally sure of the root cause but this seemed a possibility. The other logs don't indicate we are nacking any messages or that any leases are expiring. My understanding from reading the code was |
Does this happens often, or is just on occasional event observed in the logs? If the latter, it could just be requests sometimes getting lost due to random network hiccups. For example, a client might send an ACK, but the server would not receive it. the message would thus expire, and the servers would re-send it again. Applications must therefore expect that a message can be re-delivered even if they have processed it successfully (from their perspective).
That's correct. Whenever a new batch of messages is received, it is handled by
I see where you're getting at. The "live" p99 value initially used in Congratulations, you found a bug! I also spotted another thing in |
A possible scenario how this bug could occur: This is the code that decides whether the Suppose that 63 messages have been ACK-ed so far, which is just above the previous
Then, 37 more messages arrive and are processed in sub 10 seconds:
The live p99 value drops to 10 seconds, but the The leaser goes to sleep for If processing these messages takes longer than that, they will expire, because the leaser will not wake up before 27 seconds, and will be too late to send the MODACK request. |
Fixes #413. This PR makes sure that the ACK deadline set for the received messages is always consistent with what the leaser uses internally when extending the ACK deadlines for the leased messages. See the issue description and a [comment](#413 (comment)) explaining a possible sequence of events that lead to a bug. **PR checklist** - [x] Make sure to open an issue as a [bug/issue](https://github.com/googleapis/python-pubsub/issues/new/choose) before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea - [x] Ensure the tests and linter pass - [x] Code coverage does not decrease (if any source code was changed) - [x] Appropriate docs were updated (if necessary)
Is there a reason that https://github.com/googleapis/python-pubsub/blob/master/google/cloud/pubsub_v1/subscriber/_protocol/streaming_pull_manager.py#L493 calls
self.ack_histogram.percentile(99)
directly rather than using https://github.com/googleapis/python-pubsub/blob/master/google/cloud/pubsub_v1/subscriber/_protocol/streaming_pull_manager.py#L225 which is what the Leaser thread uses for how long it sleeps between updates?I think this is causing a bug where the the average message time decrease and so the 99th percentile decreases quicker than the "sticky"
ack_deadline
method. The initial ack_deadline is then shorter than the time between Leaser refreshes and the messages expire before the lease is renewed.The text was updated successfully, but these errors were encountered: