The initial ACK deadline set for the messages received could be too low #413

Closed
davidbraden opened this issue May 17, 2021 · 4 comments · Fixed by #416

Comments

@davidbraden

Is there a reason that https://github.com/googleapis/python-pubsub/blob/master/google/cloud/pubsub_v1/subscriber/_protocol/streaming_pull_manager.py#L493 calls self.ack_histogram.percentile(99) directly rather than using https://github.com/googleapis/python-pubsub/blob/master/google/cloud/pubsub_v1/subscriber/_protocol/streaming_pull_manager.py#L225, which is what the Leaser thread uses to decide how long to sleep between updates?

I think this is causing a bug: when the average message processing time decreases, the 99th percentile drops more quickly than the "sticky" ack_deadline property does. The initial ack_deadline can then be shorter than the time between Leaser refreshes, and the messages expire before the lease is renewed.

@product-auto-label bot added the api: pubsub label on May 17, 2021
@plamut added the type: question label on May 17, 2021
@plamut
Contributor

plamut commented May 18, 2021

@davidbraden Thanks for the research and the question. Is there an actual bug that you are experiencing or did you just see a possible issue while reading through the code?

A streaming pull manager is instantiated when a new subscription is started. At that point the histogram is empty (no messages ACK-ed yet) and just returns the default (10 seconds). The same value would also be returned if the manager read the ack_deadline property instead (the initial _ack_deadline is also set to 10).

But regardless of how that value is obtained, it's just the initial value to use until enough new histogram data is available. The leaser thread itself always uses the manager.ack_deadline property, which is automatically adjustable.
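
To illustrate with a minimal, made-up sketch (illustrative only, not the actual streaming_pull_manager internals): with no ACK timings recorded yet, the live percentile and the cached deadline both come out as the same 10-second default.

```python
import math

DEFAULT_ACK_DEADLINE = 10  # seconds, used until ACK timing data accumulates

def percentile_or_default(ack_times, percent=99):
    # Illustrative stand-in for the histogram lookup; not the real library code.
    if not ack_times:                                        # empty histogram -> default
        return DEFAULT_ACK_DEADLINE
    ordered = sorted(ack_times)
    rank = max(1, math.ceil(len(ordered) * percent / 100))   # nearest-rank percentile
    return ordered[rank - 1]

print(percentile_or_default([]))   # 10, the same value as the initial _ack_deadline
```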

Does this answer your question? Let me know if I can help some more.

@davidbraden
Author

Hi, I'm seeing an issue with messages being retried due to "Expired Ack Deadline". I'm not totally sure of the root cause, but this seemed like a possibility. The other logs don't indicate that we are nacking any messages or that any leases are expiring.

My understanding from reading the code was that _on_response is called for each batch of new messages received. The initial deadline is 10, and it is then modified to self.ack_histogram.percentile(99). The leaser might not run for up to 0.9 * ack_deadline, which could be larger than self.ack_histogram.percentile(99) if the subscriber has been running for a while and there have been lots of slow messages followed by some quick ones. This would mean a message could exceed its deadline before it is renewed.
Am I missing something?
Thanks

@plamut
Contributor

plamut commented May 20, 2021

I'm seeing an issue with messages being retried due to "Expired Ack Deadline". I'm not totally sure of the root cause, but this seemed like a possibility.

Does this happen often, or is it just an occasional event observed in the logs? If the latter, it could just be requests occasionally getting lost due to random network hiccups. For example, a client might send an ACK, but the server would not receive it; the message would thus expire, and the server would re-send it. Applications must therefore expect that a message can be re-delivered even if they have processed it successfully (from their perspective).

My understanding from reading the code was that _on_response is called for each batch of new messages received. The initial deadline is 10, and it is then modified to self.ack_histogram.percentile(99).

That's correct. Whenever a new batch of messages is received, it is handled by _on_response(), and the first thing the method does is set the ACK deadline using the historical ACK data.

The leaser might not run for up to 0.9 * ack_deadline, which could be larger than self.ack_histogram.percentile(99) if the subscriber has been running for a while and there have been lots of slow messages followed by some quick ones. This would mean a message could exceed its deadline before it is renewed.

I see what you're getting at. The "live" p99 value initially used in _on_response() might have already decreased, while the more sticky manager.ack_deadline used by the leaser has not been updated yet. The leaser might go to sleep for 0.9 * manager.ack_deadline seconds just before a new message arrives, and that message's ACK deadline will be set to the live p99 value. If processing that message takes too long, the leaser will not wake up in time to extend the ACK deadline.
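
With made-up numbers, the timing looks roughly like this:

```python
# Made-up numbers illustrating the race described above.
sticky_deadline = 30                     # manager.ack_deadline, based on older, slower ACKs
live_p99 = 10                            # current ack_histogram.percentile(99)

leaser_sleep = 0.9 * sticky_deadline     # leaser naps for 27 s before the next refresh
message_deadline = live_p99              # deadline given to a message arriving just after

processing_time = 15                     # seconds the application needs for this message
expired = processing_time > message_deadline and leaser_sleep > message_deadline
print(expired)   # True: the 10 s deadline passes long before the leaser wakes up
                 # to send a modifyAckDeadline request
```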

Congratulations, you found a bug!
Applications whose message processing times vary significantly are more likely to be affected.

I also spotted another thing in ack_deadline(): it does not check whether max_duration_per_lease_extension is at least 10 seconds, which is the server-side minimum. AFAIK the backend auto-corrects values that are too low, but it would still be nice if the client always sent a correct value in the first place.
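
Something along these lines would do (just a sketch of the missing check; the helper name is made up, and 10 seconds is the server-side minimum mentioned above):

```python
MIN_ACK_DEADLINE = 10  # seconds, server-side minimum

def capped_deadline(histogram_p99, max_duration_per_lease_extension=0):
    # Sketch only: cap the deadline by the user-configured maximum, but never
    # request a value below the server-side minimum, even if the user set an
    # overly aggressive max_duration_per_lease_extension.
    deadline = histogram_p99
    if max_duration_per_lease_extension > 0:
        deadline = min(deadline, max(max_duration_per_lease_extension, MIN_ACK_DEADLINE))
    return deadline

print(capped_deadline(25, max_duration_per_lease_extension=5))   # 10, not 5
```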

@plamut self-assigned this on May 20, 2021
@plamut changed the title from "Incorrect ack_deadline setting" to "The initial ACK deadline set for the messages received could be too low" on May 20, 2021
@plamut added the priority: p2 and type: bug labels and removed the type: question label on May 20, 2021
@plamut
Contributor

plamut commented May 20, 2021

A possible scenario in which this bug could occur:

The relevant logic is the check that decides whether the ack_deadline value used by the leaser should be updated with the new 99th percentile computed from the histogram data.
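
Roughly, it behaves like this (a simplified sketch; the names and the exact threshold formula are assumptions that just reproduce the numbers below):

```python
class StickyDeadlineSketch:
    def __init__(self, histogram, default=10):
        self.histogram = histogram           # anything supporting len() and .percentile()
        self._ack_deadline = default
        self._last_histogram_size = 0

    @property
    def ack_deadline(self):
        # Only refresh once the histogram has roughly doubled in size since the
        # last update; otherwise keep returning the cached ("sticky") value.
        target = min(self._last_histogram_size * 2, self._last_histogram_size + 100)
        if len(self.histogram) > target:
            self._last_histogram_size = len(self.histogram)
            self._ack_deadline = self.histogram.percentile(99)
        return self._ack_deadline
```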

Suppose that 63 messages have been ACK-ed so far, which is just above the previous _last_histogram_size threshold of 62. If one of the messages was processed in 30 seconds and the rest in <10, the state is the following:

  • Histogram: data = {30: 1, 10: 62}; length = 63; p99 = 30.
  • Streaming pull manager: _ack_deadline = 30, _last_histogram_size = 63.

Then, 37 more messages arrive and are processed in under 10 seconds:

  • Histogram: data = {30: 1, 10: 99}; length = 100; p99 = 10.
  • Streaming pull manager: _ack_deadline = 30, _last_histogram_size = 63.

The live p99 value drops to 10 seconds, but manager.ack_deadline has not been updated yet (it will only be updated once 127 messages have been ACK-ed), thus the leaser still uses 30 seconds.
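
The same numbers can be reproduced with a simple nearest-rank percentile (a stand-in for the library's bucketed histogram, which may differ in detail but gives the same values here):

```python
import math

def p99(samples):
    # Nearest-rank 99th percentile over the recorded ACK times.
    ordered = sorted(samples)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]

samples = [10] * 62 + [30]           # 62 fast messages plus one 30-second one
print(len(samples), p99(samples))    # 63 30  -> _ack_deadline is updated to 30

samples += [10] * 37                 # 37 more messages, all ACK-ed quickly
print(len(samples), p99(samples))    # 100 10 -> the live p99 drops, but
                                     # _ack_deadline stays at 30 until 127 samples
```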

The leaser goes to sleep for 0.9 * 30 = 27.0 seconds, and right after that a new batch of messages arrives. The _on_response() handler will set their ACK deadline to the current histogram p99 value, i.e. 10 seconds.

If processing these messages takes longer than that, they will expire, because the leaser will not wake up for another 27 seconds and will thus be too late to send the MODACK request.

gcf-merge-on-green bot pushed a commit that referenced this issue May 26, 2021
Fixes #413.

This PR makes sure that the ACK deadline set for the received messages is always consistent with what the leaser uses internally when extending the ACK deadlines for the leased messages.
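
In spirit (not the actual diff), the change amounts to this:

```python
# In spirit (not the actual diff): the handler for newly received messages now
# derives the initial deadline from the same "sticky" value the leaser relies on.
def initial_deadline(manager):
    return manager.ack_deadline                         # leaser-consistent value
    # previously: manager.ack_histogram.percentile(99)  # live value, can be lower
```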

See the issue description and a [comment](#413 (comment)) explaining a possible sequence of events that leads to the bug.

**PR checklist**
- [x] Make sure to open an issue as a [bug/issue](https://github.com/googleapis/python-pubsub/issues/new/choose) before writing your code!  That way we can discuss the change, evaluate designs, and agree on the general idea
- [x] Ensure the tests and linter pass
- [x] Code coverage does not decrease (if any source code was changed)
- [x] Appropriate docs were updated (if necessary)