Fix blob batch persistence #369

Merged: 12 commits, Apr 16, 2024

sebastianburckhardt
Member

Fixes #363.

Previously, the code decided whether a blob batch was safe to delete by checking whether the batch's last update event had been persisted. However, the batch can still contain read events after that update event, so the blob could be deleted too early, and a later attempt to fetch it would hit a missing blob exception. This was observed in #363.

A similar mechanism was also used to determine whether a batch needed to be kept around for redelivery when reincarnating a partition. It had the same problem: a batch could be removed from the redelivery queue too early.

This PR fixes and simplifies this by
a) precisely tracking the persistence state of a batch by using the full position tuple (seqno, batchpos) in the redelivery queue, as opposed to only the seqno. Read events can thus stay in the queue until later write events commit.
b) deleting the blobs from storage at the same time as the batch is removed from the redelivery queue.
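
The two points above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `QueuedBatch`, `RedeliveryQueue`, `OnPersisted`, and the `deleteBlob` callback are hypothetical names, and the sketch assumes that the persisted position is reported as the next input queue position (seqno, batchpos), meaning every event strictly before it is durable.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; every type and member name here is hypothetical.
public sealed class QueuedBatch
{
    public QueuedBatch(long seqNo, int eventCount, string blobName)
    {
        this.SeqNo = seqNo;
        this.EventCount = eventCount;
        this.BlobName = blobName;
    }

    public long SeqNo { get; }
    public int EventCount { get; }
    public string BlobName { get; }
}

public sealed class RedeliveryQueue
{
    readonly Queue<QueuedBatch> batches = new Queue<QueuedBatch>();
    readonly Action<string> deleteBlob;

    public RedeliveryQueue(Action<string> deleteBlob)
    {
        this.deleteBlob = deleteBlob;
    }

    public int Count => this.batches.Count;

    public void Enqueue(QueuedBatch batch) => this.batches.Enqueue(batch);

    // Assumption: nextPosition is the first position that is NOT yet persisted.
    public void OnPersisted((long SeqNo, int BatchPos) nextPosition)
    {
        while (this.batches.Count > 0)
        {
            QueuedBatch head = this.batches.Peek();

            // Position just past the last event of the head batch.
            (long, int) endOfBatch = (head.SeqNo, head.EventCount);

            // Lexicographic tuple comparison: keep the batch while any of its
            // events (including trailing read events) is not yet covered.
            if (endOfBatch.CompareTo(nextPosition) > 0)
            {
                break;
            }

            // Remove from the queue and delete the blob in the same step.
            this.batches.Dequeue();
            this.deleteBlob(head.BlobName);
        }
    }
}
```

In this sketch, a batch with seqno 5 and four events stays queued when the persisted position is (5, 2), and is removed, with its blob deleted, only once a later write moves the position to (6, 1) or beyond. A comparison on the seqno alone could not tell those two cases apart.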

@sebastianburckhardt added this to the 1.5.1 milestone Mar 19, 2024
@sebastianburckhardt changed the milestone from 1.5.1 to 1.4.3 Mar 29, 2024
@sebastianburckhardt marked this pull request as ready for review April 4, 2024 19:45
@sebastianburckhardt
Member Author

I am still running more stress tests but it looks like the latest version of this now fixes the problem.

@davidmrdavid self-requested a review April 8, 2024 21:19
Member

@davidmrdavid left a comment

Left some questions

@@ -759,7 +759,7 @@ async ValueTask ProcessUpdate(PartitionUpdateEvent partitionUpdateEvent)
     // (note that it may not be the very next in the sequence since readonly events are not persisted in the log)
     if (partitionUpdateEvent.NextInputQueuePosition > 0 && partitionUpdateEvent.NextInputQueuePositionTuple.CompareTo(this.InputQueuePosition) <= 0)
     {
-        this.partition.ErrorHandler.HandleError(nameof(ProcessUpdate), "Duplicate event detected", null, false, false);
+        this.partition.ErrorHandler.HandleError(nameof(ProcessUpdate), $"Duplicate event detected: #{partitionUpdateEvent.NextInputQueuePositionTuple}", null, true, false);
Member

nit: in the future I'd love for us to specify the parameter names of these last 3 arguments (and similar invocations). Just null, true, false isn't too descriptive :-) . But I'm sure we can tackle that in a future PR, it's not a blocker
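
For illustration, C# named arguments would make such a call self-describing. The sketch below uses a hypothetical stand-in: `ErrorHandlerSketch` and the parameter names `where`, `message`, `e`, `terminatePartition`, and `reportAsWarning` are assumptions made for this example, not names confirmed by the diff above.

```csharp
using System;

// Hypothetical stand-in for an error handler; the parameter names are assumptions.
public static class ErrorHandlerSketch
{
    public static string HandleError(
        string where,
        string message,
        Exception e,
        bool terminatePartition,
        bool reportAsWarning)
    {
        return $"{where}: {message} (terminate={terminatePartition}, warning={reportAsWarning})";
    }

    // Positional style: the meaning of null, true, false is not visible at the call site.
    public static string Positional()
    {
        return HandleError("ProcessUpdate", "Duplicate event detected", null, true, false);
    }

    // Named style: the same call, with the last three arguments labeled.
    public static string Named()
    {
        return HandleError(
            "ProcessUpdate",
            "Duplicate event detected",
            e: null,
            terminatePartition: true,
            reportAsWarning: false);
    }
}
```

Both calls pass identical values; only the readability of the call site differs.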

Comment on lines 126 to 127
// a download can fail if the lease is lost and the next owner processes and then deletes it first
throw new OperationCanceledException("blob already deleted", exception, token);
Member

we have no way of knowing if this is truly what happened, right? As in - no record of some given VM deleting the blob, instead of it disappearing for some other reason.

Member Author

Good point. I am updating the error message to be a bit more clear.

@sebastianburckhardt changed the milestone from 1.4.3 to 1.5.0 Apr 12, 2024
@sebastianburckhardt added the bug and P1 labels Apr 12, 2024
@sebastianburckhardt
Member Author

sebastianburckhardt commented Apr 16, 2024

In the overnight tests I still saw some "failed to read blob" errors. I think I finally understand why this is happening (sigh): since EH (Event Hubs) can duplicate events internally (as I discovered recently, see #379), it is redelivering a message that was successfully delivered earlier and whose blob was already deleted!

This means I need to handle the "missing blob" situation like a duplicate delivery, i.e. I must ignore it with a warning instead of throwing an error.
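
A rough sketch of that handling, under stated assumptions: `BlobBatchSketch`, `Receive`, the `fetchBlob` delegate (assumed to return null when the blob no longer exists), and the `warn` callback are all hypothetical and only illustrate the idea of yielding an empty batch instead of throwing.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; all names here are hypothetical.
public static class BlobBatchSketch
{
    public static IEnumerable<(long SeqNo, string[] Events)> Receive(
        IEnumerable<(long SeqNo, string BlobName)> deliveries,
        Func<string, string[]> fetchBlob,
        Action<string> warn)
    {
        foreach (var (seqNo, blobName) in deliveries)
        {
            // Assumption: fetchBlob returns null if the blob is missing.
            string[] events = fetchBlob(blobName);

            if (events == null)
            {
                // Treat a missing blob like a duplicate delivery:
                // warn, then yield an empty batch so nothing is processed.
                warn($"Ignoring delivery {seqNo}: blob '{blobName}' not found");
                yield return (seqNo, Array.Empty<string>());
            }
            else
            {
                yield return (seqNo, events);
            }
        }
    }
}
```

The sequence number is still reported for the empty batch, so the caller can advance its position without processing any events.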

Member

@davidmrdavid left a comment

Just a quick question

Comment on lines 147 to 148
yield return (eventData, new TEvent[0], seqno, null);
}
Member

What does this return value represent? Especially confused about this TEvent[0]

Member Author

it is an empty batch... no events should be processed

…ceiver.cs

Co-authored-by: David Justo <david.justo.1996@gmail.com>
@sebastianburckhardt merged commit ab706eb into main Apr 16, 2024
1 of 2 checks passed
Labels
bug (Something isn't working), P1 (Priority 1)
Development

Successfully merging this pull request may close these issues.

Partition becomes unresponsive because event batch blob was deleted prematurely
2 participants