Fix blob batch persistence #369

sebastianburckhardt · 2024-03-19T21:32:19Z

Fixes #363.

Previously, the code decided whether it was safe to delete a blob batch by checking whether the batch's last update event had been persisted. However, read events can still follow that update event, so the blob could be deleted too early, and a later attempt to fetch it would hit a missing blob exception. This was observed in #363.

A similar mechanism was also used to determine whether a batch needed to be kept around for redelivery when reincarnating a partition. This has the same problems: a batch could be removed from the redelivery queue too early.

This PR fixes the problem and simplifies the mechanism by
a) precisely tracking the persistence state of a batch using the full position tuple (seqno, batchpos) in the redelivery queue, as opposed to only the seqno. Read events can thus stay in the queue until later write events commit.
b) deleting the blobs from storage at the same time the batch is removed from the redelivery queue.
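
The two points above can be illustrated with a minimal sketch. It is not the code from this PR: the class name `RedeliveryQueueSketch`, the `deleteBlob` callback, and the meaning of `nextPosition` (the position of the first event not yet covered by persisted state) are all assumptions made for illustration. Positions are plain `(long, int)` tuples, which compare lexicographically.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; names and semantics are assumptions, not the Netherite implementation.
public sealed class RedeliveryQueueSketch
{
    // Each entry records the position of the LAST event in a batch and the blob that stores the batch.
    readonly Queue<((long SeqNo, int BatchPos) Last, string BlobName)> pending
        = new Queue<((long SeqNo, int BatchPos) Last, string BlobName)>();

    readonly Action<string> deleteBlob;

    public RedeliveryQueueSketch(Action<string> deleteBlob)
    {
        this.deleteBlob = deleteBlob;
    }

    public int Count => this.pending.Count;

    public void Enqueue((long SeqNo, int BatchPos) last, string blobName)
    {
        this.pending.Enqueue((last, blobName));
    }

    // nextPosition is assumed to be the first position NOT yet covered by persisted state.
    public void Confirm((long SeqNo, int BatchPos) nextPosition)
    {
        // A batch is removed only when its last event lies strictly before nextPosition.
        while (this.pending.Count > 0 && this.pending.Peek().Last.CompareTo(nextPosition) < 0)
        {
            var entry = this.pending.Dequeue();

            // The blob is deleted at the same moment the batch leaves the queue.
            if (entry.BlobName != null)
            {
                this.deleteBlob(entry.BlobName);
            }
        }
    }
}
```

With a batch whose last event is at (7, 2), `Confirm((7, 2))` keeps the batch and its blob, while `Confirm((8, 0))` removes both. A comparison on the seqno alone cannot tell these two cases apart.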

sebastianburckhardt · 2024-04-04T19:46:31Z

I am still running more stress tests, but it looks like the latest version now fixes the problem.

davidmrdavid

Left some questions

davidmrdavid · 2024-04-08T21:53:55Z

src/DurableTask.Netherite/StorageLayer/Faster/StoreWorker.cs

@@ -759,7 +759,7 @@ async ValueTask ProcessUpdate(PartitionUpdateEvent partitionUpdateEvent)
            // (note that it may not be the very next in the sequence since readonly events are not persisted in the log)
            if (partitionUpdateEvent.NextInputQueuePosition > 0 && partitionUpdateEvent.NextInputQueuePositionTuple.CompareTo(this.InputQueuePosition) <= 0)
            {
-                this.partition.ErrorHandler.HandleError(nameof(ProcessUpdate), "Duplicate event detected", null, false, false);
+                this.partition.ErrorHandler.HandleError(nameof(ProcessUpdate), $"Duplicate event detected: #{partitionUpdateEvent.NextInputQueuePositionTuple}", null, true, false);


nit: in the future I'd love for us to specify the parameter names of these last 3 arguments (and in similar invocations). Just null, true, false isn't too descriptive :-). But I'm sure we can tackle that in a future PR; it's not a blocker.
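
For illustration, the suggestion above amounts to using C# named arguments. The sketch below uses a hypothetical stand-in class; the parameter names `e`, `terminatePartition`, and `reportAsWarning` are assumptions and may not match the real handler's signature.

```csharp
using System;

// Hypothetical stand-in for the partition error handler; the parameter names are assumed.
public sealed class ErrorHandlerSketch
{
    public string LastReport { get; private set; }

    public void HandleError(string where, string message, Exception e, bool terminatePartition, bool reportAsWarning)
    {
        this.LastReport = $"{where}: {message} (terminate={terminatePartition}, warning={reportAsWarning})";
    }

    public static ErrorHandlerSketch Demo()
    {
        var handler = new ErrorHandlerSketch();

        // Named arguments document the intent of each trailing argument at the call site.
        handler.HandleError(
            "ProcessUpdate",
            "Duplicate event detected",
            e: null,
            terminatePartition: true,
            reportAsWarning: false);

        return handler;
    }
}
```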

davidmrdavid · 2024-04-08T21:55:51Z

src/DurableTask.Netherite/TransportLayer/EventHubs/BlobBatchReceiver.cs

+                        // a download can fail if the lease is lost and the next owner processes and then deletes it first
+                        throw new OperationCanceledException("blob already deleted", exception, token);


We have no way of knowing whether this is truly what happened, right? That is, there is no record of a given VM deleting the blob, as opposed to it disappearing for some other reason.

sebastianburckhardt · 2024-04-11

Good point. I am updating the error message to be a bit clearer.

sebastianburckhardt · 2024-04-16T16:07:38Z

In the overnight tests I still saw some "failed to read blob" errors. I think I finally understand why this is happening (sigh): since EH can duplicate events internally (as I discovered recently, see #379), it is redelivering a message that was successfully delivered earlier and whose blob was already deleted!

This means I need to handle the "missing blob" situation like a duplicate delivery, i.e. I must ignore it with a warning instead of throwing an error.

davidmrdavid

Just a quick question

davidmrdavid · 2024-04-16T18:00:34Z

src/DurableTask.Netherite/TransportLayer/EventHubs/BlobBatchReceiver.cs

+                        yield return (eventData, new TEvent[0], seqno, null);
+                    }


What does this return value represent? Especially confused about this TEvent[0]

sebastianburckhardt · 2024-04-16

It is an empty batch: no events should be processed.
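
The idea of treating a missing blob like a duplicate delivery can be sketched as follows. This is not the code from `BlobBatchReceiver.cs`: the `TryDownload` delegate, the `warn` callback, and the use of strings as events are hypothetical simplifications. Yielding an empty batch rather than skipping the message presumably lets the consumer still observe the sequence number, but that motivation is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; the types and names are assumptions, not the Netherite implementation.
public static class MissingBlobSketch
{
    // Stand-in for a blob download; returns null when the blob no longer exists.
    public delegate string[] TryDownload(string blobName);

    public static IEnumerable<(string BlobName, string[] Events, long SeqNo)> Receive(
        IEnumerable<(string BlobName, long SeqNo)> messages,
        TryDownload download,
        Action<string> warn)
    {
        foreach (var (blobName, seqNo) in messages)
        {
            string[] events = download(blobName);

            if (events == null)
            {
                // Missing blob: warn and yield an empty batch instead of throwing.
                warn($"blob {blobName} not found; treating seqno {seqNo} as a duplicate delivery");
                yield return (blobName, new string[0], seqNo);
                continue;
            }

            yield return (blobName, events, seqNo);
        }
    }
}
```

A consumer that loops over `Events` then does nothing for the empty batch, which matches the "no events should be processed" behavior described above.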

draft

59ed394

sebastianburckhardt added this to the 1.5.1 milestone Mar 19, 2024

sebastianburckhardt modified the milestones: 1.5.1, 1.4.3 Mar 29, 2024

fix handling of batchposition, and tolerate race condition for deleted blob

5ec7039

sebastianburckhardt marked this pull request as ready for review April 4, 2024 19:45

davidmrdavid self-requested a review April 8, 2024 21:19

davidmrdavid reviewed Apr 8, 2024

sebastianburckhardt added 2 commits April 11, 2024 09:22

address PR feedback

3449be2

fix incorrect handling of skipped events

0ec23d9

sebastianburckhardt modified the milestones: 1.4.3, 1.5.0 Apr 12, 2024

sebastianburckhardt added the bug and P1 labels Apr 12, 2024

sebastianburckhardt added 5 commits April 12, 2024 15:21

fix check that removes confirmed events and add detail-level tracing

29b2eb6

fix broken trace statement

3fa1769

fix handling of blob deleted

854c989

add more tracing to blob download

48a1fcc

do not process any more events if already shutting down

87af75d

davidmrdavid approved these changes Apr 15, 2024

address PR feedback (add comments)

58eabed

fix handling of missing blobs

cd2850a

davidmrdavid reviewed Apr 16, 2024

davidmrdavid approved these changes Apr 16, 2024

Update src/DurableTask.Netherite/TransportLayer/EventHubs/BlobBatchReceiver.cs

db50ba9

Co-authored-by: David Justo <david.justo.1996@gmail.com>

sebastianburckhardt merged commit ab706eb into main Apr 16, 2024
1 of 2 checks passed
