-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flaky - TestP2PWithMultipleDocumentsWithUpdateAndDeleteBeforeConnectSingleDeleteWithShowDeleted #1338
Comments
Spotted something today in the log below. Comparing the two datastore type runs, you can see the first (badger-IM) fails, and the second (defra-IM) succeeds. Perhaps significant is the order in which the DAG heads are replaced - the failed run replaces
EDIT: I saw this exact same pattern a second time today, with the same reversal of heads - hopefully it should be easy enough to track down. EDIT2: This mode of failure appears much more common on my machine than in the CI, it could be that a faster test run is more likely to hit this particular issue, whereas a slower run is more likely to generate a timeout. |
I see 3 types of failures:
My current timeout theory for some of the timeout with incorrect data is that there is a race condition for newly synced documents. A document-topic will only be subscribed to after the create-event has almost finished processing - this means that if the update/delete event is received before the topic-subscription the update/delete will never sync, and the pushLogEmitter will never emit an EvtReceivedPushLog - causing a timeout in most cases. It is worth noting however, that this mode of failure is not relevant for many tests as they do not rely on Create event to setup the document-topics (many failing tests create the docs before configuration and rely on listAllDocKeys in newServer). I do not understand why sisley/fix/I1338-sub-before-emit might, and appears to affect things - it just feels like there is a race condition there, and more time should be spent to investigate this line of thought. EDIT: https://github.com/sourcenetwork/defradb/actions/runs/4704788576/jobs/8345352190 failed with mode (2), on a branch based off of sisley/fix/I1338-sub-before-emit (https://github.com/sourcenetwork/defradb/compare/sisley/fix/I1338-buf-size?expand=1) so if it does affect the rate of that mode of failure it is not completely curing it. EDIT2:
Also noteworthy is that only P2P update tests failed during these runs, and most of them were TestP2PWithMultipleDocumentUpdatesPerNode failing. No delete test failures were recorded (verify they are in the package :)). This to me suggests the bulk of the CI timeouts may be caused by a deadlock once the event buffer is filled. More testing and investigation required. If true this will likely affect production. |
EventBus related issue should have been fixed in #1359 P2P test timeouts are still occuring however, so there are other root causes. |
Describe the problem
Flakey test.
To Reproduce
make test
Expected behavior
Not be flaky
Platform
latest commit on develop:
bbe8408a (HEAD -> develop, origin/develop) 2 hours ago fix: API address parameter validation (#1311)
Additional context
The text was updated successfully, but these errors were encountered: