Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid race condition #136

Conversation

FinleyMcIlwaine
Copy link
Collaborator

This is a subtle one! There was a race between two thread managers: the System.TimeManager.Manager and the Network.HTTP2.H2.Manager (which wraps the former). The race condition could arise as follows:

  1. Suppose an exception is raised in one of the background threads (e.g., the frame sender).

  2. Then stopAfter writes a Stop instruction the H2.Manager queue; the exception gets then rethrown.

  3. If the server is run in the scope of withManager (as it is done in Network.Run.TCP.Timeout.runTCPServerWithSocket for example), the exception is caught again, causing stopManager to be called.

  4. Now one of two things can happen first (for each managed thread):

    a. stopManager might call the timeout action, which will throw TimeoutThread to the thread (because we registered the threads with registerKillThread)

    b. The H2.Manager will process the Stop instruction, and throw KilledByHTTP2ThreadManager to the thread.

    The choice is non-deterministic, depending on thread scheduling by the RTS.

  5. After this, the other exception is then thrown to same thread (which might at this point be busy handling the first exception).

So not only is the type of exception that is thrown non-deterministic, we also have the possibility that the thread will be interrupted handling that first exception by the second exception.

This commit fixes this by putting the H2.Manager in charge (it is aware of both managers, after all). When stopAfter writes Stop to the H2.Manager queue, it waits until that Stop instruction has been processed; this will (after this commit) also ensure that the threads have been removed from the timeout manager. This guarantees that the exception that is delivered (and the only exception that is delivered) is the KilledByHTTP2ThreadManager from http2.

This is a subtle one! There was a race between two thread managers: the
`System.TimeManager.Manager` and the `Network.HTTP2.H2.Manager` (which wraps
the former). The race condition could arise as follows:

1. Suppose an exception is raised in one of the background threads (e.g.,
   the frame sender).

2. Then `stopAfter` writes a `Stop` instruction the `H2.Manager` queue;
   the exception gets then rethrown.

3. If the server is run in the scope of `withManager` (as it is done in
   `Network.Run.TCP.Timeout.runTCPServerWithSocket` for example), the
   exception is caught again, causing `stopManager` to be called.

4. Now one of two things can happen /first/ (for each managed thread):

   a. `stopManager` might call the timeout action, which will throw
      `TimeoutThread` to the thread
      (because we registered the threads with `registerKillThread`)

   b. The `H2.Manager` will process the `Stop` instruction, and throw
      `KilledByHTTP2ThreadManager` to the thread.

   The choice is non-deterministic, depending on thread scheduling by the RTS.

5. After this, the _other_ exception is then thrown to _same_ thread
   (which might at this point be busy handling the _first_ exception).

So not only is the type of exception that is thrown non-deterministic, we also
have the possibility that the thread will be interrupted handling that first
exception by the second exception.

This commit fixes this by putting the `H2.Manager` in charge (it is aware of
both managers, after all). When `stopAfter` writes `Stop` to the `H2.Manager`
queue, it waits until that `Stop` instruction has been processed; this will
(after this commit) also ensure that the threads have been removed from the
timeout manager. This guarantees that the exception that is delivered (and the
_only_ exception that is delivered) is the `KilledByHTTP2ThreadManager` from
`http2`.
@kazu-yamamoto kazu-yamamoto self-requested a review July 18, 2024 23:00
@kazu-yamamoto
Copy link
Owner

I will take a day off today.
So, the review should be done in the next week.

@FinleyMcIlwaine
Copy link
Collaborator Author

I can reproduce a test failure locally, although I'm seeing thread blocked indefinitely in an MVar operation rather than the undefined in the failure here. It is a very infrequent failure, but it seems the comment in the code about the impossibility of a deadlock in the patch is not accurate. We will look into this.

For reference, I am reproducing by running:

while cabal run http2:spec -- +RTS -N; do :; done

@edsko
Copy link
Collaborator

edsko commented Jul 19, 2024

I think I might know what's going on, but I don't have push rights to Finley's branch, so I'll need to open a new PR. Will do so shortly.

@edsko edsko mentioned this pull request Jul 19, 2024
@edsko
Copy link
Collaborator

edsko commented Jul 19, 2024

Obsoleted by #137.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants