Reuse threads in ThreadUtils #2152
Conversation
P.S. If we could instead address the number of LineReaders getting created and spinning up threads (#2099 (comment)), maybe this won't be needed.
While that would probably be a good idea independently, your code seems like genuinely good code just by itself? Do you disagree?
Maybe. I don't know enough about the usage patterns here to know whether it would be beneficial or not. If there are other cases where we're expecting to be calling this method lots of times in short succession to do work, then yeah, we should keep this. If, however, this situation rarely arises once the LineReader thing is fixed, then this is probably just adding unnecessary complexity. My suggestion would be to take the fix, then work on the LineReader thing, and then re-evaluate whether this can be removed and just replaced with …
Thank you!
Close/reopen to trigger build.
This is hitting the 45-minute timeout for the test run. Closing/rubbing some cheetah blood on it/reopening to re-run the CI tests.
Is that likely an indication that I've got a bug here, or is that commonly seen on other PRs?
@stephentoub: I'm seeing other PRs pass tests w/ extra time to spare. So I think it's a good indication that this PR is slowing down the tests. There are benchmarks which you can run on your new code (and the old) to compare runtimes.
Thanks. Can you point me to directions on how to run those?
@Anipik: Do we have instructions on how to run the benchmarks?
Yes, see https://github.com/dotnet/machinelearning/blob/master/test/Microsoft.ML.Benchmarks/README.md. The SDK version is mentioned here: https://github.com/dotnet/machinelearning/blob/master/DotnetCLIVersion.netcoreapp.latest.txt (currently we are using the preview one for .NET Core 3.0).
It's internal and no one's using it. And if someone wants this behavior, `ConcurrentExclusiveSchedulerPair` in the framework can be used, e.g. replacing `new LimitedConcurrencyLevelTaskScheduler(level)` with `new ConcurrentExclusiveSchedulerPair(TaskScheduler.Default, level).ConcurrentScheduler`.
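For reference, a minimal sketch of using `ConcurrentExclusiveSchedulerPair` as a concurrency-limited scheduler; the `level` value and the queued work items are placeholders for illustration, not code from this repo:

```csharp
using System;
using System.Threading.Tasks;

class SchedulerDemo
{
    static void Main()
    {
        int level = 2; // maximum number of work items allowed to run concurrently (placeholder value)

        // ConcurrentScheduler runs queued tasks on the underlying scheduler
        // (the ThreadPool via TaskScheduler.Default) with at most 'level' in flight at once.
        TaskScheduler scheduler =
            new ConcurrentExclusiveSchedulerPair(TaskScheduler.Default, level).ConcurrentScheduler;
        var factory = new TaskFactory(scheduler);

        var tasks = new Task[8];
        for (int i = 0; i < tasks.Length; i++)
        {
            int id = i;
            tasks[id] = factory.StartNew(() => Console.WriteLine($"work item {id}"));
        }

        Task.WaitAll(tasks);
    }
}
```

Because the pair schedules onto the ThreadPool rather than dedicated threads, it covers the limited-concurrency scenario without a custom scheduler type.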
Force-pushed from 0bc9176 to 69c7a4e.
The issue was just a functional bug, which I fixed (I was missing a check for whether the queue was empty or not, resulting in a hang).
I ran the perf tests. Most of them look like they're within the noise, but the TrainIris test dropped from 10.38s before the change to 4.54s after, and a couple others (e.g. TrainSentiment) look like they might have a 15-20% improvement. I'm not sure what's different about those that highlights the impact of this.
Well, if I examine the test in detail, it doesn't surprise me that much. First, Iris is a really tiny dataset -- 150 lines of 4 features and 1 label. For that reason any sort of "setup" cost relating to the creation of cursors, cursor sets, reconciliation, etc. tends to have its impact exaggerated. SDCA also is an algorithm that tends to perform many iterations over the data, so there are many opportunities for this to have an impact. Also I see that, despite this, the test was run in such a way that allowed it to run at full concurrency, if it really wanted to. And caching is not enabled here, so I believe it will opt to do some parallel parsing of the text. (Since it doesn't know ahead of time that there are only 150 lines, and even if it somehow did, the logic isn't sophisticated enough to take advantage of that.)

Just wondering: @stephentoub, can you merge, or do you not have that privilege in this repo? I'd be happy to do it, just not sure if there was something else you were thinking of doing in the meanwhile.
@stephentoub: Is this ready to be merged in?

@TomFinley: Thanks for the analysis. It offers a good explanation for why Iris & SDCA in particular are doing better. We should likely add caching to the benchmarks (when they were created we had auto-caching); this assumes users know how to utilize caching, and it keeps the benchmarks representative of user pipelines -- cc: @Anipik
Adding caching to the benchmarks sounds good. I actually rather like that the test itself does not have caching -- I can think of few better stress tests on our infrastructure than running a dataset as tiny as this through the loader/transformation pipe with something as brutal on the pipeline as SDCA. 😆

We also need to document this caching stuff perhaps better than we do. So we found that auto-caching was a real problem -- great, fine, but how do we educate users about when they themselves should be caching, and how? Anyway, more documentation stuff. 😛

That big green button is looking awfully tempting @stephentoub ... I just want to press it.
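For the documentation angle, here is a rough sketch of what an explicit caching step looks like in a user pipeline. The API names reflect the current `MLContext`-based surface, and the file name and column layout are assumptions for illustration, not taken from the benchmark code:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical schema for the Iris text file; the column layout is assumed for illustration.
public class IrisData
{
    [LoadColumn(0)] public float SepalLength;
    [LoadColumn(1)] public float SepalWidth;
    [LoadColumn(2)] public float PetalLength;
    [LoadColumn(3)] public float PetalWidth;
    [LoadColumn(4)] public string Label;
}

public static class CachingExample
{
    public static void Run()
    {
        var mlContext = new MLContext();

        IDataView data = mlContext.Data.LoadFromTextFile<IrisData>("iris-data.txt", separatorChar: ',');

        // Materialize the loaded rows once so that a multi-pass trainer such as SDCA
        // does not re-run the text loader (and any upstream transforms) on every pass.
        IDataView cached = mlContext.Data.Cache(data);

        // ... build the training pipeline on top of 'cached' ...
    }
}
```

Estimator chains can alternatively call `AppendCacheCheckpoint` to cache at a specific point in the pipeline.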
Hah 😄 Understood. I'm re-running the tests one more time, just for a sanity check.
This test failed:
Is this test known to be flaky, or is it likely I've caused this?
I would assume it's random, given the high number of random test failures we see, and since the test passed on other legs of the CI and then on all legs upon retrying. A secondary question is: could this work make the random test failures more likely?

This specific failed test is rather odd. The error says fewer than two label classes were found in the training file for Iris, but the training file has three classes. In this test, SDCA is being used to train a model (no train/test split, no CV, just train). If there were a train/test split (or CV split), I could see how threading could affect the split, causing only one of the label classes to be seen; but it's using the whole file, so it should always see all three label classes.

Failed test (failed once on one leg, passed on other legs, and upon retry): machinelearning/test/Microsoft.ML.Tests/Scenarios/Api/CookbookSamples/CookbookSamples.cs, lines 171 to 220 at b65d725
Perhaps another one for #1131 (Investigate and fix test failures) -- test in CI run |
Like @justinormont, I'm not too concerned... so @stephentoub, it is not that the test is flaky, but neither would I blame your change; I think it is a real issue in the existing code. We've observed on the test machines that, when reading data, the first pass over the data sometimes "ends" early under seemingly very rare circumstances that we cannot repro locally.

This test uses the Iris dataset of 150 lines, with the 3 classes appearing in sequence in groups of 50 lines. In this case it had read 50 or fewer lines when trying to detect the number of classes, and complains, "hey wait a second, there's only one class here, I can't do anything with this." As we see below in your log:
I don't recall seeing this particular test fail too often before, but I consider this failure a manifestation of that same bug (if only because it seems to be consistent with our hypothesis of what the problem is). The more common case seems to be SDCA complaining, "wait a second, in my first pass I observed only X instances, and suddenly in a subsequent pass I've seen an X+1th instance, what are you trying to pull on me?"

Still, we have trouble: attempts to repro locally have been elusive. As in, I don't think anyone has succeeded in doing so despite many attempts. So even if your change was somehow making the flaw in our code more obvious, we'd obviously welcome it, since then maybe we could finally get a handle on this thing.

I could easily believe your change might make the bug more obvious. For various reasons that wouldn't be terribly interesting to review, even though we don't have solid information on the bug, indirect information suggests a race condition: one of the text loader, the column-set consolidator, or the cache data view is inappropriately reporting that there is no more data from its source when in fact there is. Delays in multithreaded computation have a way of making race conditions less obvious; you got rid of a lot of delays, as we see from your benchmarks, so maybe suddenly we'll start seeing this more often. Maybe, if we're extra lucky, even in a debugger where we can actually see what the heck is going on, finally. But that's just my hope. Who knows.
Thanks for the solid explanation. Good enough for me.
@TomFinley, I don't have the privilege. Enjoy 😄 |
ThreadUtils.CreateBackgroundThread is being truthful to its name and creating a new thread for every call. However, even for very simple ML.NET usage, such as the MulticlassClassification_Iris demo app, this method is being called ~110,000 times, resulting in ~110,000 threads getting created and destroyed. That adds measurable overhead in normal runs, but when the debugger is attached it makes the solution effectively unusable, as every thread creation/destruction is tracked by Visual Studio, leading to significant overheads that make an F5 execution last "forever" (== I gave up waiting).
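For context, the pre-change helper was essentially a thin wrapper over the `Thread` constructor. A rough reconstruction (not verbatim from the repo; the exact overloads and signature may differ):

```csharp
using System.Threading;

internal static class ThreadUtils
{
    // Rough reconstruction: every call allocates a brand-new OS thread,
    // which the caller then starts and later joins.
    public static Thread CreateBackgroundThread(ParameterizedThreadStart start)
        => new Thread(start) { IsBackground = true };
}
```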
The right solution is for the higher-level algorithms and architecture to be better about their need for threads, creating them only when necessary and otherwise relying on the .NET ThreadPool for execution, via ThreadPool.QueueUserWorkItem or Task.Run or the like.
However, as an immediate stop-gap that significantly helps the situation, this commit allows the created threads to be reused for a period of time, such that not every call ends up creating a new thread. In my runs:
- The same demo app that created ~110K threads now creates only ~32.
- With Ctrl+F5 (i.e. no debugger attached), it previously took ~13 seconds; with this change it takes ~6.
- With F5 (debugger attached), execution previously took "forever"; it now takes ~190 seconds (still too high, but vastly improved).
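To make the reuse idea concrete, here is a minimal sketch of the technique being described: each work item still gets its own dedicated worker, but a worker that finishes lingers for a while so the next request can be handed to an existing thread instead of creating a new one. This is an illustration under those assumptions, not the actual implementation in this PR (names like `ReusableBackgroundThreads` are hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

internal static class ReusableBackgroundThreads
{
    private static readonly TimeSpan IdleTimeout = TimeSpan.FromSeconds(10);
    private static readonly List<Worker> Idle = new List<Worker>();

    private sealed class Worker
    {
        public readonly SemaphoreSlim Signal = new SemaphoreSlim(0);
        public Action Action;
        public TaskCompletionSource<bool> Tcs;
    }

    public static Task Start(Action action)
    {
        var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);

        Worker worker = null;
        lock (Idle)
        {
            if (Idle.Count > 0)
            {
                // Reuse the most recently idled thread.
                worker = Idle[Idle.Count - 1];
                Idle.RemoveAt(Idle.Count - 1);
            }
        }

        if (worker != null)
        {
            // Hand the work item to an existing idle thread.
            worker.Action = action;
            worker.Tcs = tcs;
            worker.Signal.Release();
        }
        else
        {
            // No idle thread is available: fall back to creating a new background thread.
            var w = new Worker { Action = action, Tcs = tcs };
            new Thread(() => WorkerLoop(w)) { IsBackground = true }.Start();
        }

        return tcs.Task;
    }

    private static void WorkerLoop(Worker w)
    {
        while (true)
        {
            try
            {
                w.Action();
                w.Tcs.SetResult(true);
            }
            catch (Exception e)
            {
                w.Tcs.SetException(e);
            }

            // Advertise this thread as idle and wait for more work.
            lock (Idle)
            {
                Idle.Add(w);
            }

            if (!w.Signal.Wait(IdleTimeout))
            {
                lock (Idle)
                {
                    // If we're still listed as idle, nobody claimed us: exit and let the thread die.
                    if (Idle.Remove(w))
                        return;
                }

                // A producer dequeued us just as we timed out; it has assigned (or is about to
                // assign) new work and will signal, so wait for that item and continue.
                w.Signal.Wait();
            }
        }
    }
}
```

Keeping one worker per in-flight item (rather than queuing items behind each other) preserves the original concurrency semantics, since these work items may coordinate with one another.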
The CreateBackgroundThread method is renamed to StartBackgroundThread, and it returns a Task instead of a Thread, such that callers may synchronize via the returned Task rather than with Thread.Join. In several cases this avoids consuming additional threads: callers that were blocking a thread pool thread in a synchronous wait for all of the work to complete can now avoid that entirely via Task.WhenAll. Eventually, more call sites can be fixed as well, as more of the code base is moved to async/await; for now this just handles the obvious ones that don't require any significant restructuring.
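A hypothetical caller illustrates the difference, using the `ReusableBackgroundThreads.Start` sketch above as a stand-in for the renamed helper (the actual StartBackgroundThread signature in the repo may differ):

```csharp
using System.Threading.Tasks;

static class CallerExample
{
    // Before: one dedicated Thread per worker, with the caller blocked in Thread.Join.
    // After: the helper returns a Task, so callers can compose and await the work.
    public static async Task RunWorkersAsync(int workerCount)
    {
        var tasks = new Task[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            int id = i;
            tasks[i] = ReusableBackgroundThreads.Start(() => DoWork(id));
        }

        // No thread is blocked here; the continuation runs once all workers finish.
        await Task.WhenAll(tasks);
    }

    private static void DoWork(int id)
    {
        // Placeholder for the per-worker work item.
    }
}
```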
Contributes to #2099