Reuse threads in ThreadUtils #2152
Conversation
P.S. If we could instead address the number of LineReaders getting created and spinning up threads (#2099 (comment)), maybe this won't be needed.
While that would probably be a good idea independently, your code seems like genuinely good code just by itself? Do you disagree?
Maybe. I don't know enough about the usage patterns here to know whether it would be beneficial or not. If there are other cases where we're expecting to be calling this method lots of times in short succession to do work, then yeah, we should keep this. If, however, this situation rarely arises once the LineReader thing is fixed, then this is probably just adding unnecessary complexity. My suggestion would be to take the fix, then work on the LineReader thing, and then re-evaluate whether this can be removed and just replaced with …
Thank you!
Close/reopen to trigger build.
This is hitting the 45-minute timeout for the test run. Closing/rubbing some cheetah blood on it/reopening to re-run the CI tests.
Is that likely an indication that I've got a bug here, or is that commonly seen on other PRs?
@stephentoub: I'm seeing other PRs pass tests w/ extra time to spare. So I think it's a good indication that this PR is slowing down the tests. There are benchmarks which you can run on your new code (and the old) to compare runtimes.
Thanks. Can you point me to directions on how to run those?
@Anipik: Do we have instructions on how to run the benchmarks?
Yes, see https://github.com/dotnet/machinelearning/blob/master/test/Microsoft.ML.Benchmarks/README.md. The SDK version is mentioned here: https://github.com/dotnet/machinelearning/blob/master/DotnetCLIVersion.netcoreapp.latest.txt (currently we are using the preview one for .NET Core 3.0).
It's internal and no one's using it. And if someone wants this behavior, `ConcurrentExclusiveSchedulerPair` in the framework can be used, e.g. replacing `new LimitedConcurrencyLevelTaskScheduler(level)` with `new ConcurrentExclusiveSchedulerPair(TaskScheduler.Default, level).ConcurrentScheduler`.
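For reference, a minimal sketch of using `ConcurrentExclusiveSchedulerPair` as a concurrency-limited scheduler; the `level` value and the queued work items are placeholders for illustration, not code from this repo:

```csharp
using System;
using System.Threading.Tasks;

class SchedulerDemo
{
    static void Main()
    {
        int level = 2; // maximum number of work items allowed to run concurrently (placeholder value)

        // ConcurrentScheduler runs queued tasks on the underlying scheduler
        // (the ThreadPool via TaskScheduler.Default) with at most 'level' in flight at once.
        TaskScheduler scheduler =
            new ConcurrentExclusiveSchedulerPair(TaskScheduler.Default, level).ConcurrentScheduler;
        var factory = new TaskFactory(scheduler);

        var tasks = new Task[8];
        for (int i = 0; i < tasks.Length; i++)
        {
            int id = i;
            tasks[id] = factory.StartNew(() => Console.WriteLine($"work item {id}"));
        }

        Task.WaitAll(tasks);
    }
}
```

Because the pair schedules onto the ThreadPool rather than dedicated threads, it covers the limited-concurrency scenario without a custom scheduler type.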
Force-pushed from 0bc9176 to 69c7a4e.
The issue was just a functional bug, which I fixed (I was missing a check for whether the queue was empty or not, resulting in a hang).
I ran the perf tests. Most of them look like they're within the noise, but the TrainIris test dropped from 10.38s before the change to 4.54s after, and a couple others (e.g. TrainSentiment) look like they might have a 15-20% improvement. I'm not sure what's different about those that highlights the impact of this.
Well, if I examine the test in detail, it doesn't surprise me that much. First, Iris is a really tiny dataset -- 150 lines of 4 features and 1 label. For that reason any sort of "setup" cost relating to the creation of cursors, cursor sets, reconciliation, etc. tends to have its impact exaggerated. SDCA also is an algorithm that tends to perform many iterations over the data, so there are many opportunities for this to have an impact. Also I see that, despite this, the test was run in such a way that allowed it to run at full concurrency, if it really wanted to. And caching is not enabled here, so I believe it will opt to do some parallel parsing of the text. (Since it doesn't know ahead of time that there are only 150 lines, and even if it somehow did, the logic isn't sophisticated enough to take advantage of that.)

Just wondering: @stephentoub, can you merge, or do you not have that privilege in this repo? I'd be happy to do it, just not sure if there was something else you were thinking of doing in the meanwhile.
@stephentoub: Is this ready to be merged in?

@TomFinley: Thanks for the analysis. It offers a good explanation for why Iris & SDCA in particular are doing better. We should likely add caching to the benchmarks (when they were created we had auto-caching); this assumes users know how to utilize caching, and it keeps the benchmarks representative of user pipelines -- cc: @Anipik
Adding caching to the benchmarks sounds good. I actually rather like that the test itself does not have caching -- I can think of few better stress tests on our infrastructure than running a dataset as tiny as this through the loader/transformation pipe with something as brutal on the pipeline as SDCA. 😆

We also need to document this caching stuff perhaps better than we do. So we found that auto-caching was a real problem -- great, fine, but how do we educate users about when they themselves should be caching, and how? Anyway, more documentation stuff. 😛

That big green button is looking awfully tempting @stephentoub ... I just want to press it.
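For the documentation angle, here is a rough sketch of what an explicit caching step looks like in a user pipeline. The API names reflect the current `MLContext`-based surface, and the file name and column layout are assumptions for illustration, not taken from the benchmark code:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical schema for the Iris text file; the column layout is assumed for illustration.
public class IrisData
{
    [LoadColumn(0)] public float SepalLength;
    [LoadColumn(1)] public float SepalWidth;
    [LoadColumn(2)] public float PetalLength;
    [LoadColumn(3)] public float PetalWidth;
    [LoadColumn(4)] public string Label;
}

public static class CachingExample
{
    public static void Run()
    {
        var mlContext = new MLContext();

        IDataView data = mlContext.Data.LoadFromTextFile<IrisData>("iris-data.txt", separatorChar: ',');

        // Materialize the loaded rows once so that a multi-pass trainer such as SDCA
        // does not re-run the text loader (and any upstream transforms) on every pass.
        IDataView cached = mlContext.Data.Cache(data);

        // ... build the training pipeline on top of 'cached' ...
    }
}
```

Estimator chains can alternatively call `AppendCacheCheckpoint` to cache at a specific point in the pipeline.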
Hah 😄 Understood. I'm re-running the tests one more time, just for a sanity check.
This test failed:
Is this test known to be flaky, or is it likely I've caused this?
I would assume it's random, given the high number of random test failures we see, and since the test passed on other legs of the CI and then on all legs upon retrying. A secondary question is: could this work make the random test failures more likely?

This specific failed test is rather odd. The error says fewer than two label classes were found in the training file for Iris, but the training file has three classes. In this test, SDCA is being used to train a model (no train/test split, no CV, just train). If there were a train/test split (or CV split), I could see how threading could affect the split, causing only one of the label classes to be seen; but it's using the whole file, so it should always see all three label classes.

Failed test (failed once on one leg, passed on other legs, and upon retry): machinelearning/test/Microsoft.ML.Tests/Scenarios/Api/CookbookSamples/CookbookSamples.cs, lines 171 to 220 at b65d725
Perhaps another one for #1131 (Investigate and fix test failures) -- test in CI run |
Like @justinormont, I'm not too concerned... so @stephentoub, it is not that the test is flaky, but neither would I blame your change; I think it is a real issue in the existing code. We've observed on the test machines that, when reading data, the first pass over the data sometimes "ends" early under seemingly very rare circumstances that we cannot repro locally.

This test uses the Iris dataset of 150 lines, with the 3 classes appearing in sequence in groups of 50 lines. In this case it had read 50 or fewer lines when trying to detect the number of classes, and complains, "hey wait a second, there's only one class here, I can't do anything with this." As we see below in your log:
I don't recall seeing this particular test fail too often before, but I consider this failure a manifestation of that same bug (if only because it seems to be consistent with our hypothesis of what the problem is). The more common case seems to be SDCA complaining, "wait a second, in my first pass I observed only X instances, and suddenly in a subsequent pass I've seen an X+1th instance, what are you trying to pull on me?"

Still, we have trouble: attempts to repro locally have been elusive. As in, I don't think anyone has succeeded in doing so despite many attempts. So even if your change was somehow making the flaw in our code more obvious, we'd obviously welcome it, since then maybe we could finally get a handle on this thing.

I could easily believe your change might make the bug more obvious. For various reasons that wouldn't be terribly interesting to review, even though we don't have solid information on the bug, indirect information suggests a race condition: one of the text loader, the column-set consolidator, or the cache data view is inappropriately reporting that there is no more data from its source when in fact there is. Delays in multithreaded computation have a way of making race conditions less obvious; you got rid of a lot of delays, as we see from your benchmarks, so maybe suddenly we'll start seeing this more often. Maybe, if we're extra lucky, even in a debugger where we can actually see what the heck is going on, finally. But that's just my hope. Who knows.
Thanks for the solid explanation. Good enough for me.
@TomFinley, I don't have the privilege. Enjoy 😄 |
ThreadUtils.CreateBackgroundThread is being truthful to its name and creating a new thread for every call. However, even for very simple ML.NET usage, such as the MulticlassClassification_Iris demo app, this method is being called ~110,000 times, resulting in ~110,000 threads getting created and destroyed. That adds measurable overhead in normal runs, but when the debugger is attached it makes the solution effectively unusable, as every thread creation/destruction is tracked by Visual Studio, leading to significant overheads that make an F5 execution last "forever" (== I gave up waiting).
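For context, the pre-change helper was essentially a thin wrapper over the `Thread` constructor. A rough reconstruction (not verbatim from the repo; the exact overloads and signature may differ):

```csharp
using System.Threading;

internal static class ThreadUtils
{
    // Rough reconstruction: every call allocates a brand-new OS thread,
    // which the caller then starts and later joins.
    public static Thread CreateBackgroundThread(ParameterizedThreadStart start)
        => new Thread(start) { IsBackground = true };
}
```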
The right solution is for the higher-level algorithms and architecture to be better about their need for threads, creating them only when necessary and otherwise relying on the .NET ThreadPool for execution, via ThreadPool.QueueUserWorkItem or Task.Run or the like.
However, as an immediate stop-gap that significantly helps the situation, this commit allows the created threads to be reused for a period of time, such that not every call ends up creating a new thread. In my runs:
- The same demo app that created ~110K threads now creates only ~32.
- With Ctrl+F5 (i.e. no debugger attached), it previously took ~13 seconds; with this change it takes ~6.
- With F5 (debugger attached), execution previously took "forever"; it now takes ~190 seconds (still too high, but vastly improved).
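To make the reuse idea concrete, here is a minimal sketch of the technique being described: each work item still gets its own dedicated worker, but a worker that finishes lingers for a while so the next request can be handed to an existing thread instead of creating a new one. This is an illustration under those assumptions, not the actual implementation in this PR (names like `ReusableBackgroundThreads` are hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

internal static class ReusableBackgroundThreads
{
    private static readonly TimeSpan IdleTimeout = TimeSpan.FromSeconds(10);
    private static readonly List<Worker> Idle = new List<Worker>();

    private sealed class Worker
    {
        public readonly SemaphoreSlim Signal = new SemaphoreSlim(0);
        public Action Action;
        public TaskCompletionSource<bool> Tcs;
    }

    public static Task Start(Action action)
    {
        var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);

        Worker worker = null;
        lock (Idle)
        {
            if (Idle.Count > 0)
            {
                // Reuse the most recently idled thread.
                worker = Idle[Idle.Count - 1];
                Idle.RemoveAt(Idle.Count - 1);
            }
        }

        if (worker != null)
        {
            // Hand the work item to an existing idle thread.
            worker.Action = action;
            worker.Tcs = tcs;
            worker.Signal.Release();
        }
        else
        {
            // No idle thread is available: fall back to creating a new background thread.
            var w = new Worker { Action = action, Tcs = tcs };
            new Thread(() => WorkerLoop(w)) { IsBackground = true }.Start();
        }

        return tcs.Task;
    }

    private static void WorkerLoop(Worker w)
    {
        while (true)
        {
            try
            {
                w.Action();
                w.Tcs.SetResult(true);
            }
            catch (Exception e)
            {
                w.Tcs.SetException(e);
            }

            // Advertise this thread as idle and wait for more work.
            lock (Idle)
            {
                Idle.Add(w);
            }

            if (!w.Signal.Wait(IdleTimeout))
            {
                lock (Idle)
                {
                    // If we're still listed as idle, nobody claimed us: exit and let the thread die.
                    if (Idle.Remove(w))
                        return;
                }

                // A producer dequeued us just as we timed out; it has assigned (or is about to
                // assign) new work and will signal, so wait for that item and continue.
                w.Signal.Wait();
            }
        }
    }
}
```

Keeping one worker per in-flight item (rather than queuing items behind each other) preserves the original concurrency semantics, since these work items may coordinate with one another.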
The CreateBackgroundThread method is renamed to StartBackgroundThread, and it returns a Task instead of a Thread, such that callers may synchronize via the returned Task rather than with Thread.Join. In several cases this avoids consuming additional threads: callers that were blocking a thread pool thread in a synchronous wait for all of the work to complete can now avoid that entirely via Task.WhenAll. Eventually, more call sites can be fixed as well, as more of the code base is moved to async/await; for now this just handles the obvious ones that don't require any significant restructuring.
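A hypothetical caller illustrates the difference, using the `ReusableBackgroundThreads.Start` sketch above as a stand-in for the renamed helper (the actual StartBackgroundThread signature in the repo may differ):

```csharp
using System.Threading.Tasks;

static class CallerExample
{
    // Before: one dedicated Thread per worker, with the caller blocked in Thread.Join.
    // After: the helper returns a Task, so callers can compose and await the work.
    public static async Task RunWorkersAsync(int workerCount)
    {
        var tasks = new Task[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            int id = i;
            tasks[i] = ReusableBackgroundThreads.Start(() => DoWork(id));
        }

        // No thread is blocked here; the continuation runs once all workers finish.
        await Task.WhenAll(tasks);
    }

    private static void DoWork(int id)
    {
        // Placeholder for the per-worker work item.
    }
}
```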
Contributes to #2099