Cleans inference performance optimization doc #2519

Merged
merged 1 commit into from Apr 10, 2023
76 changes: 45 additions & 31 deletions docs/development/inference_performance_optimization.md
@@ -1,45 +1,54 @@
# Inference Performance Optimization

This document covers several tricks for tuning your inference performance based on the engine you use,
including multithreading support, engine thread configuration, and how to enable DNNL (MKLDNN).

## Multithreading Support

One of the advantages of Deep Java Library (DJL) is its multi-threaded inference support.
It can help increase the throughput of your inference on multi-core CPUs and GPUs and reduce
memory consumption compared to Python.

The DJL `Predictor` is not designed to be thread-safe.
Because some engines are not thread-safe, we use the Predictor to help cover the differences.
For engines that are thread-safe, we do nothing; for ones that are not, we make copies as necessary.

Therefore, we recommend creating a new [Predictor](https://javadoc.io/doc/ai.djl/api/latest/ai/djl/inference/Predictor.html) for each thread.
That Predictor should be reused if the thread does multiple predictions.
Alternatively, you can use a pool of Predictors or leverage the [DJL Serving WorkLoadManager](http://docs.djl.ai/docs/serving/wlm/index.html).

For a reference implementation, see [Multi-threaded Benchmark](https://github.com/deepjavalibrary/djl-serving/blob/master/benchmark/src/main/java/ai/djl/benchmark/MultithreadedBenchmark.java).
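
As a minimal sketch of this pattern (the `ZooModel<Image, Classifications>` named `model`, the `images` list, and the thread count are illustrative assumptions, not part of the original doc), each worker thread creates its own `Predictor` and reuses it for all of its predictions:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import ai.djl.inference.Predictor;
import ai.djl.modality.Classifications;
import ai.djl.modality.cv.Image;
import ai.djl.repository.zoo.ZooModel;
import ai.djl.translate.TranslateException;

public class MultiThreadedInferenceSketch {

    static void predictAll(ZooModel<Image, Classifications> model, List<Image> images, int numThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        for (int t = 0; t < numThreads; t++) {
            pool.submit(() -> {
                // One Predictor per thread, reused for every prediction made on this thread
                try (Predictor<Image, Classifications> predictor = model.newPredictor()) {
                    for (Image image : images) {
                        Classifications result = predictor.predict(image);
                        // handle result ...
                    }
                } catch (TranslateException e) {
                    throw new RuntimeException(e);
                }
            });
        }
        pool.shutdown();
    }
}
```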

In addition, you may need to set engine-specific configurations.
Engine-specific details are given below.
You can also reference the [list of all DJL system configurations](http://docs.djl.ai/docs/serving/serving/docs/configurations.html).

### Apache MXNet

#### Engine configuration
To use Apache MXNet Engine to run multi-threading, complete the following steps:

#### Enable NaiveEngine with Apache MXNet
By default, MXNet tries to execute operators lazily and with internal multi-threading designed to occupy the entire machine.
This works well during training and if you want to focus on latency over throughput.
If you are using the MXNet engine for a Java multi-threaded inference case, you need to specify the
'MXNET_ENGINE_TYPE' environment variable using the following command:

```bash
export MXNET_ENGINE_TYPE=NaiveEngine
```

To get the best throughput, you may also need to set the 'OMP_NUM_THREADS' environment variable:

```bash
export OMP_NUM_THREADS=1
```

### PyTorch

#### Graph Executor Optimization

The PyTorch graph executor optimizer (JIT tensorexpr fuser) is enabled by default. When the first
few inferences are made on a **new batch size**, TorchScript generates an optimized execution graph for
the model.

@@ -55,7 +64,7 @@
You can disable the graph executor optimizer globally by setting the following system property:

```
-Dai.djl.pytorch.graph_optimizer=false
```

The graph executor optimizer is a per thread configuration. It is important to note that it must
be disabled for each thread that you are using the model from.
If you forget to disable it on a thread, it will cause the optimization to be performed and delay
any running/pending executions.
@@ -66,24 +75,25 @@
```
JniUtils.setGraphExecutorOptimize(false);
```
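
For example, a minimal per-thread sketch (reusing the hypothetical `model` and an `image` from the multithreading sketch above) could look like this:

```java
// Sketch: every thread that runs predictions disables the optimizer before its first prediction.
Runnable task = () -> {
    JniUtils.setGraphExecutorOptimize(false); // per-thread setting; only affects the current thread
    try (Predictor<Image, Classifications> predictor = model.newPredictor()) {
        Classifications result = predictor.predict(image);
        // handle result ...
    } catch (TranslateException e) {
        throw new RuntimeException(e);
    }
};
```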

#### oneDNN(MKLDNN) acceleration
Unlike TensorFlow and Apache MXNet, PyTorch doesn't enable MKLDNN by default.
Instead, it is treated as a device type like CPU and GPU.
You can enable it by setting the system property:

```
-Dai.djl.pytorch.use_mkldnn=true
```

You might see an exception if a data type or operator is not supported with the oneDNN device.

#### CuDNN acceleration
PyTorch has a special flag used to speed up CNNs and related networks. If your input size won't change frequently,
you may benefit from enabling this configuration in your model:

```
-Dai.djl.pytorch.cudnn_benchmark=true
```

If your input shape changes frequently, this change may stall your performance. For more information, check this
[article](https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#enable-cudnn-auto-tuner).

#### Thread configuration
@@ -93,35 +103,35 @@
There are two configurations you can set to optimize the inference performance.

```
-Dai.djl.pytorch.num_interop_threads=[num of the interop threads]
```

This configures the number of operations that the JIT interpreter can fork and execute in parallel.

```
-Dai.djl.pytorch.num_threads=[num of the threads]
```

This configures the number of threads within the operation. It is set to the number of CPU cores by default.

You can find more details in the [PyTorch documentation](https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html).
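
If you would rather set these from code than pass JVM flags, a minimal sketch is shown below (assumption: the properties need to be set before the PyTorch engine is loaded for the first time; the values are only examples):

```java
// Sketch: configure PyTorch threading programmatically before the engine is initialized.
// The same mechanism applies to the other ai.djl.pytorch.* properties mentioned above,
// such as ai.djl.pytorch.use_mkldnn and ai.djl.pytorch.cudnn_benchmark.
System.setProperty("ai.djl.pytorch.num_interop_threads", "1");
System.setProperty("ai.djl.pytorch.num_threads", "4");
```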

### TensorFlow

#### Multithreading Inference
You can follow the same steps as for other engines for running multithreading inference using the TensorFlow engine.
It's recommended to use one `Predictor` for each thread and avoid using a new `Predictor` for each inference call.
You can refer to our [Multithreading Benchmark](https://github.com/deepjavalibrary/djl-serving/blob/master/benchmark/src/main/java/ai/djl/benchmark/MultithreadedBenchmark.java) as an example.
Here is how to run it using the TensorFlow engine:

```bash
cd djl-serving
./gradlew benchmark --args='-e TensorFlow -c 100 -t -1 -u djl://ai.djl.tensorflow/resnet/0.0.1/resnet50 -s 1,224,224,3'
```

#### oneDNN(MKLDNN) acceleration
By default, the TensorFlow engine comes with oneDNN enabled; no special configuration is needed.

#### Thread configuration
It's recommended to use 1 thread for operator parallelism during multithreading inference.
You can configure it by setting the following 3 environment variables:

```bash
export OMP_NUM_THREADS=1
@@ -135,19 +145,23 @@
export TF_NUM_INTRAOP_THREADS=1
```

You can use the following settings for thread optimization in Criteria:

```java
Criteria.builder()
        .optOption("interOpNumThreads", <num_of_thread>)
        .optOption("intraOpNumThreads", <num_of_thread>)
        ...
```

Tips: Set both of them to 1 at the beginning to see the performance.
Then, set to `total_cores`/`total_java_inference_thread` on one of them to see how performance goes.
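
As a fuller sketch of a `Criteria` using these options (the input/output types are illustrative, and the model URL is taken from the benchmark command above):

```java
Criteria<Image, Classifications> criteria =
        Criteria.builder()
                .setTypes(Image.class, Classifications.class)
                .optEngine("TensorFlow")
                .optModelUrls("djl://ai.djl.tensorflow/resnet/0.0.1/resnet50")
                .optOption("interOpNumThreads", "1")
                .optOption("intraOpNumThreads", "1")
                .build();
```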

#### (GPU) TensorRT Backend

If you have TensorRT installed, you can try the following backend on ONNXRuntime for performance optimization in Criteria:

```java
Criteria.builder()
        .optOption("ortDevice", "TensorRT")
        ...
```
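
The `ortDevice` option is passed on the `Criteria` the same way as the thread options above; a more complete sketch (the model URL and types here are placeholders for your own ONNX model) could be:

```java
Criteria<Image, Classifications> criteria =
        Criteria.builder()
                .setTypes(Image.class, Classifications.class)
                .optEngine("OnnxRuntime")
                .optModelUrls("file:///path/to/your/onnx/model")
                .optOption("ortDevice", "TensorRT")
                .build();
```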