Memory issues while using Hybrid Engine #1516
Comments
I ran your test code and I can confirm that there is a memory leak in the hybrid NDManager. On every inference, one NDArray is leaked. I will debug further and see how to fix it.
The root cause is that we lose track of alternativeManager when a new NDManager is attached to the NDArray: https://github.com/deepjavalibrary/djl/blob/master/api/src/main/java/ai/djl/ndarray/NDArrayAdapter.java#L70 Once a new NDManager is attached to an NDArray, all NDArrays it creates should live under the new NDManager. For OrtNDArray, since all NDArrays are created by alternativeManager, the alternativeManager should also be updated: it should either be a child of the new NDManager or point to the new NDManager itself.
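For reference, a minimal sketch of that fix idea, assuming the adapter keeps manager and alternativeManager fields roughly as described above (this is a fragment meant to live inside NDArrayAdapter, not the actual DJL patch):

```java
// Sketch only: when a new NDManager is attached, re-point alternativeManager
// as well, so that NDArrays later created by the fallback engine end up under
// the new manager's scope instead of the stale one.
@Override
public void attach(NDManager manager) {
    detach();
    this.manager = manager;
    manager.attachInternal(getUid(), this);
    if (alternativeManager != null) {
        // assumption: make it a child of the new manager; pointing it directly
        // at the new manager would also work, per the comment above
        alternativeManager = manager.newSubManager();
    }
}
```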
When I use the ONNX model together with PyTorch's PtNDArray, this issue still seems to arise.
Do you have code that can reproduce your issue?
A code example:
Environment information:
What's the error message and stacktrace? I don't think this error is related to the memory leak. Can you print out the shape of
Description
We see increased memory consumption when using ONNX Runtime together with another engine (PyTorch or MXNet). In the example code I made, with fixed inputs and a single-threaded predictor, memory creeps up slowly in increments of roughly 20 MB.
Over about 40 minutes, memory grows by roughly 1 GB. The Java heap is fine, but the RSS memory of the Java process keeps increasing. Memory grows much faster in our production environment with variable inputs than in the example I made, possibly because of the higher rate of predictions.
This also happens exclusively while using the hybrid engine (which, as I understand it, kicks in when I use an operation that is not currently supported by the ONNX engine). If I don't use any unsupported operations, I don't have any problems.
We've also noticed that the memory problem happens when I keep the model loaded with
ModelZoo.loadModel
for a while. If I load the model and close it around every prediction, I don't have any memory problems, although inference seems slower. It seems that some residual memory is attached to the model's resources with every prediction and is only released when the model is released.
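To illustrate that observation, a rough sketch of the per-prediction load/close pattern that keeps memory flat (the model path and translator are placeholders, and the checked exceptions from loadModel/predict are omitted for brevity):

```java
// Rough sketch (not the exact reproduction code): reloading and closing the
// model around each prediction keeps RSS flat, at the cost of slower inference.
Criteria<float[], float[]> criteria =
        Criteria.builder()
                .setTypes(float[].class, float[].class)
                .optModelPath(Paths.get("model.onnx"))  // placeholder path
                .optEngine("OnnxRuntime")
                .optTranslator(new MyTranslator())      // placeholder translator
                .build();

for (float[] input : inputs) {
    // load, predict, close: no residual memory accumulates on the model
    try (ZooModel<float[], float[]> model = ModelZoo.loadModel(criteria);
            Predictor<float[], float[]> predictor = model.newPredictor()) {
        float[] output = predictor.predict(input);
    }
}
```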
Expected Behavior
That memory would not increase over time when doing predictions.
How to Reproduce?
I made an example App here
The critical part is in the translator:
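The original snippet is not captured here; a hypothetical translator of roughly this shape triggers the hybrid path, because softmax is not implemented by the ORT engine and falls back to the secondary engine (class and method details are illustrative, not the code from the linked example app):

```java
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDList;
import ai.djl.ndarray.NDManager;
import ai.djl.translate.Batchifier;
import ai.djl.translate.Translator;
import ai.djl.translate.TranslatorContext;

/** Hypothetical translator, not the one from the linked example app. */
public class ExampleTranslator implements Translator<float[], float[]> {

    @Override
    public NDList processInput(TranslatorContext ctx, float[] input) {
        NDManager manager = ctx.getNDManager();
        return new NDList(manager.create(input));
    }

    @Override
    public float[] processOutput(TranslatorContext ctx, NDList list) {
        NDArray logits = list.singletonOrThrow();
        // softmax is not implemented by the ORT engine, so this call is routed
        // through the hybrid engine (PyTorch/MXNet) -- the path where the
        // memory growth is observed
        return logits.softmax(-1).toFloatArray();
    }

    @Override
    public Batchifier getBatchifier() {
        return Batchifier.STACK;
    }
}
```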
Model checkpoint:
https://drive.google.com/file/d/1Hm8Q4CnAjpychj3L3c4C4p6pQ517ZLef/view?usp=sharing
What have you tried to solve it?
We were able to work around this issue by controlling the NDManager of the alternative engine ourselves, for example in our translator:
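A hedged sketch of what such a workaround can look like (illustrative, not our exact code): the fallback computation is run on a manager we create and close ourselves, so its native memory is not parked on the model's manager.

```java
// Inside processOutput: copy the ORT output onto a manager we own, run the
// unsupported op there, and close the manager so the memory is freed per call.
NDArray logits = list.singletonOrThrow();
try (NDManager ptManager = Engine.getEngine("PyTorch").newBaseManager()) {
    NDArray copy = ptManager.create(logits.toFloatArray(), logits.getShape());
    return copy.softmax(-1).toFloatArray();
}
```

Closing ptManager after each prediction is what keeps the native memory bounded.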
Environment Info