
Memory issues while using Hybrid Engine #1516

Closed
andreabrduque opened this issue Mar 1, 2022 · 6 comments · Fixed by #1518
Labels
bug Something isn't working

Comments

@andreabrduque
Contributor

Description

We see increased memory consumption when using ONNX Runtime together with another engine (PyTorch or MXNet). In the example code I made, with fixed inputs and a single-threaded predictor, memory creeps up slowly in increments of about 20 MB; over roughly 40 minutes it grows by about 1 GB. The Java heap is fine, but the RSS of the Java process keeps increasing. Memory grows much faster in our production environment with variable inputs than in my example, possibly because of the higher prediction rate.

This happens exclusively while using the hybrid engine (which, as I understand it, kicks in when I attempt an operation not currently supported by the ONNX Runtime engine). If I don't perform any unsupported operations, I don't have any problems.

We've also noticed that the problem occurs when I keep the model loaded with ModelZoo.loadModel for a while. If I load the model and close it around every prediction, I don't have any memory problems, although inference seems slower. It looks as if some residual memory is attached to the model's resources with every prediction and is only released when the model is released.

Expected Behavior

Memory should not increase over time while running predictions.

How to Reproduce?

I made an example app here.

The critical part is in the translator:

@Override
public EmptyClassification processOutput(TranslatorContext ctx, NDList list) {
    NDArray candidates = list.get(1);

    // If I don't do unsupported operations and comment out the line below,
    // I don't have any problems, even if I keep the model open.
    NDArray batchIds = candidates.get(new NDIndex(":, 0"));

    return new EmptyClassification();
}

Model checkpoint:
https://drive.google.com/file/d/1Hm8Q4CnAjpychj3L3c4C4p6pQ517ZLef/view?usp=sharing

What have you tried to solve it?

We were able to work around this issue by managing the NDManager of the alternative engine ourselves, for example in our translator:

class CustomTranslator {
  private val alternativeEngine = Engine.getEngine("MXNet")

  override def processOutput(ctx: TranslatorContext, list: NDList): Classification = {
    val alternativeManager = alternativeEngine.newBaseManager(Device.cpu())

    // do postprocessing, attaching NDArrays to alternativeManager

    alternativeManager.close()

    ...
  }
}
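The scoped-manager workaround can be illustrated with a stdlib-only sketch (the `ScopedManager` stub below is a stand-in for DJL's `NDManager`, not the real API): arrays attached to a short-lived per-prediction manager are freed as soon as that manager closes, whereas arrays attached to the long-lived model-level manager accumulate until the model itself is closed, which matches the observed RSS growth.

```java
import java.util.ArrayList;
import java.util.List;

public class ScopedManagerSketch {
    static int aliveArrays = 0; // global count of unreleased arrays

    // Stand-in for DJL's NDManager: owns arrays and frees them on close().
    static class ScopedManager implements AutoCloseable {
        private final List<Object> owned = new ArrayList<>();

        Object createArray() {
            Object a = new Object();
            owned.add(a);
            aliveArrays++;
            return a;
        }

        @Override
        public void close() {
            aliveArrays -= owned.size();
            owned.clear();
        }
    }

    public static void main(String[] args) {
        // Leaky pattern: every "prediction" attaches its arrays to the
        // long-lived model manager, so they accumulate until the model closes.
        ScopedManager modelManager = new ScopedManager();
        for (int i = 0; i < 1000; i++) {
            modelManager.createArray();
        }
        System.out.println("model-scoped arrays alive: " + aliveArrays);
        modelManager.close();

        // Workaround pattern: a short-lived manager per prediction, closed
        // immediately after postprocessing, so nothing accumulates.
        for (int i = 0; i < 1000; i++) {
            try (ScopedManager perPrediction = new ScopedManager()) {
                perPrediction.createArray();
            }
        }
        System.out.println("per-prediction arrays alive: " + aliveArrays);
    }
}
```

The Scala workaround above does exactly this: it creates a fresh base manager per call and closes it once postprocessing is done.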

Environment Info

[DEBUG] - Registering EngineProvider: XGBoost
[DEBUG] - Registering EngineProvider: MXNet
[DEBUG] - Registering EngineProvider: PyTorch
[DEBUG] - Registering EngineProvider: TensorFlow
[DEBUG] - Found default engine: MXNet
----------- System Properties -----------
java.specification.version: 15
sun.jnu.encoding: UTF-8
java.class.path: /djl/integration/build/classes/java/main:/djl/integration/build/resources/main:/root/.gradle/caches/modules-2/files-2.1/commons-cli/commons-cli/1.4/c51c00206bb913cd8612b24abd9fa98ae89719b1/commons-cli-1.4.jar:/root/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-slf4j-impl/2.17.1/84692d456bcce689355d33d68167875e486954dd/log4j-slf4j-impl-2.17.1.jar:/djl/basicdataset/build/libs/basicdataset-0.16.0-SNAPSHOT.jar:/djl/model-zoo/build/libs/model-zoo-0.16.0-SNAPSHOT.jar:/djl/testing/build/libs/testing-0.16.0-SNAPSHOT.jar:/root/.gradle/caches/modules-2/files-2.1/org.testng/testng/7.5/1416a607fae667c14e390b484e8d02b5824c0674/testng-7.5.jar:/djl/engines/mxnet/mxnet-model-zoo/build/libs/mxnet-model-zoo-0.16.0-SNAPSHOT.jar:/djl/engines/pytorch/pytorch-model-zoo/build/libs/pytorch-model-zoo-0.16.0-SNAPSHOT.jar:/djl/engines/tensorflow/tensorflow-model-zoo/build/libs/tensorflow-model-zoo-0.16.0-SNAPSHOT.jar:/djl/engines/ml/xgboost/build/libs/xgboost-0.16.0-SNAPSHOT.jar:/djl/engines/mxnet/mxnet-engine/build/libs/mxnet-engine-0.16.0-SNAPSHOT.jar:/djl/engines/pytorch/pytorch-engine/build/libs/pytorch-engine-0.16.0-SNAPSHOT.jar:/djl/engines/tensorflow/tensorflow-engine/build/libs/tensorflow-engine-0.16.0-SNAPSHOT.jar:/djl/api/build/libs/api-0.16.0-SNAPSHOT.jar:/root/.gradle/caches/modules-2/files-2.1/org.slf4j/slf4j-api/1.7.32/cdcff33940d9f2de763bc41ea05a0be5941176c3/slf4j-api-1.7.32.jar:/root/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-core/2.17.1/779f60f3844dadc3ef597976fcb1e5127b1f343d/log4j-core-2.17.1.jar:/root/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-api/2.17.1/d771af8e336e372fb5399c99edabe0919aeaf5b2/log4j-api-2.17.1.jar:/root/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-csv/1.8/37ca9a9aa2d4be2599e55506a6d3170dd7a3df4/commons-csv-1.8.jar:/root/.gradle/caches/modules-2/files-2.1/com.google.code.findbugs/jsr305/3.0.1/f7be08ec23c21485b9b5a1cf1654c2ec8c58168d/jsr305-3.0.1.jar:/root/.g
radle/caches/modules-2/files-2.1/com.beust/jcommander/1.78/a3927de9bd6f351429bcf763712c9890629d8f51/jcommander-1.78.jar:/root/.gradle/caches/modules-2/files-2.1/org.webjars/jquery/3.5.1/2392938e374f561c27c53872bdc9b6b351b6ba34/jquery-3.5.1.jar:/root/.gradle/caches/modules-2/files-2.1/ml.dmlc/xgboost4j_2.12/1.4.1/3c769c6be531e06e20b2ac4e18d7d0cd75c0f1bb/xgboost4j_2.12-1.4.1.jar:/root/.gradle/caches/modules-2/files-2.1/commons-logging/commons-logging/1.2/4bfc12adfe4842bf07b657f0369c4cb522955686/commons-logging-1.2.jar:/root/.gradle/caches/modules-2/files-2.1/com.google.code.gson/gson/2.8.9/8a432c1d6825781e21a02db2e2c33c5fde2833b9/gson-2.8.9.jar:/root/.gradle/caches/modules-2/files-2.1/net.java.dev.jna/jna/5.9.0/8f503e6d9b500ceff299052d6be75b38c7257758/jna-5.9.0.jar:/root/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-compress/1.21/4ec95b60d4e86b5c95a0e919cb172a0af98011ef/commons-compress-1.21.jar:/djl/engines/tensorflow/tensorflow-api/build/libs/tensorflow-api-0.16.0-SNAPSHOT.jar:/root/.gradle/caches/modules-2/files-2.1/org.tensorflow/tensorflow-core-api/0.4.0/2ac35ca087607cce0e5419953cc1ef0c3a5edaea/tensorflow-core-api-0.4.0.jar:/root/.gradle/caches/modules-2/files-2.1/org.bytedeco/javacpp/1.5.6/1f18a820aadd943577b0b372554f9e35e1232e25/javacpp-1.5.6.jar:/root/.gradle/caches/modules-2/files-2.1/com.google.protobuf/protobuf-java/3.19.2/e958ce38f96b612d3819ff1c753d4d70609aea74/protobuf-java-3.19.2.jar:/root/.gradle/caches/modules-2/files-2.1/org.tensorflow/ndarray/0.3.3/1b6d8cc3e3762f6e465b884580d9fc17ab7aeb4/ndarray-0.3.3.jar
java.vm.vendor: AdoptOpenJDK
sun.arch.data.model: 64
user.variant:
java.vendor.url: https://adoptopenjdk.net/
user.timezone: Etc/UTC
java.vm.specification.version: 15
os.name: Linux
sun.java.launcher: SUN_STANDARD
user.country: US
sun.boot.library.path: /opt/java/openjdk/lib:/opt/java/openjdk/lib
sun.java.command: ai.djl.integration.util.DebugEnvironment
jdk.debug: release
sun.cpu.endian: little
user.home: /root
org.gradle.appname: gradlew
user.language: en
java.specification.vendor: Oracle Corporation
java.version.date: 2021-01-19
java.home: /opt/java/openjdk
ai.djl.logging.level: debug
org.gradle.internal.http.connectionTimeout: 60000
file.separator: /
java.vm.compressedOopsMode: Zero based
line.separator:

java.vm.specification.vendor: Oracle Corporation
java.specification.name: Java Platform API Specification
sun.management.compiler: HotSpot 64-Bit Tiered Compilers
java.runtime.version: 15.0.2+7-202101212352
user.name: root
path.separator: :
os.version: 5.11.0-1029-gcp
java.runtime.name: OpenJDK Runtime Environment
file.encoding: UTF-8
java.vm.name: OpenJDK 64-Bit Server VM
java.vendor.version: AdoptOpenJDK
java.vendor.url.bug: https://github.com/AdoptOpenJDK/openjdk-support/issues
java.io.tmpdir: /tmp
org.gradle.internal.http.socketTimeout: 120000
java.version: 15.0.2
user.dir: /djl/integration
os.arch: amd64
java.vm.specification.name: Java Virtual Machine Specification
java.library.path: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
java.vm.info: mixed mode, sharing
java.vendor: AdoptOpenJDK
java.vm.version: 15.0.2+7-202101212352
sun.io.unicode.encoding: UnicodeLittle
library.jansi.path: /root/.gradle/native/jansi/1.18/linux64
java.class.version: 59.0
org.gradle.internal.publish.checksums.insecure: true

--------- Environment Variables ---------
PATH: /opt/java/openjdk/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
NV_LIBCUSPARSE_VERSION: 11.6.0.120-1
NCCL_VERSION: 2.11.4-1
NV_NVTX_VERSION: 11.4.120-1
NV_CUDA_LIB_VERSION: 11.4.2-1
NVIDIA_REQUIRE_CUDA: cuda>=11.4 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=460,driver<461
JAVA_HOME: /opt/java/openjdk
CUDA_VERSION: 11.4.2
NVIDIA_VISIBLE_DEVICES: all
NV_LIBCUBLAS_PACKAGE_NAME: libcublas-11-4
NV_CUDA_COMPAT_PACKAGE: cuda-compat-11-4
TERM: xterm
NV_LIBNCCL_PACKAGE_NAME: libnccl2
LANG: en_US.UTF-8
NV_CUDNN_VERSION: 8.2.4.15
LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
NV_LIBCUBLAS_PACKAGE: libcublas-11-4=11.6.1.51-1
PWD: /djl
JAVA_VERSION: jdk15u
NV_LIBNPP_VERSION: 11.4.0.110-1
_: ./gradlew
NV_LIBNCCL_PACKAGE: libnccl2=2.11.4-1+cuda11.4
LANGUAGE: en_US:en
NV_CUDA_CUDART_VERSION: 11.4.108-1
NVARCH: x86_64
OLDPWD: /djl
NVIDIA_DRIVER_CAPABILITIES: compute,utility
NV_LIBNCCL_PACKAGE_VERSION: 2.11.4-1
NV_LIBNPP_PACKAGE: libnpp-11-4=11.4.0.110-1
HOSTNAME: 73ada51a8863
LC_ALL: en_US.UTF-8
NV_CUDNN_PACKAGE: libcudnn8=8.2.4.15-1+cuda11.4
NV_CUDNN_PACKAGE_NAME: libcudnn8
LS_COLORS: rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
HOME: /root
SHLVL: 1
NV_LIBCUBLAS_VERSION: 11.6.1.51-1

-------------- Directories --------------
temp directory: /tmp
DJL cache directory: /root/.djl.ai
Engine cache directory: /root/.djl.ai

------------------ CUDA -----------------
GPU Count: 1
CUDA: 114
ARCH: 75
GPU(0) memory used: 6607077376 bytes

----------------- Engines ---------------
DJL version: 0.16.0
Default Engine: MXNet
[WARN ] - No matching cuda flavor for linux found: cu114mkl/sm_75.
[DEBUG] - Loading mxnet library from: /root/.djl.ai/mxnet/1.8.0-mkl-linux-x86_64/libmxnet.so
Default Device: cpu()
PyTorch: 2
MXNet: 0
XGBoost: 10
TensorFlow: 3

--------------- Hardware --------------
Available processors (cores): 4
Byte Order: LITTLE_ENDIAN
Free memory (bytes): 209689240
Maximum memory (bytes): 3932160000
Total memory available to JVM (bytes): 251658240
Heap committed: 251658240
Heap nonCommitted: 30212096
GCC:
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

@andreabrduque andreabrduque added the bug Something isn't working label Mar 1, 2022
@frankfliu
Contributor

@andreabrduque

I ran your test code and can confirm there is a memory leak in the hybrid NDManager: on every inference, one NDArray is leaked.

I will debug further and see how to fix it.

@frankfliu
Contributor

The root cause is that we lose track of alternativeManager when a new NDManager is attached to the NDArray: https://github.com/deepjavalibrary/djl/blob/master/api/src/main/java/ai/djl/ndarray/NDArrayAdapter.java#L70

Once a new NDManager is attached to an NDArray, every NDArray created from it should fall under the new NDManager. For OrtNDArray, since all NDArrays are created by alternativeManager, the alternativeManager must also be updated: it should either become a child of the new NDManager or point to the new NDManager itself.
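The ownership bug described above can be sketched with stubs (the `Manager`/`HybridArray` names below are illustrative only; the actual fix landed in NDArrayAdapter via #1518): if attaching an array to a new manager does not also re-parent its alternative manager, anything the alternative manager creates stays rooted under the stale owner and is never released with the new one.

```java
import java.util.ArrayList;
import java.util.List;

public class ReparentSketch {
    // Illustrative stand-in for DJL's NDManager tree.
    static class Manager {
        final List<Manager> children = new ArrayList<>();

        Manager newChild() {
            Manager c = new Manager();
            children.add(c);
            return c;
        }

        // True if m sits anywhere below this manager in the ownership tree.
        boolean owns(Manager m) {
            if (children.contains(m)) return true;
            for (Manager c : children) {
                if (c.owns(m)) return true;
            }
            return false;
        }
    }

    // Illustrative stand-in for a hybrid NDArray (e.g. OrtNDArray).
    static class HybridArray {
        Manager manager;             // primary engine's manager
        Manager alternativeManager;  // secondary engine, used for unsupported ops

        HybridArray(Manager owner) {
            this.manager = owner;
            this.alternativeManager = owner.newChild();
        }

        // Buggy attach: alternativeManager still hangs off the old owner.
        void attachBuggy(Manager newOwner) {
            this.manager = newOwner;
        }

        // Fixed attach: re-parent the alternative manager as well.
        void attachFixed(Manager newOwner) {
            this.manager = newOwner;
            this.alternativeManager = newOwner.newChild();
        }
    }

    public static void main(String[] args) {
        Manager system = new Manager();
        Manager perPrediction = new Manager();

        HybridArray buggy = new HybridArray(system);
        buggy.attachBuggy(perPrediction);
        System.out.println("buggy alt under new owner: "
                + perPrediction.owns(buggy.alternativeManager));

        HybridArray fixed = new HybridArray(system);
        fixed.attachFixed(perPrediction);
        System.out.println("fixed alt under new owner: "
                + perPrediction.owns(fixed.alternativeManager));
    }
}
```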

@visionwxc

When I use the ONNX model together with PyTorch's PtNDArray, this issue still seems to arise.

@frankfliu
Contributor

> When I use the ONNX model together with PyTorch's PtNDArray, this issue still seems to arise.

Do you have code that can reproduce your issue?

@visionwxc

visionwxc commented Nov 3, 2023

> When I use the ONNX model together with PyTorch's PtNDArray, this issue still seems to arise.
>
> Do you have code that can reproduce your issue?

A code example:

private void doOcr(List<Image> images) {
    System.setProperty("PYTORCH_PRECXX11", "true");
    OcrV3Detection detection = new OcrV3Detection();
    OcrV3Recognition recognition = new OcrV3Recognition();
    try (ZooModel detectionModel = ModelZoo.loadModel(Objects.requireNonNull(detection.detectCriteria()));
         Predictor detector = detectionModel.newPredictor();
         ZooModel recognitionModel = ModelZoo.loadModel(Objects.requireNonNull(recognition.recognizeCriteria()));
         Predictor recognizer = recognitionModel.newPredictor()) {
        recognition.predictListStringBatch(images, detector, recognizer);
    } catch (Exception e) {
        logger.error("ocr fail", e);
    }
}

@Override
public String processOutput(TranslatorContext ctx, NDList list) throws IOException {
    StringBuilder sb = new StringBuilder();
    NDArray tokens = list.singletonOrThrow();
    // This line fails with an allocation error (out of memory):
    long[] indices = tokens.get(0).argMax(1).toLongArray();
    boolean[] selection = new boolean[indices.length];
    Arrays.fill(selection, true);
    for (int i = 1; i < indices.length; i++) {
        if (indices[i] == indices[i - 1]) {
            selection[i] = false;
        }
    }
    int lastIdx = 0;
    for (int i = 0; i < indices.length; i++) {
        if (selection[i] && indices[i] > 0 && !(i > 0 && indices[i] == lastIdx)) {
            sb.append(table.get((int) indices[i]));
        }
    }
    return sb.toString();
}

Environment information:
CentOS Linux release 7.9.2009 (Core)
CPU: 4 cores, 8 GB RAM

@frankfliu
Contributor

What's the error message and stack trace?

I don't think this error is related to the memory leak. Can you print out the shape of tokens? Does its size exceed 4G?
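If the "4G" concern refers to element count, the relevant limit is Java's int-indexed arrays: a tensor with more than Integer.MAX_VALUE elements cannot be materialized into the single long[] that toLongArray() produces. A quick stdlib check, given a shape (this is a generic sanity check, not a DJL API):

```java
public class ShapeSizeCheck {
    // Returns true if a tensor with the given shape fits in one Java array,
    // i.e. its element count does not exceed Integer.MAX_VALUE.
    // Math.multiplyExact throws if the count itself overflows a long.
    static boolean fitsInJavaArray(long[] shape) {
        long count = 1;
        for (long d : shape) {
            count = Math.multiplyExact(count, d);
        }
        return count <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        // Typical OCR logits shape: comfortably small.
        System.out.println(fitsInJavaArray(new long[] {1, 64, 1000}));
        // 2^16 x 2^16 = 2^32 elements: too large for a single array.
        System.out.println(fitsInJavaArray(new long[] {1L << 16, 1L << 16}));
    }
}
```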
