[tokenizer] Tokenizer always padding with [PAD], not the pad token in tokenizer.json #2669

Closed
lozybean opened this issue Jun 21, 2023 · 5 comments · Fixed by #2741
Labels
bug Something isn't working

lozybean commented Jun 21, 2023

Description

tokenizer.json is like:

{
    "padding": {
        "strategy": {
            "Fixed": 64
        },
        "direction": "Right",
        "pad_to_multiple_of": null,
        "pad_id": 1,
        "pad_type_id": 0,
        "pad_token": "<pad>"
    },
    "added_tokens": [
        {
            "id": 0,
            "special": true,
            "content": "<s>",
            "single_word": false,
            "lstrip": false,
            "rstrip": false,
            "normalized": false
        },
        {
            "id": 1,
            "special": true,
            "content": "<pad>",
            "single_word": false,
            "lstrip": false,
            "rstrip": false,
            "normalized": false
        },
     ...
   ]
}

Python pads with id 1 (pad token <pad>) at the end, but Java always pads with id 0 (pad token [PAD]).
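To make the expected values explicit, here is a minimal sketch that reads the padding section of the fragment above and checks that pad_id agrees with the id of pad_token in added_tokens. It uses only the standard json module. The helper names declared_padding and pad_token_is_consistent are made up for this illustration, and the elided added_tokens entries are left out.

```python
import json

# Fragment of the tokenizer.json shown above; elided entries are omitted.
TOKENIZER_JSON = """
{
    "padding": {
        "strategy": {"Fixed": 64},
        "direction": "Right",
        "pad_to_multiple_of": null,
        "pad_id": 1,
        "pad_type_id": 0,
        "pad_token": "<pad>"
    },
    "added_tokens": [
        {"id": 0, "special": true, "content": "<s>"},
        {"id": 1, "special": true, "content": "<pad>"}
    ]
}
"""


def declared_padding(config):
    """Return (pad_id, pad_token, fixed_length) from the padding section."""
    padding = config["padding"]
    strategy = padding["strategy"]
    # The strategy is either an object such as {"Fixed": 64} or a plain string.
    fixed = strategy.get("Fixed") if isinstance(strategy, dict) else None
    return padding["pad_id"], padding["pad_token"], fixed


def pad_token_is_consistent(config):
    """Check that pad_id is the id that added_tokens assigns to pad_token."""
    pad_id, pad_token, _ = declared_padding(config)
    ids = {t["content"]: t["id"] for t in config["added_tokens"]}
    return ids.get(pad_token) == pad_id


config = json.loads(TOKENIZER_JSON)
print(declared_padding(config))         # (1, '<pad>', 64)
print(pad_token_is_consistent(config))  # True
```

Any encoder that honors this file should therefore emit id 1 in the padded positions.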

Expected Behavior

Padding should follow the settings in tokenizer.json.
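As a rough illustration of what following tokenizer.json means here, the sketch below pads a list of ids to a fixed length using a configurable pad id and direction. The function pad_ids is a hypothetical helper, not part of any library, and the input ids are placeholders rather than real xlm-roberta-base ids.

```python
def pad_ids(ids, length, pad_id, direction="Right"):
    """Pad ids up to length with pad_id on the given side."""
    missing = max(0, length - len(ids))
    filler = [pad_id] * missing
    return ids + filler if direction == "Right" else filler + ids


# Placeholder ids standing in for an encoded sentence.
encoded = [0, 101, 102, 2]

# With pad_id 1 and direction "Right", as declared in tokenizer.json:
print(pad_ids(encoded, 8, pad_id=1))  # [0, 101, 102, 2, 1, 1, 1, 1]

# What the Java side effectively did, padding with id 0 instead:
print(pad_ids(encoded, 8, pad_id=0))  # [0, 101, 102, 2, 0, 0, 0, 0]
```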

Error Message

Python padding result:

image

Java padding result:

image

How to Reproduce?

Python transformers:

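from transformers import AutoTokenizer

# Load the fast tokenizer with a maximum length of 64.
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base', use_fast=True, model_max_length=64, padding='max_length')
# Encode once with padding to the maximum length.
tokenizer("test sentence", padding="max_length")
# Write tokenizer.json (and related files) to ./xlm-roberta-base.
tokenizer.save_pretrained('./xlm-roberta-base')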

Load xlm-roberta-base/tokenizer.json in Java and encode the same sentence.

Environment Info

Please run the command ./gradlew debugEnv from the root directory of DJL (if necessary, clone DJL first). It will output information about your system, environment, and installation that can help us debug your issue. Paste the output of the command below:

----------- System Properties -----------
java.specification.version: 16
sun.jnu.encoding: UTF-8
java.class.path: /Users/bytedance/Documents/Work/djl/integration/build/classes/java/main:/Users/bytedance/Documents/Work/djl/integration/build/resources/main:/Users/bytedance/.gradle/caches/modules-2/files-2.1/commons-cli/commons-cli/1.5.0/dc98be5d5390230684a092589d70ea76a147925c/commons-cli-1.5.0.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-slf4j-impl/2.20.0/7ab4f082fd162f60afcaf2b8744a3d959feab3e8/log4j-slf4j-impl-2.20.0.jar:/Users/bytedance/Documents/Work/djl/basicdataset/build/libs/basicdataset-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/model-zoo/build/libs/model-zoo-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/testing/build/libs/testing-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/engines/mxnet/mxnet-model-zoo/build/libs/mxnet-model-zoo-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/engines/pytorch/pytorch-model-zoo/build/libs/pytorch-model-zoo-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/engines/pytorch/pytorch-jni/build/libs/pytorch-jni-1.13.1-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/engines/tensorflow/tensorflow-model-zoo/build/libs/tensorflow-model-zoo-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/engines/ml/xgboost/build/libs/xgboost-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/engines/ml/lightgbm/build/libs/lightgbm-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/engines/onnxruntime/onnxruntime-engine/build/libs/onnxruntime-engine-0.23.0-SNAPSHOT.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-core/2.20.0/eb2a9a47b1396e00b5eee1264296729a70565cc0/log4j-core-2.20.0.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-api/2.20.0/1fe6082e660daf07c689a89c94dc0f49c26b44bb/log4j-api-2.20.0.jar:/Users/bytedance/Documents/Work/djl/engines/mxnet/mxnet-engine/build/libs/mxnet-engine-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/engines/pytorch/py
torch-engine/build/libs/pytorch-engine-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/engines/tensorflow/tensorflow-engine/build/libs/tensorflow-engine-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/api/build/libs/api-0.23.0-SNAPSHOT.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.testng/testng/7.8.0/90ff6902a350432ce23ef209b2f109bcf587069c/testng-7.8.0.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.slf4j/slf4j-api/1.7.36/6c62681a2f655b49963a5983b8b0950a6120ae14/slf4j-api-1.7.36.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-csv/1.10.0/8669bee353424c3223c93723291b5c3753260c1c/commons-csv-1.10.0.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/ml.dmlc/xgboost4j_2.12/1.7.5/5f074b329677d4e23222668597eb82de4b845a4b/xgboost4j_2.12-1.7.5.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/commons-logging/commons-logging/1.2/4bfc12adfe4842bf07b657f0369c4cb522955686/commons-logging-1.2.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/com.microsoft.ml.lightgbm/lightgbmlib/3.2.110/f6c85e5d7cc44d49c4544240ea5c96004680007b/lightgbmlib-3.2.110.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/com.microsoft.onnxruntime/onnxruntime/1.15.0/6db39caba947384ce09c3071e84cb73437a77e74/onnxruntime-1.15.0.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/com.google.code.gson/gson/2.10.1/b3add478d4382b78ea20b1671390a858002feb6c/gson-2.10.1.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/net.java.dev.jna/jna/5.13.0/1200e7ebeedbe0d10062093f32925a912020e747/jna-5.13.0.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-compress/1.23.0/4af2060ea9b0c8b74f1854c6cafe4d43cfc161fc/commons-compress-1.23.0.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/com.beust/jcommander/1.82/a7c5fef184d238065de38f81bbc6ee50cca2e21/jcommander-1.82.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.webjars/jquery/3.6.1/d08df6250157cd2db3d9b01b11b76e9b72
25083a/jquery-3.6.1.jar:/Users/bytedance/Documents/Work/djl/engines/tensorflow/tensorflow-api/build/libs/tensorflow-api-0.23.0-SNAPSHOT.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.tensorflow/tensorflow-core-api/0.5.0/6dfb7f13a9d96e6c4bd0705f122bd00d3b596b0d/tensorflow-core-api-0.5.0.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.bytedeco/javacpp/1.5.9/bee92b783ea619381df7577527f8739f778cf2f6/javacpp-1.5.9.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/com.google.protobuf/protobuf-java/3.23.3/6ea96f2109fb6cf8f827aa58eebf784c4708d01f/protobuf-java-3.23.3.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.tensorflow/ndarray/0.4.0/7ab74f002dbec93944b7feb38de013afe8d4e8de/ndarray-0.4.0.jar
java.vm.vendor: AdoptOpenJDK
sun.arch.data.model: 64
user.variant:
java.vendor.url: https://adoptopenjdk.net/
user.timezone: Asia/Shanghai
java.vm.specification.version: 16
os.name: Mac OS X
sun.java.launcher: SUN_STANDARD
user.country: CN
sun.boot.library.path: /Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home/lib:/Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home/lib
sun.java.command: ai.djl.integration.util.DebugEnvironment
http.nonProxyHosts: local|*.local|169.254/16|*.169.254/16
jdk.debug: release
sun.cpu.endian: little
user.home: /Users/bytedance
org.gradle.appname: gradlew
user.language: zh
java.specification.vendor: Oracle Corporation
java.version.date: 2021-04-20
java.home: /Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home
ai.djl.logging.level: debug
org.gradle.internal.http.connectionTimeout: 60000
file.separator: /
java.vm.compressedOopsMode: Zero based
line.separator:

java.vm.specification.vendor: Oracle Corporation
java.specification.name: Java Platform API Specification
user.script: Hans
sun.management.compiler: HotSpot 64-Bit Tiered Compilers
ftp.nonProxyHosts: local|*.local|169.254/16|*.169.254/16
java.runtime.version: 16.0.1+9
user.name: bytedance
path.separator: :
os.version: 11.4
java.runtime.name: OpenJDK Runtime Environment
file.encoding: UTF-8
java.vm.name: OpenJDK 64-Bit Server VM
java.vendor.version: AdoptOpenJDK-16.0.1+9
java.vendor.url.bug: https://github.com/AdoptOpenJDK/openjdk-support/issues
java.io.tmpdir: /var/folders/7g/38x8p3894p3_jp0jl5stv8gw0000gn/T/
org.gradle.internal.http.socketTimeout: 120000
java.version: 16.0.1
user.dir: /Users/bytedance/Documents/Work/djl/integration
os.arch: x86_64
java.vm.specification.name: Java Virtual Machine Specification
java.library.path: /Users/bytedance/Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:.
java.vm.info: mixed mode, sharing
java.vendor: AdoptOpenJDK
java.vm.version: 16.0.1+9
sun.io.unicode.encoding: UnicodeBig
library.jansi.path: /Users/bytedance/.gradle/native/jansi/1.18/osx
java.class.version: 60.0
socksNonProxyHosts: local|*.local|169.254/16|*.169.254/16
org.gradle.internal.publish.checksums.insecure: true

--------- Environment Variables ---------
PYENV_SHELL: zsh
PATH: /Users/bytedance/.gvm/pkgsets/go1.18/global/bin:/Users/bytedance/.gvm/gos/go1.18/bin:/Users/bytedance/.gvm/pkgsets/go1.18/global/overlay/bin:/Users/bytedance/.gvm/bin:/Users/bytedance/.gvm/bin:/Users/bytedance/.pyenv/shims:/Users/bytedance/.bytebm/bin:/Users/bytedance/go/bin:/usr/local/opt/thrift@0.9/bin:/Library/Java/JavaVirtualMachines/jdk1.8.0_301.jdk/Contents/Home/bin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/puppetlabs/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin
PKG_CONFIG_PATH: /Users/bytedance/.gvm/pkgsets/go1.18/global/overlay/lib/pkgconfig:
LC_TERMINAL: iTerm2
AUTOJUMP_SOURCED: 1
COLORTERM: truecolor
LD_LIBRARY_PATH: /Users/bytedance/.gvm/pkgsets/go1.18/global/overlay/lib:
TCE_HOST_IP: 10.231.238.115
LOGNAME: bytedance
BYTEBM_SOURCED: 1
PWD: /Users/bytedance/Documents/Work/djl
TERM_PROGRAM_VERSION: 3.4.15
SHELL: /bin/zsh
JAVA_MAIN_CLASS_57298: org.gradle.wrapper.GradleWrapperMain
IDL_TOKEN: *********
PAGER: less
CONSUL_HTTP_HOST: 10.231.8.103
PYENV_ROOT: /Users/bytedance/.pyenv
SECURITYSESSIONID: 186c0
GOPATH: /Users/bytedance/.gvm/pkgsets/go1.18/global
gvm_go_name: go1.18
OLDPWD: /Users/bytedance/Documents/Work/djl
ZSH: /Users/bytedance/.oh-my-zsh
ITERM_PROFILE: Default
TMPDIR: /var/folders/7g/38x8p3894p3_jp0jl5stv8gw0000gn/T/
JAVA_MAIN_CLASS_57527: ai.djl.integration.util.DebugEnvironment
XPC_FLAGS: 0x0
TERM_SESSION_ID: w0t0p0:12661C62-3E49-4347-95B2-BF7EA406A09C
__CF_USER_TEXT_ENCODING: 0x1F5:0x19:0x34
LESS: -R
COLORFGBG: 7;0
SHLVL: 1
AUTOJUMP_ERROR_PATH: /Users/bytedance/Library/autojump/errors.log
GVM_PATH_BACKUP: /Users/bytedance/.gvm/bin:/Users/bytedance/.pyenv/shims:/Users/bytedance/.bytebm/bin:/Users/bytedance/go/bin:/usr/local/opt/thrift@0.9/bin:/Library/Java/JavaVirtualMachines/jdk1.8.0_301.jdk/Contents/Home/bin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/puppetlabs/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin
CONSUL_HTTP_PORT: 8018
KMS_ZONE: IBOE
JAVA_HOME: /Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home
GVM_VERSION: 1.0.22
TERM: xterm-256color
LANG: zh_CN.UTF-8
COMMAND_MODE: unix2003
RUNTIME_IDC_NAME: boei18n
GVM_ROOT: /Users/bytedance/.gvm
ITERM_SESSION_ID: w0t0p0:12661C62-3E49-4347-95B2-BF7EA406A09C
GVM_OVERLAY_PREFIX: /Users/bytedance/.gvm/pkgsets/go1.18/global/overlay
XPC_SERVICE_NAME: 0
JABBA_HOME: /Users/bytedance/.jabba
__CFBundleIdentifier: com.googlecode.iterm2
LC_TERMINAL_VERSION: 3.4.15
APP_ICON_57298: /Users/bytedance/Documents/Work/djl/media/gradle.icns
TERM_PROGRAM: iTerm.app
LSCOLORS: Gxfxcxdxbxegedabagacad
USER: bytedance
CLASSPATH: /Users/bytedance/Documents/Work/djl/gradle/wrapper/gradle-wrapper.jar
GOROOT: /Users/bytedance/.gvm/gos/go1.18
gvm_pkgset_name: global
LaunchInstanceID: E82D6652-45C9-4017-91B3-FE4D341A66CE
SSH_AUTH_SOCK: /private/tmp/com.apple.launchd.FQmWbztyM7/Listeners
APP_NAME_57298: Gradle
HOME: /Users/bytedance

-------------- Directories --------------
temp directory: /var/folders/7g/38x8p3894p3_jp0jl5stv8gw0000gn/T
DJL cache directory: /Users/bytedance/.djl.ai
Engine cache directory: /Users/bytedance/.djl.ai

------------------ CUDA -----------------
[DEBUG] - cudart library not found.
GPU Count: 0

----------------- Engines ---------------
DJL version: 0.23.0-SNAPSHOT
[DEBUG] - Using cache dir: /Users/bytedance/.djl.ai/mxnet/1.9.1-mkl-osx-x86_64
[INFO ] - Downloading libmxnet.dylib ...
[DEBUG] - Loading mxnet library from: /Users/bytedance/.djl.ai/mxnet/1.9.1-mkl-osx-x86_64/libmxnet.dylib
[DEBUG] - Using cache dir: /Users/bytedance/.djl.ai/mxnet/1.9.1-mkl-osx-x86_64
Default Engine: MXNet:1.9.0, capabilities: [
        BLAS_APPLE,
        CPU_SSE,
        SIGNAL_HANDLER,
        LAPACK,
        CPU_SSE2,
        CPU_SSE3,
        CPU_SSE4_1,
        OPENCV,
        MKLDNN,
]
MXNet Library: /Users/bytedance/.djl.ai/mxnet/1.9.1-mkl-osx-x86_64/libmxnet.dylib
Default Device: cpu()
PyTorch: 2
MXNet: 0
XGBoost: 10
LightGBM: 10
OnnxRuntime: 10
TensorFlow: 3

--------------- Hardware --------------
Available processors (cores): 8
Byte Order: LITTLE_ENDIAN
Free memory (bytes): 487349248
Maximum memory (bytes): 8589934592
Total memory available to JVM (bytes): 545259520
Heap committed: 545259520
Heap nonCommitted: 30408704
GCC:
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: x86_64-apple-darwin22.5.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
lozybean added the bug label Jun 21, 2023
@KexinFeng
Contributor

Could you provide Java code that reproduces this error?

Also, I'm wondering what it returns when you decode it back. Is '0' decoded as [PAD] in Java? If so, does it work for you to use '0' as padding_id?
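On the question of using '0' as the padding id: according to the added_tokens list in the issue description, id 0 already belongs to <s> and id 1 belongs to <pad>. The sketch below only restates that mapping; the collides helper is hypothetical and exists purely for illustration.

```python
# The two added_tokens entries quoted in the issue description.
ADDED_TOKENS = [
    {"id": 0, "content": "<s>"},
    {"id": 1, "content": "<pad>"},
]

id_to_token = {t["id"]: t["content"] for t in ADDED_TOKENS}


def collides(pad_id, pad_token):
    """Return True when pad_id is already assigned to a different token."""
    return id_to_token.get(pad_id, pad_token) != pad_token


print(collides(0, "[PAD]"))  # True: id 0 is <s> in this vocabulary
print(collides(1, "<pad>"))  # False: id 1 is the declared pad token
```

Under that mapping, padding with id 0 would reuse the id of <s> instead of the declared pad token.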

@lozybean
Author

lozybean commented Aug 9, 2023

> Could you provide a java code that reproduces this error?
>
> Also I'm wondering what it returns when you decode it back. Is '0' decoded as [PAD] in java? If so, does it work for you to use '0' as padding_id?

Just use the tokenizer.json from xlm-roberta-base. In this tokenizer.json, the padding token is <pad>, and it should be encoded as '1'.

@frankfliu
Contributor

I'm able to reproduce your issue:

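        import ai.djl.huggingface.tokenizers.Encoding;
        import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
        import java.util.Arrays;

        // Pad every encoding to a fixed length of 50 tokens.
        HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.builder()
                .optTokenizerName("xlm-roberta-base")
                .optMaxLength(50)
                .optPadToMaxLength()
                .build();
        Encoding encoding = tokenizer.encode("test sentence");
        String[] tokens = encoding.getTokens();
        // Print the token strings, including the padding tokens.
        System.out.println(Arrays.toString(tokens));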

[<s>, ▁test, ▁sentence, </s>, [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD]]

@siddvenk can you take a look at why Python and Rust produce different results?

@siddvenk
Contributor

siddvenk commented Aug 9, 2023

Thanks for raising this issue. There was a bug where we would overwrite the padding and truncation settings in the tokenizer.json file with the defaults from the Rust tokenizers library. This will be fixed via #2741.
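The effect described above can be sketched roughly as follows. This is not the DJL code or the change in #2741. The function names are hypothetical, and the default values (pad id 0, pad token [PAD]) are an assumption chosen to match the Java output earlier in the thread.

```python
# Assumed library defaults, consistent with the [PAD] output seen from Java.
LIBRARY_DEFAULTS = {
    "direction": "Right",
    "pad_id": 0,
    "pad_type_id": 0,
    "pad_token": "[PAD]",
}

# Padding section taken from the tokenizer.json in the issue description.
FILE_PADDING = {
    "direction": "Right",
    "pad_id": 1,
    "pad_type_id": 0,
    "pad_token": "<pad>",
}


def padding_overwritten(file_padding, overrides):
    """Problem behavior: start from library defaults and ignore the file."""
    return {**LIBRARY_DEFAULTS, **overrides}


def padding_preserved(file_padding, overrides):
    """Intended behavior: start from the file, fall back to defaults if absent."""
    base = file_padding if file_padding is not None else LIBRARY_DEFAULTS
    return {**base, **overrides}


print(padding_overwritten(FILE_PADDING, {})["pad_token"])  # [PAD]
print(padding_preserved(FILE_PADDING, {})["pad_token"])    # <pad>
```

In this sketch, explicit overrides still win over the file, so only the choice of starting point differs between the two functions.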

@siddvenk
Contributor

The PR has been merged. You can try out the latest snapshot version tomorrow, and that should fix the issue.
