[tokenizer] Tokenizer always padding with [PAD], not the pad token in tokenizer.json #2669

lozybean · 2023-06-21T04:51:39Z

Description

The tokenizer.json looks like this:

{
    "padding": {
        "strategy": {
            "Fixed": 64
        },
        "direction": "Right",
        "pad_to_multiple_of": null,
        "pad_id": 1,
        "pad_type_id": 0,
        "pad_token": "<pad>"
    },
    "added_tokens": [
        {
            "id": 0,
            "special": true,
            "content": "<s>",
            "single_word": false,
            "lstrip": false,
            "rstrip": false,
            "normalized": false
        },
        {
            "id": 1,
            "special": true,
            "content": "<pad>",
            "single_word": false,
            "lstrip": false,
            "rstrip": false,
            "normalized": false
        },
     ...
   ]
}

Python pads with id 1 (pad token <pad>) at the end, but Java always pads with id 0 (pad token [PAD]).
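As a quick way to confirm what the file itself declares, the padding section can be read directly and cross-checked against added_tokens. This is a minimal sketch using only the Python standard library; the embedded JSON is a trimmed copy of the fragment quoted above, and expected_padding is a hypothetical helper name, not part of transformers or DJL.

```python
import json

# Trimmed copy of the tokenizer.json fragment quoted in this issue.
TOKENIZER_JSON = """
{
  "padding": {
    "strategy": {"Fixed": 64},
    "direction": "Right",
    "pad_to_multiple_of": null,
    "pad_id": 1,
    "pad_type_id": 0,
    "pad_token": "<pad>"
  },
  "added_tokens": [
    {"id": 0, "special": true, "content": "<s>"},
    {"id": 1, "special": true, "content": "<pad>"}
  ]
}
"""


def expected_padding(config):
    """Return (pad_id, pad_token) declared in the "padding" section.

    Hypothetical helper: raises ValueError if the section is missing or
    if pad_token appears in added_tokens under a different id.
    """
    padding = config.get("padding")
    if not padding:
        raise ValueError("tokenizer.json declares no padding section")
    pad_id, pad_token = padding["pad_id"], padding["pad_token"]
    for token in config.get("added_tokens", []):
        if token["content"] == pad_token and token["id"] != pad_id:
            raise ValueError(
                f"pad_token {pad_token!r} has id {token['id']}, not {pad_id}"
            )
    return pad_id, pad_token


config = json.loads(TOKENIZER_JSON)
print(expected_padding(config))  # (1, '<pad>')
```

Any tokenizer that honors this file should therefore pad with id 1 and the token <pad>, which is the behavior reported for Python.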

Expected Behavior

Padding should follow the settings in tokenizer.json.

Error Message

Python padding result:

Java padding result:

How to Reproduce?

Python (transformers):

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base', use_fast=True, model_max_length = 64, padding='max_length')
tokenizer("test sentence", padding="max_length")
tokenizer.save_pretrained('./xlm-roberta-base')

Then load xlm-roberta-base/tokenizer.json in Java and encode the same sentence.

Environment Info

Please run the command ./gradlew debugEnv from the root directory of DJL (if necessary, clone DJL first). It will output information about your system, environment, and installation that can help us debug your issue. Paste the output of the command below:

----------- System Properties -----------
java.specification.version: 16
sun.jnu.encoding: UTF-8
java.class.path: /Users/bytedance/Documents/Work/djl/integration/build/classes/java/main:/Users/bytedance/Documents/Work/djl/integration/build/resources/main:/Users/bytedance/.gradle/caches/modules-2/files-2.1/commons-cli/commons-cli/1.5.0/dc98be5d5390230684a092589d70ea76a147925c/commons-cli-1.5.0.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-slf4j-impl/2.20.0/7ab4f082fd162f60afcaf2b8744a3d959feab3e8/log4j-slf4j-impl-2.20.0.jar:/Users/bytedance/Documents/Work/djl/basicdataset/build/libs/basicdataset-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/model-zoo/build/libs/model-zoo-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/testing/build/libs/testing-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/engines/mxnet/mxnet-model-zoo/build/libs/mxnet-model-zoo-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/engines/pytorch/pytorch-model-zoo/build/libs/pytorch-model-zoo-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/engines/pytorch/pytorch-jni/build/libs/pytorch-jni-1.13.1-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/engines/tensorflow/tensorflow-model-zoo/build/libs/tensorflow-model-zoo-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/engines/ml/xgboost/build/libs/xgboost-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/engines/ml/lightgbm/build/libs/lightgbm-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/engines/onnxruntime/onnxruntime-engine/build/libs/onnxruntime-engine-0.23.0-SNAPSHOT.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-core/2.20.0/eb2a9a47b1396e00b5eee1264296729a70565cc0/log4j-core-2.20.0.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-api/2.20.0/1fe6082e660daf07c689a89c94dc0f49c26b44bb/log4j-api-2.20.0.jar:/Users/bytedance/Documents/Work/djl/engines/mxnet/mxnet-engine/build/libs/mxnet-engine-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/engines/pytorch/py
torch-engine/build/libs/pytorch-engine-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/engines/tensorflow/tensorflow-engine/build/libs/tensorflow-engine-0.23.0-SNAPSHOT.jar:/Users/bytedance/Documents/Work/djl/api/build/libs/api-0.23.0-SNAPSHOT.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.testng/testng/7.8.0/90ff6902a350432ce23ef209b2f109bcf587069c/testng-7.8.0.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.slf4j/slf4j-api/1.7.36/6c62681a2f655b49963a5983b8b0950a6120ae14/slf4j-api-1.7.36.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-csv/1.10.0/8669bee353424c3223c93723291b5c3753260c1c/commons-csv-1.10.0.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/ml.dmlc/xgboost4j_2.12/1.7.5/5f074b329677d4e23222668597eb82de4b845a4b/xgboost4j_2.12-1.7.5.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/commons-logging/commons-logging/1.2/4bfc12adfe4842bf07b657f0369c4cb522955686/commons-logging-1.2.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/com.microsoft.ml.lightgbm/lightgbmlib/3.2.110/f6c85e5d7cc44d49c4544240ea5c96004680007b/lightgbmlib-3.2.110.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/com.microsoft.onnxruntime/onnxruntime/1.15.0/6db39caba947384ce09c3071e84cb73437a77e74/onnxruntime-1.15.0.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/com.google.code.gson/gson/2.10.1/b3add478d4382b78ea20b1671390a858002feb6c/gson-2.10.1.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/net.java.dev.jna/jna/5.13.0/1200e7ebeedbe0d10062093f32925a912020e747/jna-5.13.0.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-compress/1.23.0/4af2060ea9b0c8b74f1854c6cafe4d43cfc161fc/commons-compress-1.23.0.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/com.beust/jcommander/1.82/a7c5fef184d238065de38f81bbc6ee50cca2e21/jcommander-1.82.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.webjars/jquery/3.6.1/d08df6250157cd2db3d9b01b11b76e9b72
25083a/jquery-3.6.1.jar:/Users/bytedance/Documents/Work/djl/engines/tensorflow/tensorflow-api/build/libs/tensorflow-api-0.23.0-SNAPSHOT.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.tensorflow/tensorflow-core-api/0.5.0/6dfb7f13a9d96e6c4bd0705f122bd00d3b596b0d/tensorflow-core-api-0.5.0.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.bytedeco/javacpp/1.5.9/bee92b783ea619381df7577527f8739f778cf2f6/javacpp-1.5.9.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/com.google.protobuf/protobuf-java/3.23.3/6ea96f2109fb6cf8f827aa58eebf784c4708d01f/protobuf-java-3.23.3.jar:/Users/bytedance/.gradle/caches/modules-2/files-2.1/org.tensorflow/ndarray/0.4.0/7ab74f002dbec93944b7feb38de013afe8d4e8de/ndarray-0.4.0.jar
java.vm.vendor: AdoptOpenJDK
sun.arch.data.model: 64
user.variant:
java.vendor.url: https://adoptopenjdk.net/
user.timezone: Asia/Shanghai
java.vm.specification.version: 16
os.name: Mac OS X
sun.java.launcher: SUN_STANDARD
user.country: CN
sun.boot.library.path: /Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home/lib:/Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home/lib
sun.java.command: ai.djl.integration.util.DebugEnvironment
http.nonProxyHosts: local|*.local|169.254/16|*.169.254/16
jdk.debug: release
sun.cpu.endian: little
user.home: /Users/bytedance
org.gradle.appname: gradlew
user.language: zh
java.specification.vendor: Oracle Corporation
java.version.date: 2021-04-20
java.home: /Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home
ai.djl.logging.level: debug
org.gradle.internal.http.connectionTimeout: 60000
file.separator: /
java.vm.compressedOopsMode: Zero based
line.separator:

java.vm.specification.vendor: Oracle Corporation
java.specification.name: Java Platform API Specification
user.script: Hans
sun.management.compiler: HotSpot 64-Bit Tiered Compilers
ftp.nonProxyHosts: local|*.local|169.254/16|*.169.254/16
java.runtime.version: 16.0.1+9
user.name: bytedance
path.separator: :
os.version: 11.4
java.runtime.name: OpenJDK Runtime Environment
file.encoding: UTF-8
java.vm.name: OpenJDK 64-Bit Server VM
java.vendor.version: AdoptOpenJDK-16.0.1+9
java.vendor.url.bug: https://github.com/AdoptOpenJDK/openjdk-support/issues
java.io.tmpdir: /var/folders/7g/38x8p3894p3_jp0jl5stv8gw0000gn/T/
org.gradle.internal.http.socketTimeout: 120000
java.version: 16.0.1
user.dir: /Users/bytedance/Documents/Work/djl/integration
os.arch: x86_64
java.vm.specification.name: Java Virtual Machine Specification
java.library.path: /Users/bytedance/Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:.
java.vm.info: mixed mode, sharing
java.vendor: AdoptOpenJDK
java.vm.version: 16.0.1+9
sun.io.unicode.encoding: UnicodeBig
library.jansi.path: /Users/bytedance/.gradle/native/jansi/1.18/osx
java.class.version: 60.0
socksNonProxyHosts: local|*.local|169.254/16|*.169.254/16
org.gradle.internal.publish.checksums.insecure: true

--------- Environment Variables ---------
PYENV_SHELL: zsh
PATH: /Users/bytedance/.gvm/pkgsets/go1.18/global/bin:/Users/bytedance/.gvm/gos/go1.18/bin:/Users/bytedance/.gvm/pkgsets/go1.18/global/overlay/bin:/Users/bytedance/.gvm/bin:/Users/bytedance/.gvm/bin:/Users/bytedance/.pyenv/shims:/Users/bytedance/.bytebm/bin:/Users/bytedance/go/bin:/usr/local/opt/thrift@0.9/bin:/Library/Java/JavaVirtualMachines/jdk1.8.0_301.jdk/Contents/Home/bin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/puppetlabs/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin
PKG_CONFIG_PATH: /Users/bytedance/.gvm/pkgsets/go1.18/global/overlay/lib/pkgconfig:
LC_TERMINAL: iTerm2
AUTOJUMP_SOURCED: 1
COLORTERM: truecolor
LD_LIBRARY_PATH: /Users/bytedance/.gvm/pkgsets/go1.18/global/overlay/lib:
TCE_HOST_IP: 10.231.238.115
LOGNAME: bytedance
BYTEBM_SOURCED: 1
PWD: /Users/bytedance/Documents/Work/djl
TERM_PROGRAM_VERSION: 3.4.15
SHELL: /bin/zsh
JAVA_MAIN_CLASS_57298: org.gradle.wrapper.GradleWrapperMain
IDL_TOKEN: *********
PAGER: less
CONSUL_HTTP_HOST: 10.231.8.103
PYENV_ROOT: /Users/bytedance/.pyenv
SECURITYSESSIONID: 186c0
GOPATH: /Users/bytedance/.gvm/pkgsets/go1.18/global
gvm_go_name: go1.18
OLDPWD: /Users/bytedance/Documents/Work/djl
ZSH: /Users/bytedance/.oh-my-zsh
ITERM_PROFILE: Default
TMPDIR: /var/folders/7g/38x8p3894p3_jp0jl5stv8gw0000gn/T/
JAVA_MAIN_CLASS_57527: ai.djl.integration.util.DebugEnvironment
XPC_FLAGS: 0x0
TERM_SESSION_ID: w0t0p0:12661C62-3E49-4347-95B2-BF7EA406A09C
__CF_USER_TEXT_ENCODING: 0x1F5:0x19:0x34
LESS: -R
COLORFGBG: 7;0
SHLVL: 1
AUTOJUMP_ERROR_PATH: /Users/bytedance/Library/autojump/errors.log
GVM_PATH_BACKUP: /Users/bytedance/.gvm/bin:/Users/bytedance/.pyenv/shims:/Users/bytedance/.bytebm/bin:/Users/bytedance/go/bin:/usr/local/opt/thrift@0.9/bin:/Library/Java/JavaVirtualMachines/jdk1.8.0_301.jdk/Contents/Home/bin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/puppetlabs/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin
CONSUL_HTTP_PORT: 8018
KMS_ZONE: IBOE
JAVA_HOME: /Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home
GVM_VERSION: 1.0.22
TERM: xterm-256color
LANG: zh_CN.UTF-8
COMMAND_MODE: unix2003
RUNTIME_IDC_NAME: boei18n
GVM_ROOT: /Users/bytedance/.gvm
ITERM_SESSION_ID: w0t0p0:12661C62-3E49-4347-95B2-BF7EA406A09C
GVM_OVERLAY_PREFIX: /Users/bytedance/.gvm/pkgsets/go1.18/global/overlay
XPC_SERVICE_NAME: 0
JABBA_HOME: /Users/bytedance/.jabba
__CFBundleIdentifier: com.googlecode.iterm2
LC_TERMINAL_VERSION: 3.4.15
APP_ICON_57298: /Users/bytedance/Documents/Work/djl/media/gradle.icns
TERM_PROGRAM: iTerm.app
LSCOLORS: Gxfxcxdxbxegedabagacad
USER: bytedance
CLASSPATH: /Users/bytedance/Documents/Work/djl/gradle/wrapper/gradle-wrapper.jar
GOROOT: /Users/bytedance/.gvm/gos/go1.18
gvm_pkgset_name: global
LaunchInstanceID: E82D6652-45C9-4017-91B3-FE4D341A66CE
SSH_AUTH_SOCK: /private/tmp/com.apple.launchd.FQmWbztyM7/Listeners
APP_NAME_57298: Gradle
HOME: /Users/bytedance

-------------- Directories --------------
temp directory: /var/folders/7g/38x8p3894p3_jp0jl5stv8gw0000gn/T
DJL cache directory: /Users/bytedance/.djl.ai
Engine cache directory: /Users/bytedance/.djl.ai

------------------ CUDA -----------------
[DEBUG] - cudart library not found.
GPU Count: 0

----------------- Engines ---------------
DJL version: 0.23.0-SNAPSHOT
[DEBUG] - Using cache dir: /Users/bytedance/.djl.ai/mxnet/1.9.1-mkl-osx-x86_64
[INFO ] - Downloading libmxnet.dylib ...
[DEBUG] - Loading mxnet library from: /Users/bytedance/.djl.ai/mxnet/1.9.1-mkl-osx-x86_64/libmxnet.dylib
[DEBUG] - Using cache dir: /Users/bytedance/.djl.ai/mxnet/1.9.1-mkl-osx-x86_64
Default Engine: MXNet:1.9.0, capabilities: [
        BLAS_APPLE,
        CPU_SSE,
        SIGNAL_HANDLER,
        LAPACK,
        CPU_SSE2,
        CPU_SSE3,
        CPU_SSE4_1,
        OPENCV,
        MKLDNN,
]
MXNet Library: /Users/bytedance/.djl.ai/mxnet/1.9.1-mkl-osx-x86_64/libmxnet.dylib
Default Device: cpu()
PyTorch: 2
MXNet: 0
XGBoost: 10
LightGBM: 10
OnnxRuntime: 10
TensorFlow: 3

--------------- Hardware --------------
Available processors (cores): 8
Byte Order: LITTLE_ENDIAN
Free memory (bytes): 487349248
Maximum memory (bytes): 8589934592
Total memory available to JVM (bytes): 545259520
Heap committed: 545259520
Heap nonCommitted: 30408704
GCC:
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: x86_64-apple-darwin22.5.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin


KexinFeng · 2023-07-05T19:48:30Z

Could you provide Java code that reproduces this error?

Also I'm wondering what it returns when you decode it back. Is '0' decoded as [PAD] in java? If so, does it work for you to use '0' as padding_id?

lozybean · 2023-08-09T12:11:39Z

> Could you provide Java code that reproduces this error?
>
> Also I'm wondering what it returns when you decode it back. Is '0' decoded as [PAD] in java? If so, does it work for you to use '0' as padding_id?

Just use the tokenizer.json from xlm-roberta-base. In this tokenizer.json, the padding token is <pad> and it should be encoded as '1'.

frankfliu · 2023-08-09T17:58:17Z

I'm able to reproduce your issue:

        HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.builder()
                .optTokenizerName("xlm-roberta-base")
                .optMaxLength(50)
                .optPadToMaxLength()
                .build();
        Encoding encoding = tokenizer.encode("test sentence");
        String[] tokens = encoding.getTokens();
        System.out.println(Arrays.toString(tokens));

[<s>, ▁test, ▁sentence, </s>, [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD], [PAD]]

@siddvenk can you take a look at why Python and Rust produce different results?

siddvenk · 2023-08-09T23:00:08Z

Thanks for raising this issue. There was a bug where we would overwrite the padding and truncation settings in the tokenizer.json file with the defaults from the tokenizer Rust library. This will be fixed via #2741.
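The precedence problem described here can be illustrated with a small model. This is only a sketch under stated assumptions, not the code from #2741: LIBRARY_DEFAULTS, resolve_padding and resolve_padding_buggy are hypothetical names, and the default values are assumed so that they match the [PAD] / 0 output reported in this thread.

```python
# Stand-in for the tokenizer library defaults (assumed values, chosen to
# match the [PAD] / id 0 padding reported in this issue).
LIBRARY_DEFAULTS = {"pad_id": 0, "pad_token": "[PAD]"}


def resolve_padding_buggy(file_padding):
    """Model of the reported behavior: defaults are applied last,
    so they clobber whatever tokenizer.json declared."""
    resolved = dict(file_padding or {})
    resolved.update(LIBRARY_DEFAULTS)
    return resolved


def resolve_padding(file_padding, user_options=None):
    """Model of the intended precedence: user options > file > defaults."""
    resolved = dict(LIBRARY_DEFAULTS)
    resolved.update(file_padding or {})
    # Only options the caller explicitly set (not None) may override the file.
    explicit = {k: v for k, v in (user_options or {}).items() if v is not None}
    resolved.update(explicit)
    return resolved


file_padding = {"pad_id": 1, "pad_token": "<pad>"}
print(resolve_padding_buggy(file_padding))  # {'pad_id': 0, 'pad_token': '[PAD]'}
print(resolve_padding(file_padding))        # {'pad_id': 1, 'pad_token': '<pad>'}
```

The design point is that a setting should only replace the file's value when the caller actually supplied it; an unset option falls through to tokenizer.json, and the library default is the last resort.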

siddvenk · 2023-08-10T00:26:32Z

The PR has been merged. You can try out the latest snapshot version tomorrow, which should fix the issue.

lozybean added the bug label Jun 21, 2023

siddvenk mentioned this issue Aug 9, 2023

Fix issue with setPadding and setTruncation overriding configurations… #2741

Merged

siddvenk closed this as completed in #2741 Aug 10, 2023
