[tokenizer] Tokenizer always padding with [PAD], not the pad token in tokenizer.json #2669
Comments
Could you provide Java code that reproduces this error? I'm also wondering what it returns when you decode it back. Is '0' decoded as [PAD] in Java? If so, does it work for you to use '0' as the padding_id?
Just use tokenizer.json from
I'm able to reproduce your issue:
@siddvenk can you take a look at why Python and Rust produce different results?
Thanks for raising this issue. There was a bug where we would overwrite the padding and truncation settings from the tokenizer.json file with the defaults from the tokenizer Rust library. This will be fixed via #2741.
The PR has been merged. You can try the latest snapshot version tomorrow, which should fix the issue.
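The root cause described in the comments above (library defaults overwriting the padding settings from tokenizer.json) can be illustrated with a minimal sketch. All names here are hypothetical and chosen for illustration; this is not DJL's or the tokenizers library's actual code, and the `file_config` values are assumed from the pad token and ID reported in this issue.

```python
# Hypothetical illustration only; these names are not DJL's or tokenizers' API.

# Defaults matching the behaviour reported in the issue: pad with id 0 / "[PAD]".
LIBRARY_DEFAULTS = {"pad_id": 0, "pad_token": "[PAD]"}


def resolve_padding_buggy(file_config, user_options):
    """Start from library defaults and ignore what tokenizer.json says."""
    resolved = dict(LIBRARY_DEFAULTS)
    resolved.update(user_options)
    return resolved


def resolve_padding_fixed(file_config, user_options):
    """Layer settings: defaults, then tokenizer.json, then explicit user options."""
    resolved = dict(LIBRARY_DEFAULTS)
    resolved.update(file_config or {})
    resolved.update(user_options)
    return resolved


# Assumed padding settings for a tokenizer whose pad token is <pad> with id 1.
file_config = {"pad_id": 1, "pad_token": "<pad>"}

buggy = resolve_padding_buggy(file_config, user_options={})
fixed = resolve_padding_fixed(file_config, user_options={})
print(buggy)
print(fixed)
```

In the fixed variant, the file's settings win over the defaults, while an explicit user option would still take precedence over both.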
Description
tokenizer.json is like:
Python pads with 1 (pad token <pad>) at the end, but Java always pads with 0 (pad token [PAD]).
Expected Behavior
Java should pad with the pad token defined in tokenizer.json, the same as Python.
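As a rough sketch of the expected behavior, the snippet below reads the pad ID from a padding section and right-pads a batch with it. The JSON fragment is an assumption (the issue's actual tokenizer.json content is not shown here), the token IDs are made up, and `pad_batch` is a hypothetical helper, not a DJL or transformers function.

```python
import json

# Assumed fragment of a tokenizer.json "padding" section; the real file is not shown.
TOKENIZER_JSON = """
{
  "padding": {
    "strategy": "BatchLongest",
    "direction": "Right",
    "pad_id": 1,
    "pad_type_id": 0,
    "pad_token": "<pad>"
  }
}
"""


def pad_batch(batch, pad_id):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (longest - len(ids)) for ids in batch]


padding = json.loads(TOKENIZER_JSON)["padding"]
batch = [[0, 101, 102, 2], [0, 103, 2]]  # made-up token ids for illustration

# The shorter sequence is filled with the configured pad_id (1), not a default of 0.
padded = pad_batch(batch, padding["pad_id"])
print(padded)
```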
Error Message
Python padding result:
Java padding result:
How to Reproduce?
Python transformers:
Load xml-reberta-base/tokenizer.json in Java and call encode.
Environment Info
Please run the command ./gradlew debugEnv from the root directory of DJL (if necessary, clone DJL first). It will output information about your system, environment, and installation that can help us debug your issue. Paste the output of the command below: