For some inputs, the character spans returned by Encoding.getCharTokenSpans() do not correspond to the correct character spans in the input string.
Expected Behavior
Encoding.getCharTokenSpans() should return the correct character span, and it should return the same character spans as when using the same tokenizer in Python.
Error Message
No error message.
Steps to reproduce
The error can be reproduced with this short Scala snippet:
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer

val someText = "just some string including a £9.00 price"
val tokenizer = HuggingFaceTokenizer.newInstance("distilbert-base-uncased")
val encodedText = tokenizer.encode(someText)
val charSpans = encodedText.getCharTokenSpans()
val tokens = encodedText.getTokens()
// Index 0 is the [CLS] token, so the loop starts at 1.
for (i <- 1 until charSpans.length) {
  // Print each token next to the substring its span selects from the input.
  println(s"token: ${tokens(i)}, orig string: ${someText.substring(charSpans(i).getStart, charSpans(i).getEnd)}, span start=${charSpans(i).getStart}, span end=${charSpans(i).getEnd}")
}
That is, for some tokens (in this case "£") the DJL version of the tokenizer returns a different span than the regular Python version. The span returned by DJL has length 2, even though the token is clearly just one character long.
We are able to reproduce the issue you are having. When we tried with $ instead, we did not observe the same problem. Our current guess is a different interpretation of characters with code points above 127 (outside ASCII), most of which take 2 bytes instead of one. That is why you see a length of 2. We are working to root-cause this issue.
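The byte-versus-character guess can be checked without the tokenizer at all. The following is a minimal sketch using only the JVM standard library; the object name `Utf8Widths` and the helper `utf8Length` are made up for illustration, and the sketch assumes the offsets in question are measured in UTF-8 bytes, which this thread has not confirmed.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object Utf8Widths {
  // Number of bytes a string occupies when encoded as UTF-8.
  // (Illustrative helper, not part of DJL.)
  def utf8Length(s: String): Int = s.getBytes(UTF_8).length

  def main(args: Array[String]): Unit = {
    // "$" is ASCII; "£" (U+00A3) is above code point 127.
    for (s <- Seq("$", "£")) {
      println(s"'$s': chars=${s.length}, utf8 bytes=${utf8Length(s)}")
    }
  }
}
```

Both strings are one character long, but "£" encodes to two UTF-8 bytes while "$" encodes to one, which would be consistent with the issue showing up for "£" and not for "$".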
This script has the following output
So for some reason the 6th token ("£") gets a character span of two characters, and then the rest of the spans are off by one.
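One way to see how a single two-byte character could produce exactly this pattern is to compare character offsets with UTF-8 byte offsets for pieces of the sample string. This is only a sketch under the assumption that the reported spans are byte offsets; `ByteOffsets`, `charToByteOffset` and `byteToCharOffset` are hypothetical helpers, not DJL APIs.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object ByteOffsets {
  // UTF-8 byte offset of the character at index charOffset.
  def charToByteOffset(text: String, charOffset: Int): Int =
    text.substring(0, charOffset).getBytes(UTF_8).length

  // Hypothetical workaround: map a UTF-8 byte offset back to a char index.
  // Assumes byteOffset falls on a character boundary of text.
  def byteToCharOffset(text: String, byteOffset: Int): Int =
    new String(text.getBytes(UTF_8), 0, byteOffset, UTF_8).length

  def main(args: Array[String]): Unit = {
    val text = "just some string including a £9.00 price"
    // Compare both kinds of offsets before, at, and after the "£".
    for (piece <- Seq("including", "£", "9", "price")) {
      val start = text.indexOf(piece)
      val end = start + piece.length
      println(s"$piece: chars [$start, $end), bytes [${charToByteOffset(text, start)}, ${charToByteOffset(text, end)})")
    }
  }
}
```

For this string, "£" occupies characters [29, 30) but bytes [29, 31), and "9" occupies characters [30, 31) but bytes [31, 32): a span of length 2 for "£" followed by offsets shifted by one, while everything before "£" agrees in both systems.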
Running this very similar Python script
gives the following output
Here the 6th token ("£") gets a character span of one character, as expected.
Environment Info
"ai.djl.huggingface:tokenizers:0.19.0"