Returned character token spans are not correct for some inputs #2112

Closed
muskedunder opened this issue Nov 1, 2022 · 4 comments · Fixed by #2137
Labels
bug Something isn't working

Comments


muskedunder commented Nov 1, 2022

Description

For some inputs, the character spans returned by Encoding.getCharTokenSpans() do not correspond to the correct character spans in the input string.

Expected Behavior

Encoding.getCharTokenSpans() should return the correct character span, and it should return the same character spans as when using the same tokenizer in Python.

Error Message

No error message.

Steps to reproduce

The error can be reproduced with this short Scala snippet:

    import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer

    val someText = "just some string including a £9.00 price"
    val tokenizer = HuggingFaceTokenizer.newInstance("distilbert-base-uncased")
    val encodedText = tokenizer.encode(someText)

    val charSpans = encodedText.getCharTokenSpans()
    val tokens = encodedText.getTokens()
    // Start at 1 to skip the leading [CLS] token.
    for (i <- 1 until charSpans.length) {
        println(s"token: ${tokens(i)}, orig string: ${someText.substring(charSpans(i).getStart, charSpans(i).getEnd)}, span start=${charSpans(i).getStart}, span end=${charSpans(i).getEnd}")
    }

This script produces the following output:

token: just, orig string: just, span start=0, span end=4
token: some, orig string: some, span start=5, span end=9
token: string, orig string: string, span start=10, span end=16
token: including, orig string: including, span start=17, span end=26
token: a, orig string: a, span start=27, span end=28
token: £, orig string: £9, span start=29, span end=31
token: ##9, orig string: ., span start=31, span end=32
token: ., orig string: 0, span start=32, span end=33
token: 00, orig string: 0 , span start=33, span end=35

So for some reason the 6th token ("£") gets a character span of two characters, and then the rest of the spans are off by one.

Running this very similar Python script

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

some_string = "just some string including a £9.00 price"
encoded_input = tokenizer(some_string, return_offsets_mapping=True)

tokens = tokenizer.convert_ids_to_tokens(encoded_input["input_ids"])
for token, span in zip(tokens[1:-1], encoded_input["offset_mapping"][1:-1]):
    print(f"token: {token}, orig string: {some_string[span[0]:span[1]]}, span start={span[0]}, span end={span[1]}")

gives the following output

token: just, orig string: just, span start=0, span end=4
token: some, orig string: some, span start=5, span end=9
token: string, orig string: string, span start=10, span end=16
token: including, orig string: including, span start=17, span end=26
token: a, orig string: a, span start=27, span end=28
token: £, orig string: £, span start=29, span end=30
token: ##9, orig string: 9, span start=30, span end=31
token: ., orig string: ., span start=31, span end=32
token: 00, orig string: 00, span start=32, span end=34
token: price, orig string: price, span start=35, span end=40

Here the 6th token ("£") gets a character span of one character, as expected.
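The comparison between the two outputs can also be checked mechanically. Below is a minimal sketch, not part of either library: the helper name `mismatched_tokens` is hypothetical, and the span triples are copied by hand from the two outputs above. It slices the input with each span and flags tokens whose slice does not match the token text.

```python
text = "just some string including a £9.00 price"

# (token, start, end) triples copied from the two outputs above.
djl_spans = [
    ("just", 0, 4), ("some", 5, 9), ("string", 10, 16),
    ("including", 17, 26), ("a", 27, 28), ("£", 29, 31),
    ("##9", 31, 32), (".", 32, 33), ("00", 33, 35),
]
python_spans = [
    ("just", 0, 4), ("some", 5, 9), ("string", 10, 16),
    ("including", 17, 26), ("a", 27, 28), ("£", 29, 30),
    ("##9", 30, 31), (".", 31, 32), ("00", 32, 34), ("price", 35, 40),
]


def mismatched_tokens(text, spans):
    """Return tokens whose span does not slice out the token's own text."""
    bad = []
    for token, start, end in spans:
        # WordPiece marks word continuations with a "##" prefix; drop it
        # before comparing. The uncased model lowercases its input.
        expected = token[2:] if token.startswith("##") else token
        if text[start:end].lower() != expected:
            bad.append(token)
    return bad


djl_bad = mismatched_tokens(text, djl_spans)
python_bad = mismatched_tokens(text, python_spans)
print(djl_bad)     # tokens with wrong spans in the DJL output
print(python_bad)  # tokens with wrong spans in the Python output
```

This check is only meaningful for inputs where the tokenizer's normalization is limited to lowercasing; inputs with stripped accents or similar rewrites would need a looser comparison.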

Environment Info

"ai.djl.huggingface:tokenizers:0.19.0"

muskedunder added the bug label Nov 1, 2022

lanking520 (Contributor) commented

Hi @muskedunder, did you see anything different compared to Python?

muskedunder (Author) commented Nov 2, 2022

Yes, that's what I explained above.

That is, for some tokens, in this case "£", the DJL version of the tokenizer returns a different span than the regular Python version. The span returned by DJL has length 2, even though the token is clearly just one character long.

lanking520 (Contributor) commented

We are able to reproduce the issue you are having. When we tried with $, we did not observe the same problem. Our current guess is that non-ASCII characters (code points above 127) are handled differently: most of them take 2 bytes instead of one, which is why you see a length of 2. We are working to root-cause this issue.
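To illustrate the guess above, here is a small sketch using only the Python standard library. It assumes the text is encoded as UTF-8; the helper names `char_to_byte_offset` and `byte_to_char_offset` are hypothetical and not part of DJL or the tokenizers library. It converts the character spans reported by the Python tokenizer into UTF-8 byte offsets, and the result happens to equal the spans DJL returned. That is consistent with the byte-length hypothesis, though it does not by itself confirm the root cause.

```python
text = "just some string including a £9.00 price"

# Character spans reported by the Python tokenizer for "£", "##9", ".", "00".
char_spans = [(29, 30), (30, 31), (31, 32), (32, 34)]


def char_to_byte_offset(s: str, char_offset: int) -> int:
    """Return the UTF-8 byte offset of the given character offset."""
    return len(s[:char_offset].encode("utf-8"))


def byte_to_char_offset(s: str, byte_offset: int) -> int:
    """Return the character offset of the given UTF-8 byte offset.

    Assumes byte_offset falls on a character boundary; otherwise the
    decode step raises UnicodeDecodeError.
    """
    return len(s.encode("utf-8")[:byte_offset].decode("utf-8"))


# "£" (U+00A3) occupies two bytes in UTF-8, "$" only one.
pound_len = len("£".encode("utf-8"))
dollar_len = len("$".encode("utf-8"))

byte_spans = [
    (char_to_byte_offset(text, start), char_to_byte_offset(text, end))
    for start, end in char_spans
]
print(byte_spans)

# Mapping the byte spans back recovers the original character spans.
round_trip = [
    (byte_to_char_offset(text, start), byte_to_char_offset(text, end))
    for start, end in byte_spans
]
print(round_trip)
```

Until a fixed version is available, a conversion along the lines of `byte_to_char_offset` could serve as a stopgap on the caller's side, on the assumption that the returned offsets really are UTF-8 byte offsets.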

lanking520 (Contributor) commented

Here is the fix. After applying it, the error should be gone.
