Returned character token spans are not correct for some inputs #2112

muskedunder · 2022-11-01T13:26:30Z

Description

For some inputs, the character spans returned by Encoding.getCharTokenSpans() do not correspond to the correct character spans in the input string.

Expected Behavior

Encoding.getCharTokenSpans() should return the correct character span, and it should return the same character spans as when using the same tokenizer in Python.

Error Message

No error message.

Steps to reproduce

The error can be reproduced with this short Scala snippet:

    import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer

    val someText = "just some string including a £9.00 price"
    val tokenizer = HuggingFaceTokenizer.newInstance("distilbert-base-uncased")
    val encodedText = tokenizer.encode(someText)

    val charSpans = encodedText.getCharTokenSpans()
    val tokens = encodedText.getTokens()
    // Start at index 1 to skip the leading special token ([CLS]).
    for (i <- 1 until charSpans.length) {
        println(s"token: ${tokens(i)}, orig string: ${someText.substring(charSpans(i).getStart, charSpans(i).getEnd)}, span start=${charSpans(i).getStart}, span end=${charSpans(i).getEnd}")
    }

This script produces the following output:

token: just, orig string: just, span start=0, span end=4
token: some, orig string: some, span start=5, span end=9
token: string, orig string: string, span start=10, span end=16
token: including, orig string: including, span start=17, span end=26
token: a, orig string: a, span start=27, span end=28
token: £, orig string: £9, span start=29, span end=31
token: ##9, orig string: ., span start=31, span end=32
token: ., orig string: 0, span start=32, span end=33
token: 00, orig string: 0 , span start=33, span end=35

So for some reason the 6th token ("£") gets a character span of two characters, and all subsequent spans are shifted by one.

Running this very similar Python script

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

some_string = "just some string including a £9.00 price"
encoded_input = tokenizer(some_string, return_offsets_mapping=True)

tokens = tokenizer.convert_ids_to_tokens(encoded_input["input_ids"])
# Slice with [1:-1] to skip the special [CLS] and [SEP] tokens.
for token, span in zip(tokens[1:-1], encoded_input["offset_mapping"][1:-1]):
    print(f"token: {token}, orig string: {some_string[span[0]:span[1]]}, span start={span[0]}, span end={span[1]}")

gives the following output

token: just, orig string: just, span start=0, span end=4
token: some, orig string: some, span start=5, span end=9
token: string, orig string: string, span start=10, span end=16
token: including, orig string: including, span start=17, span end=26
token: a, orig string: a, span start=27, span end=28
token: £, orig string: £, span start=29, span end=30
token: ##9, orig string: 9, span start=30, span end=31
token: ., orig string: ., span start=31, span end=32
token: 00, orig string: 00, span start=32, span end=34
token: price, orig string: price, span start=35, span end=40

And the 6th token ("£") gets a character span of one character, as expected.
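The two outputs above can also be compared mechanically. The following is a minimal sketch, not part of DJL or transformers: the helper name `misaligned_tokens` is made up for illustration, the token and span values are copied from the outputs in this report, and the comparison assumes an uncased WordPiece tokenizer (lowercased text, `##` marking continuation pieces). It ignores other normalization such as accent stripping, so treat it as a heuristic check only.

```python
def misaligned_tokens(text, tokens, spans):
    """Return (token, covered_text) pairs whose span does not cover the token text."""
    bad = []
    for token, (start, end) in zip(tokens, spans):
        # WordPiece marks word-continuation pieces with a leading "##".
        surface = token[2:] if token.startswith("##") else token
        covered = text[start:end]
        # The tokenizer is uncased, so compare against the lowercased input.
        if covered.lower() != surface:
            bad.append((token, covered))
    return bad


text = "just some string including a £9.00 price"
tokens = ["a", "£", "##9", ".", "00"]

# Spans copied from the DJL output and the Python output in this report.
djl_spans = [(27, 28), (29, 31), (31, 32), (32, 33), (33, 35)]
python_spans = [(27, 28), (29, 30), (30, 31), (31, 32), (32, 34)]

print(misaligned_tokens(text, tokens, djl_spans))
# [('£', '£9'), ('##9', '.'), ('.', '0'), ('00', '0 ')]
print(misaligned_tokens(text, tokens, python_spans))
# []
```

With the DJL spans, every token from "£" onwards is flagged, while the Python spans pass the check.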

Environment Info

"ai.djl.huggingface:tokenizers:0.19.0"


lanking520 · 2022-11-01T21:39:23Z

Hi @muskedunder, did you see anything different compared to Python?

muskedunder · 2022-11-02T09:59:42Z

Yes, that's what I explained above.

I.e., for some tokens, in this case "£", the DJL version of the tokenizer returns a different span than the regular Python version. The span returned by DJL has length 2, even though the token is clearly just one character long.

lanking520 · 2022-11-04T16:12:47Z

We are able to reproduce the issue you are having. When we tried "$" instead, we did not observe it. Our current guess is that characters with code points above 127 (i.e. non-ASCII characters) are interpreted differently: in UTF-8 they take two or more bytes instead of one, which is why you see a length of 2. We are working on finding the root cause.
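The byte-versus-character guess above can be checked with a small sketch. It assumes the offsets were computed over the UTF-8 bytes of the input rather than over its characters. The helper names `char_span_to_byte_span` and `byte_span_to_char_span` are hypothetical and do not exist in DJL or in the tokenizers library. This only illustrates the suspected mechanism and is not the actual fix.

```python
def char_span_to_byte_span(text, start, end):
    """Convert a character span into the matching UTF-8 byte span."""
    byte_start = len(text[:start].encode("utf-8"))
    byte_end = len(text[:end].encode("utf-8"))
    return byte_start, byte_end


def byte_span_to_char_span(text, byte_start, byte_end):
    """Convert a UTF-8 byte span back into a character span.

    Assumes both offsets fall on character boundaries.
    """
    data = text.encode("utf-8")
    char_start = len(data[:byte_start].decode("utf-8"))
    char_end = len(data[:byte_end].decode("utf-8"))
    return char_start, char_end


some_string = "just some string including a £9.00 price"

# "£" is one character but two bytes in UTF-8; "$" is one byte.
print(len("£"), len("£".encode("utf-8")))  # 1 2
print(len("$"), len("$".encode("utf-8")))  # 1 1

# Character spans reported by Python for "£", "##9", "." and "00".
python_spans = [(29, 30), (30, 31), (31, 32), (32, 34)]
byte_spans = [char_span_to_byte_span(some_string, s, e) for s, e in python_spans]
print(byte_spans)  # [(29, 31), (31, 32), (32, 33), (33, 35)]
```

The byte spans printed here match the spans reported by DJL in the issue description, which is consistent with the guess, and it would also explain why "$", a single-byte character, does not trigger the problem.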

lanking520 · 2022-11-05T01:15:30Z

Here is the fix. After applying it, you should see the error gone.

muskedunder added the bug label Nov 1, 2022

lanking520 mentioned this issue Nov 5, 2022

[Tokenizer] fix char offset #2137

Merged

lanking520 closed this as completed in #2137 Nov 5, 2022
