
Fixing OPT fast tokenizer option. #18753

Merged
Narsil merged 4 commits into huggingface:main from Narsil:fix_opt_fast on Sep 15, 2022

Conversation

Narsil (Contributor) commented Aug 24, 2022

What does this PR do?

Fixes the relevant issues:

https://huggingface.co/wjmcat/opt-350m-paddle/discussions/1
https://huggingface.slack.com/archives/C01N44FJDHT/p1653511495183519 (internal link)
#17088 (comment)

Basically, the OPT tokenizer is a GPT2 tokenizer that adds a BOS token at the start
of the tokens.
There was a lot of back and forth at the time, but the truth is that the ByteLevel(trim_offsets=False)
post_processor actually doesn't do anything, so we can just replace it with a simple
TemplateProcessing processor and everything works correctly.
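Roughly, the idea is the following (a sketch, not the exact diff in this PR; it assumes OPT's BOS token is `</s>` and uses the public `tokenizers` API):

```python
from tokenizers.processors import TemplateProcessing
from transformers import AutoTokenizer

# Sketch: swap the no-op ByteLevel(trim_offsets=False) post-processor for a
# TemplateProcessing that prepends the BOS token to every sequence.
tok = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=True)
bos, bos_id = tok.bos_token, tok.bos_token_id  # "</s>" / 2 for OPT checkpoints

tok.backend_tokenizer.post_processor = TemplateProcessing(
    single=f"{bos} $A",
    pair=f"{bos} $A {bos} $B",  # assumption: each segment of a pair also gets a BOS
    special_tokens=[(bos, bos_id)],
)
```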

This PR fixes the biggest culprit (missing BOS token on the fast tokenizer version).
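A quick way to see the symptom and the fix (the checkpoint name is just for illustration):

```python
from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=False)
fast = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=True)

text = "A photo of a cat"
# Before this fix the fast tokenizer dropped the leading BOS token; with the
# fix both tokenizers agree and the first id is bos_token_id.
assert fast(text).input_ids == slow(text).input_ids
assert fast(text).input_ids[0] == fast.bos_token_id
```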

Tagging a few people as witnesses for other issues I might have missed:

@SaulLu
@patrickvonplaten
@Mishig


Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev commented Aug 24, 2022

The documentation is not available anymore as the PR was closed or merged.

SaulLu (Contributor) commented Aug 25, 2022

Good point!

Indeed, I had not noticed at all that ByteLevel(trim_offsets=False) was not doing anything. As trim_offsets is not an argument in the __init__ either, I don't see any problem that your solution could cause! 😊



@require_tokenizers
class OPTTokenizationTest(unittest.TestCase):
Contributor
Out of curiosity, why create a new test class? 🤗

Contributor Author
No mixin, it's a simple test.

If there are better/more suited locations for this test I can move it.
I just thought it didn't fit the Mixin type of tests (which are intended to be highly generic, right?)

Contributor

Fine for me - I think in the normal GPT2 fast tests we test the tokenizer without a leading BOS token. @Narsil could we maybe add 1-2 more tests here to check that fast gives output identical to slow, and maybe also a test that the first token (the BOS token) can be changed to whatever token the user wants with save / re-load?
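(For reference, a rough sketch of what such tests could look like; this is not the exact code that ended up in the PR, the checkpoint is illustrative, and the BOS-customization check mentioned above would be a third test in the same style.)

```python
import tempfile
import unittest

from transformers import AutoTokenizer
from transformers.testing_utils import require_tokenizers


@require_tokenizers
class OPTTokenizationTest(unittest.TestCase):
    def test_fast_matches_slow(self):
        # The fast tokenizer should produce the same ids as the slow one,
        # including the leading BOS token.
        slow = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=False)
        fast = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=True)
        text = "A photo of a cat"
        self.assertEqual(slow(text).input_ids, fast(text).input_ids)

    def test_serialize_deserialize_fast_opt(self):
        # Saving and re-loading the fast tokenizer should keep the BOS behaviour.
        fast = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=True)
        text = "A photo of a cat"
        ids_before = fast(text).input_ids
        with tempfile.TemporaryDirectory() as tmp_dir:
            fast.save_pretrained(tmp_dir)
            reloaded = AutoTokenizer.from_pretrained(tmp_dir, use_fast=True)
        ids_after = reloaded(text).input_ids
        self.assertEqual(ids_before, ids_after)
        self.assertEqual(ids_after[0], reloaded.bos_token_id)
```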

Contributor Author

I added two tests, are those what you had in mind?

patrickvonplaten (Contributor) left a comment

Wuhuu super cool PR @Narsil thanks a mille!

If it's not too much of a hassle, it'd be very nice to add 2-3 more tests for the fast OPT tokenizer.

patrickvonplaten (Contributor) left a comment

Thanks!

@Narsil Narsil merged commit 68bb33d into huggingface:main Sep 15, 2022
@Narsil Narsil deleted the fix_opt_fast branch September 15, 2022 15:13
LysandreJik pushed a commit that referenced this pull request Sep 16, 2022
* Fixing OPT fast tokenizer option.

* Remove dependency on `pt`.

* Move it to GPT2 tokenization tests.

* Added a few tests.