
Fixing OPT fast tokenizer option. #18753

Merged
Narsil merged 4 commits into huggingface:main from Narsil:fix_opt_fast on Sep 15, 2022

Conversation

Narsil (Contributor) commented Aug 24, 2022

What does this PR do?

Fixes the relevant issues:

https://huggingface.co/wjmcat/opt-350m-paddle/discussions/1
https://huggingface.slack.com/archives/C01N44FJDHT/p1653511495183519 (internal link)
#17088 (comment)

Basically, the OPT tokenizer is a GPT2 tokenizer that adds a BOS token at the start
of the tokens.
There was a lot of back and forth at the time, but the truth is that the ByteLevel(trim_offsets=False)
post_processor actually doesn't do anything, so we can just replace it with a simple
TemplateProcessing processor and everything works correctly.
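Roughly, the idea is the following (a sketch, not the exact diff in this PR; it assumes OPT's BOS token is `</s>` and uses the public `tokenizers` API):

```python
from tokenizers.processors import TemplateProcessing
from transformers import AutoTokenizer

# Sketch: swap the no-op ByteLevel(trim_offsets=False) post-processor for a
# TemplateProcessing that prepends the BOS token to every sequence.
tok = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=True)
bos, bos_id = tok.bos_token, tok.bos_token_id  # "</s>" / 2 for OPT checkpoints

tok.backend_tokenizer.post_processor = TemplateProcessing(
    single=f"{bos} $A",
    pair=f"{bos} $A {bos} $B",  # assumption: each segment of a pair also gets a BOS
    special_tokens=[(bos, bos_id)],
)
```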

This PR fixes the biggest culprit (missing BOS token on the fast tokenizer version).
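A quick way to see the symptom and the fix (the checkpoint name is just for illustration):

```python
from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=False)
fast = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=True)

text = "A photo of a cat"
# Before this fix the fast tokenizer dropped the leading BOS token; with the
# fix both tokenizers agree and the first id is bos_token_id.
assert fast(text).input_ids == slow(text).input_ids
assert fast(text).input_ids[0] == fast.bos_token_id
```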

Tagging a few people as witnesses for other issues I might have missed:

@SaulLu
@patrickvonplaten
@Mishig


Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev commented Aug 24, 2022

The documentation is not available anymore as the PR was closed or merged.

SaulLu (Contributor) commented Aug 25, 2022

Good point!

Indeed, I had not noticed at all that ByteLevel(trim_offsets=False) was not doing anything. As trim_offsets is not an argument in the __init__ either, I don't see any problem that your solution could cause! 😊



@require_tokenizers
class OPTTokenizationTest(unittest.TestCase):
Contributor
Out of curiosity, why create a new test class? 🤗

Contributor Author
No mixin, it's a simple test.

If there are better/more suited locations for this test I can move it.
I just thought it didn't fit the Mixin type of tests (which are intended to be highly generic, right?)

Contributor

Fine for me - I think in the normal GPT2 fast tests we test the tokenizer without a leading BOS token. @Narsil could we maybe add 1-2 more tests here to check that fast gives output identical to slow, and maybe also a test that the first token (the BOS token) can be changed to whatever token the user wants with save / re-load?
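(For reference, a rough sketch of what such tests could look like; this is not the exact code that ended up in the PR, the checkpoint is illustrative, and the BOS-customization check mentioned above would be a third test in the same style.)

```python
import tempfile
import unittest

from transformers import AutoTokenizer
from transformers.testing_utils import require_tokenizers


@require_tokenizers
class OPTTokenizationTest(unittest.TestCase):
    def test_fast_matches_slow(self):
        # The fast tokenizer should produce the same ids as the slow one,
        # including the leading BOS token.
        slow = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=False)
        fast = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=True)
        text = "A photo of a cat"
        self.assertEqual(slow(text).input_ids, fast(text).input_ids)

    def test_serialize_deserialize_fast_opt(self):
        # Saving and re-loading the fast tokenizer should keep the BOS behaviour.
        fast = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=True)
        text = "A photo of a cat"
        ids_before = fast(text).input_ids
        with tempfile.TemporaryDirectory() as tmp_dir:
            fast.save_pretrained(tmp_dir)
            reloaded = AutoTokenizer.from_pretrained(tmp_dir, use_fast=True)
        ids_after = reloaded(text).input_ids
        self.assertEqual(ids_before, ids_after)
        self.assertEqual(ids_after[0], reloaded.bos_token_id)
```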

Contributor Author

I added two tests, are those what you had in mind?

patrickvonplaten (Contributor) left a comment

Wuhuu super cool PR @Narsil thanks a mille!

If it's not too much of a hassle, it'd be very nice to add 2-3 more tests for the fast OPT tokenizer.

patrickvonplaten (Contributor) left a comment

Thanks!

@Narsil Narsil merged commit 68bb33d into huggingface:main Sep 15, 2022
@Narsil Narsil deleted the fix_opt_fast branch September 15, 2022 15:13
LysandreJik pushed a commit that referenced this pull request Sep 16, 2022
* Fixing OPT fast tokenizer option.

* Remove dependency on `pt`.

* Move it to GPT2 tokenization tests.

* Added a few tests.