Fixing OPT fast tokenizer option. #18753
Conversation
The documentation is not available anymore as the PR was closed or merged.
Good point! Indeed, I had not noticed at all that
@require_tokenizers
class OPTTokenizationTest(unittest.TestCase):
Out of curiosity, why create a new test class? 🤗
No mixin, it's a simple test.
If there are better/more suited locations for this test I can move it.
I just thought it didn't fit the Mixin type of tests (which are intended to be highly generic, right?)
Fine for me - I think in the normal GPT2 Fast Test we test the tokenizer without a leading BOS token. @Narsil could we maybe add 1-2 more tests here to check that fast gives identical output to slow, and maybe also a test that the first token (BOS token) can be changed to whatever token the user wants with save / re-load?
I added two tests, are those what you had in mind?
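For reference, a rough sketch of what such checks could look like (illustrative only, not the exact tests added in this PR; it assumes the `facebook/opt-350m` checkpoint is reachable):

```python
import unittest

from transformers import AutoTokenizer
from transformers.testing_utils import require_tokenizers


@require_tokenizers
class OPTTokenizationTest(unittest.TestCase):
    def test_fast_matches_slow(self):
        # The fast tokenizer should produce the same ids as the slow one.
        slow = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=False)
        fast = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=True)
        text = "Hello world!"
        self.assertEqual(slow(text).input_ids, fast(text).input_ids)

    def test_bos_token_is_prepended(self):
        # The first token emitted by the fast tokenizer should be the BOS token.
        fast = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=True)
        ids = fast("Hello world!").input_ids
        self.assertEqual(ids[0], fast.bos_token_id)
```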
Wuhuu super cool PR @Narsil, thanks a mille!
If it's not too much of a hassle, it'd be very nice to add 2-3 more tests for the fast OPT tokenizer.
Thanks!
64d9eb8 to 1e4b9ee:
* Fixing OPT fast tokenizer option.
* Remove dependency on `pt`.
* Move it to GPT2 tokenization tests.
* Added a few tests.
What does this PR do?
Fixes the relevant issues:
https://huggingface.co/wjmcat/opt-350m-paddle/discussions/1
https://huggingface.slack.com/archives/C01N44FJDHT/p1653511495183519 (internal link)
#17088 (comment)
Basically, the OPT tokenizer is a GPT2 tokenizer that adds a BOS token at the start of the tokens.
There was a lot of back and forth at the time, but the truth is that the `ByteLevel(trim_offsets=False)` post_processor actually doesn't do anything, so we can just replace it with a simple `TemplateProcessing` processor and everything works correctly. This PR fixes the biggest culprit (the missing BOS token on the fast tokenizer version).
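For illustration, here is a minimal sketch of swapping in such a `TemplateProcessing` post-processor with the `tokenizers` library. This is not taken from the PR's diff; the `</s>` BOS token and its id are assumptions based on the OPT checkpoints:

```python
from tokenizers.processors import TemplateProcessing
from transformers import AutoTokenizer

# Load the fast OPT tokenizer (assumes the facebook/opt-350m checkpoint).
tok = AutoTokenizer.from_pretrained("facebook/opt-350m", use_fast=True)

# Replace the post_processor with a template that prepends the BOS token.
tok.backend_tokenizer.post_processor = TemplateProcessing(
    single=f"{tok.bos_token} $A",
    pair=f"{tok.bos_token} $A {tok.bos_token} $B",
    special_tokens=[(tok.bos_token, tok.bos_token_id)],
)

# The first id of any encoding should now be the BOS token id.
print(tok("Hello world").input_ids[0] == tok.bos_token_id)  # expect True
```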
Calling in witnesses for any other issues I might have missed:
@saulu
@patrickvonplaten
@Mishig
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.