
Fix and improve CTRL doctests #16573

Merged: 17 commits merged into huggingface:main from fix-ctrl-doctest-failures on Apr 13, 2022

Conversation

@jeremyadamsfisher (Contributor) commented Apr 4, 2022

  • Improve CTRL doctests and fix test assertions, where appropriate

What does this PR do?

This PR addresses the CTRL doctest failures and replaces the example text with one that is more appropriate for CTRL specifically (i.e., by prefacing it with a control code).
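A minimal sketch of the kind of control-code-prefixed example this converges on (assembled from the snippets discussed below; the exact doctest in the PR may differ):

```python
>>> from transformers import CTRLTokenizer, CTRLLMHeadModel

>>> tokenizer = CTRLTokenizer.from_pretrained("ctrl")
>>> model = CTRLLMHeadModel.from_pretrained("ctrl")

>>> # CTRL was trained with control codes as the first token
>>> inputs = tokenizer("Opinion my dog is cute", return_tensors="pt")
>>> assert inputs["input_ids"][0, 0].item() in tokenizer.control_codes.values()

>>> outputs = model(**inputs, labels=inputs["input_ids"])
>>> loss = outputs.loss
```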

Motivated as part of the doctest sprint: #16292

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@patrickvonplaten @ydshieh @patil-suraj

@HuggingFaceDocBuilderDev commented Apr 4, 2022

The documentation is not available anymore as the PR was closed or merged.

@ydshieh self-requested a review April 4, 2022 10:01

@ydshieh (Collaborator) left a comment

Thank you for adding the examples, @jeremyadamsfisher!

I ran it locally and the tests all pass 🚀

BTW, you forgot to add the ctrl model to utils/documentation_tests.txt.

LGTM! I left a few tiny comments.

I would also like my colleagues to review this PR too 🙂.


>>> inputs = tokenizer("Opinion my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs, labels=inputs["input_ids"])
>>> loss = outputs.loss
Collaborator:

we can actually provide the expected value
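For example, something along these lines (a sketch; the value shown is the rounded loss that comes up later in this thread):

```python
>>> round(outputs.loss.item(), 2)
5.79
```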

>>> inputs = tokenizer("Opinion my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs, labels=inputs["input_ids"])
>>> loss = outputs.loss
>>> logits = outputs.logits
Collaborator:

we can actually provide the expected value (shape of the logit)
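For example (a sketch; the shape is the one used later in this thread):

```python
>>> list(outputs.logits.shape)
[1, 5, 246534]
```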

>>> tokenizer = CTRLTokenizer.from_pretrained("sshleifer/tiny-ctrl")
>>> model = CTRLForSequenceClassification.from_pretrained("sshleifer/tiny-ctrl")

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
Collaborator:

Any reason here for not having Opinion at the beginning?

Contributor Author:

Nope, this was an oversight

>>> from transformers import CTRLTokenizer, CTRLModel
>>> import torch

>>> tokenizer = CTRLTokenizer.from_pretrained("sshleifer/tiny-ctrl")
Collaborator:

We can actually use ctrl as the checkpoint for the base model.
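i.e., something like (sketch):

```python
>>> tokenizer = CTRLTokenizer.from_pretrained("ctrl")
>>> model = CTRLModel.from_pretrained("ctrl")
```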

@ydshieh (Collaborator) commented Apr 5, 2022

@patrickvonplaten @sgugger

Could you also take a look when you have some time?

In particular, I can't find any existing documentation mentioning the usage of Opinion ... (putting the control code as the first word), although I feel this is the way to go.

(still don't feel very comfortable without seeing this usage)

@sgugger (Collaborator) commented Apr 5, 2022

Let's ping @LysandreJik as he might know more about CTRL ;-)

@jeremyadamsfisher (Contributor Author):

Thanks for the review! I'll address these comments asap.

As for the control code coming first, there's actually an example right here:

def prepare_ctrl_input(args, _, tokenizer, prompt_text):

@ydshieh (Collaborator) commented Apr 5, 2022

> Thanks for the review! I'll address these comments asap.
>
> As for the control code coming first, there's actually an example right here:
>
> def prepare_ctrl_input(args, _, tokenizer, prompt_text):

Thank you for this info. @jeremyadamsfisher

@jeremyadamsfisher force-pushed the fix-ctrl-doctest-failures branch from 558f4af to cce1ab7 on April 6, 2022 01:51
@jeremyadamsfisher (Contributor Author):

Thanks again for the feedback.

I've added assertions on lines 383 and 562-565 and changed the model from sshleifer/tiny-ctrl to ctrl.

@ydshieh (Collaborator) left a comment

Thanks for the update. I added a few more nit comments.

Let's wait a bit to see if Lysandre has any comment, but it is very appreciated that you provided the information about the ctrl code usage :-)

>>> inputs = tokenizer("Opinion my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs, labels=inputs["input_ids"])
>>> outputs.loss.item()
5.788386821746826
Collaborator:

This is too precise 😅, and very likely to fail the test (especially when running on other machines). We use the following instead:

>>> round(outputs.loss.item(), 2)
5.79

>>> outputs.logits.shape
torch.Size([1, 5, 246534])
Collaborator:

Let's use list(outputs.logits.shape) and put the output as [1, 5, 246534].

(simpler than torch.Size([1, 5, 246534]))


>>> last_hidden_states = outputs.last_hidden_state
>>> last_hidden_states.shape
torch.Size([1, 5, 1280])
Collaborator:

Let's use list(last_hidden_states.shape) and put the output as [1, 5, 1280].

(simpler than torch.Size([1, 5, 1280]))

@jeremyadamsfisher (Contributor Author):

Sure thing, those are easy changes.

To clarify the control code coming first, would it make sense to add something like this?

>>> # CTRL was trained with control codes as the first token
>>> inputs = tokenizer("Opinion my dog is cute", return_tensors="pt")
>>> assert inputs[0] in tokenizer.control_codes.values()

@jeremyadamsfisher force-pushed the fix-ctrl-doctest-failures branch from 7c5ed04 to b9424d4 on April 6, 2022 23:56
@jeremyadamsfisher (Contributor Author) commented Apr 6, 2022

> To clarify the control code coming first, would it make sense to add something like this?
>
> >>> # CTRL was trained with control codes as the first token
> >>> inputs = tokenizer("Opinion my dog is cute", return_tensors="pt")
> >>> assert inputs[0] in tokenizer.control_codes.values()

That doesn't seem to work, will tinker with this a bit more:

UNEXPECTED EXCEPTION: KeyError('Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers')
Traceback (most recent call last):
  File "/Users/jeremyfisher/.pyenv/versions/3.8.12/lib/python3.8/doctest.py", line 1336, in __run
    exec(compile(example.source, filename, "single",
  File "<doctest transformers.models.ctrl.modeling_ctrl.CTRLModel.forward[5]>", line 1, in <module>
  File "/Users/jeremyfisher/Documents/transformers/src/transformers/tokenization_utils_base.py", line 239, in __getitem__
    raise KeyError(
KeyError: 'Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers'

@jeremyadamsfisher (Contributor Author) commented Apr 7, 2022

Aha! This works:

>>> # CTRL was trained with control codes as the first token
>>> inputs = tokenizer("Opinion my dog is cute", return_tensors="pt")
>>> assert inputs["input_ids"][0,0].item() in tokenizer.control_codes.values()

Added this to the doctest wherever there was an inputs = tokenizer(...) call.

@jeremyadamsfisher (Contributor Author):

@ydshieh heads up -- I've addressed your second set of comments and the checks have passed :)

Still waiting on @LysandreJik; would love to hear your thoughts.

>>> tokenizer = CTRLTokenizer.from_pretrained("ctrl")
>>> model = CTRLLMHeadModel.from_pretrained("ctrl")

>>> # CTRL was trained with control codes as the first token
Contributor:

Indeed!

@patrickvonplaten (Contributor) left a comment

Very nice - exactly right to use the CTRL control codes here :-) Could we maybe also add a generate example to the model?
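A generate example along these lines could look like the following (a minimal sketch assuming the ctrl checkpoint and default generation settings; the prompt is a guess based on the decoded output quoted later in this thread):

```python
>>> from transformers import CTRLTokenizer, CTRLLMHeadModel

>>> tokenizer = CTRLTokenizer.from_pretrained("ctrl")
>>> model = CTRLLMHeadModel.from_pretrained("ctrl")

>>> # "Wikipedia" is one of the CTRL control codes; the prompt here is illustrative
>>> inputs = tokenizer("Wikipedia The llama", return_tensors="pt")
>>> sequence_ids = model.generate(inputs["input_ids"])
>>> sequences = tokenizer.batch_decode(sequence_ids)
```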

@LysandreJik (Member) left a comment

Very cool to leverage control codes for the examples!

@jeremyadamsfisher force-pushed the fix-ctrl-doctest-failures branch from b45df8a to dddc87c on April 8, 2022 01:05
@patrickvonplaten (Contributor) left a comment

Super nice - looks good to me!

I'll let @ydshieh take a final look here.

@ydshieh (Collaborator) left a comment

LGTM 💯 Thank you, @jeremyadamsfisher! I ran it locally (GCP VM, 32GB RAM) and all tests pass!

One thing I am a bit worried about: on CPU, some tests (sequence classification) require ~24GB of memory to run. I am not sure if we will get a GPU OOM when it runs on CI.

Any comment here, @patrickvonplaten ?

>>> sequence_ids = model.generate(inputs["input_ids"])
>>> sequences = tokenizer.batch_decode(sequence_ids)
>>> sequences
['Wikipedia The llama is a member of the family Bovidae. It is native to the Andes of Peru,']
Collaborator:

(nit) I would use sequences[0] and show the output as a string instead of a list
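i.e., the expected output would then read something like this (using the string quoted above):

```python
>>> sequences[0]
'Wikipedia The llama is a member of the family Bovidae. It is native to the Andes of Peru,'
```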


```python
>>> from transformers import CTRLTokenizer, CTRLModel
>>> import torch
```
Collaborator:

this line could be removed (torch imported but not used)

Collaborator:

@patrickvonplaten Do you have a strong opinion here? (Should we do the same for doc.py?)

Contributor:

Yes, it would be cleaner to remove torch, but happy to leave it for a future PR as well.

@patrickvonplaten (Contributor) left a comment

Let's keep it in the test for now even if it takes 24GB of RAM - if it fails we can adapt afterward

@ydshieh (Collaborator) commented Apr 12, 2022

Hi @jeremyadamsfisher, could you try to resolve the conflicts in

src/transformers/models/ctrl/modeling_ctrl.py

Then we are ready to merge :-) Thanks!

(I can help on this if you need, just let me know)

@ydshieh (Collaborator) commented Apr 13, 2022

Hi @jeremyadamsfisher, just to let you know: I resolved the conflicting lines and pushed to this PR branch.
I think we are ready to merge (once the CI tests are green).

If you need to make some more changes (if any), don't forget to git pull first. Thanks.

@ydshieh merged commit 0235bc5 into huggingface:main on Apr 13, 2022
@ydshieh (Collaborator) commented Apr 13, 2022

Merged! Thank you again, @jeremyadamsfisher!

@jeremyadamsfisher (Contributor Author):

> Merged! Thank you again, @jeremyadamsfisher!

Thank you @ydshieh! Apologies I wasn't able to fix the merge conflicts myself, but it is much appreciated!

elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
* Improve CTRL doctests

* Fix `CTRLForSequenceClassification` flakiness with inconsistent losses

* Remove unused

* Fixup

* Add CTRL to documentation_tests.txt

* Fix control code not being first

* Add output assertions

* Change from sshleifer/tiny-ctrl -> ctrl

* Run `make fixup`

* apply `list` to output logits shape for clarity

* Reduce output loss precision to make assertion more robust

* Add assertion of control code being first

* Fix docstyle

* upper case sentence following control code

* Weird bug fixes

* Add a better generation example

Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>