
Feature Request: distinguish the parameters that require grad from those that do not, for PyTorch models #51

Closed
2catycm opened this issue Dec 6, 2024 · 3 comments · Fixed by #59
Labels
feature-request New feature or request

Comments

2catycm commented Dec 6, 2024

No description provided.

danieldjohnson (Collaborator) commented

Hi @2catycm, can you clarify what you mean by this?

2catycm (Author) commented Jan 22, 2025

> Hi @2catycm, can you clarify what you mean by this?

Hi, thanks for your reply. Sorry my description was not clear.

In PyTorch, some torch.Tensor objects require grad, which means that when you call loss.backward(), you want the gradient of the loss to be computed with respect to that tensor. We can pass requires_grad=True when constructing a tensor, read the flag via the requires_grad attribute, and change it in place via requires_grad_().
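
To make the terminology concrete, here is a minimal sketch using only the standard PyTorch API described above:

```python
import torch

# A leaf tensor that participates in autograd.
w = torch.randn(3, requires_grad=True)
x = torch.randn(3)                 # inputs typically do not require grad

loss = (w * x).sum()
loss.backward()                    # gradients flow only into tensors with requires_grad=True

print(w.requires_grad, w.grad)     # True, tensor([...])
print(x.requires_grad, x.grad)     # False, None

w.requires_grad_(False)            # in-place toggle: w is now "frozen"
```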

In deep learning we usually need the gradient of the loss with respect to all model parameters in order to train the model with optimizers like SGD. But in transfer learning and parameter-efficient fine-tuning, not all model parameters need to be modified: some can be frozen to preserve the knowledge learned on previous tasks and prevent catastrophic forgetting. If we train all the parameters, it is called full fine-tuning. If we only train part of the model, for example only the biases (BitFit), only the LayerNorm layers (LN-Tuning), or some newly added modules that are the only trainable part (as in LoRA, Adapters, and Prompt Tuning), it is called parameter-efficient fine-tuning.
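
For example, BitFit-style freezing is just a loop over named_parameters; a sketch (using a small built-in module as a stand-in for a real pretrained model):

```python
import torch.nn as nn

# Stand-in for a pretrained model.
model = nn.TransformerEncoderLayer(d_model=64, nhead=4)

# BitFit-style recipe: train only the bias terms, freeze everything else.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the *bias parameters remain trainable
```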

Since models like LLMs are very large, fine-tuning a pretrained model on many downstream tasks means saving a modified copy of the model many times. If the modification is only partial, it saves a lot of storage.

This makes it a useful feature when visualizing a model before training: it is really helpful to see which parts of the model in our training recipe are frozen (do not require grad and do not need to be stored again) and which parts are trainable (require grad).
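
Even a plain-text summary like the sketch below already helps; having the same requires_grad information in the rendered tree would be much nicer. (The helper name here is made up for illustration.)

```python
def summarize_trainable(model):
    """Print each parameter's shape and frozen/trainable status, plus totals."""
    n_trainable = n_frozen = 0
    for name, param in model.named_parameters():
        status = "trainable" if param.requires_grad else "frozen"
        print(f"{name:50s} {str(tuple(param.shape)):>18s}  {status}")
        if param.requires_grad:
            n_trainable += param.numel()
        else:
            n_frozen += param.numel()
    print(f"trainable params: {n_trainable:,} | frozen params: {n_frozen:,}")

summarize_trainable(model)  # e.g. the BitFit-frozen model from the previous sketch
```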

Another model visualization library, bigmodelvis, supports this feature. It shows the model as a rich tree in the console and assigns different colors to the activated (trainable) parameters and the frozen parameters.
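
If I remember its README correctly, it is used roughly like this (treat the exact import and method name as an assumption on my part, not a verified API):

```python
from bigmodelvis import Visualization

# Renders the module tree in the console, coloring trainable vs. frozen parameters differently.
Visualization(model).structure_graph()
```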

danieldjohnson (Collaborator) commented

Thanks for the clarification! Added the requires_grad info to the rendered summary of parameters and other torch tensors.
