Add Zipformer from Dan #672
Conversation
… the 43->48 change.
Remember to squash when you merge (657 commits... will bulk up the repo size).
Thanks. I will. I am making it support torchscript.
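For context, TorchScript support typically means the model can be compiled with `torch.jit.script` and saved for deployment. A minimal sketch with a toy module (hypothetical; the actual export code in this PR may differ):

```python
import torch
import torch.nn as nn

# Toy module standing in for the Zipformer encoder (hypothetical placeholder).
class Toy(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x)

scripted = torch.jit.script(Toy())  # compile eager PyTorch code to TorchScript
scripted.save("toy.pt")             # the saved model loads without the Python class
```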
```python
class ZipformerEncoderLayer(nn.Module):
    """
    ZipformerEncoderLayer is made up of self-attn, feedforward and convolution networks.
    """
```
The docs should be updated.
The tensorboard log is available at
Here is the comparison between the zipformer and our previous reworked conformer. Note: I only list our previous best results without LM rescoring and without training with extra data.
Will merge it after the CI passes.

You can try the pre-trained model of this PR from your browser without installing anything. Just go to https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition and record your voice for recognition.
All the changes in this PR are from @danpovey.
Things to note:
(1) The model is trained using only LibriSpeech and the number of parameters is about 70.37 M
(2) We can use a much larger `max_duration`, i.e., 750
(3) We use half-precision during training
(4) The model converges much faster and yields the best WER we have on LibriSpeech when not using extra data from GigaSpeech
Here are the results:
(Hint: You can find the results of the Conformer paper at https://arxiv.org/pdf/2005.08100.pdf)
Training command:
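A hypothetical sketch of what the training command likely looks like (the recipe directory name and the exact flag values are assumptions; see the PR for the real command):

```bash
# Hypothetical sketch; the directory name and flag values are assumptions.
# --use-fp16 1 enables the half-precision training mentioned above;
# --max-duration 750 is the larger per-batch duration (in seconds) noted above.
./pruned_transducer_stateless7/train.py \
  --world-size 4 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir pruned_transducer_stateless7/exp \
  --full-libri 1 \
  --max-duration 750
```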
Things to note:
To give you an idea of the training time per epoch:
It is about 1 hour and 20 minutes per epoch.
The number of model parameters:
That is, there are about 70.37 M parameters.
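For reference, a parameter count like the 70.37 M above can be computed in PyTorch as follows (a minimal sketch; the toy `model` below is a hypothetical stand-in for the actual transducer):

```python
import torch.nn as nn

# Hypothetical stand-in for the actual Zipformer transducer model.
model = nn.Linear(512, 512)

# Sum the element counts of all parameter tensors.
num_params = sum(p.numel() for p in model.parameters())
print(f"Number of model parameters: {num_params}")
```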
Decoding commands:
(Note: I only list `--epoch 30 --avg 9`, but I have searched almost all combinations of `--epoch` and `--avg`, and `--epoch 30 --avg 9` is the best one.)
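A hypothetical sketch of what the decoding commands likely look like (the recipe directory name, the `--max-duration` value, and the list of decoding methods are assumptions; see the PR for the exact commands):

```bash
# Hypothetical sketch; paths and values are assumptions, not the exact commands from this PR.
for m in greedy_search modified_beam_search fast_beam_search; do
  ./pruned_transducer_stateless7/decode.py \
    --epoch 30 \
    --avg 9 \
    --exp-dir pruned_transducer_stateless7/exp \
    --max-duration 600 \
    --decoding-method $m
done
```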
To give you an idea of how the model performs in the early epochs:
(1) Validation loss
(2) WERs for earlier epochs (greedy search)
I am uploading the pre-trained models, tensorboard logs, and decoding results to Hugging Face.