Commit 1d44da8

RNN-T Conformer training for LibriSpeech (#143)

* Begin to add RNN-T training for LibriSpeech.
* Copy files from conformer_ctc. Will edit them.
* Use conformer/transformer model as encoder.
* Begin to add training script.
* Add training code.
* Remove long utterances to avoid OOM when a large max_duration is used.
* Begin to add decoding script.
* Add decoding script.
* Minor fixes.
* Add beam search.
* Use LSTM layers for the encoder. Need more tunings.
* Use stateless decoder.
* Minor fixes to make it ready for merge.
* Fix README.
* Update RESULTS.md to include RNN-T Conformer.
* Minor fixes.
* Fix tests.
* Minor fixes.
* Minor fixes.
* Fix tests.

1 parent 76a51bf, commit 1d44da8
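The "Use stateless decoder" step above refers to the RNN-T prediction network: instead of an LSTM that carries recurrent state across the whole hypothesis, the decoder looks only at a fixed number of previously emitted tokens. A minimal NumPy sketch of the idea (the class name, the mean-pooling over the context window, and all dimensions are illustrative assumptions, not this repository's implementation):

```python
import numpy as np


class StatelessDecoder:
    """Sketch of a stateless RNN-T prediction network: its output
    depends only on the last `context_size` emitted tokens, so no
    recurrent state needs to be tracked during decoding."""

    def __init__(self, vocab_size: int, embed_dim: int,
                 context_size: int = 2, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Token embedding table (randomly initialized for the sketch).
        self.embedding = rng.standard_normal((vocab_size, embed_dim))
        self.context_size = context_size

    def forward(self, tokens: list) -> np.ndarray:
        # Keep only the limited left context.
        ctx = tokens[-self.context_size:]
        if not ctx:
            # Nothing emitted yet: return a zero vector.
            return np.zeros(self.embedding.shape[1])
        # Mean-pool the context embeddings into one decoder output.
        return self.embedding[ctx].mean(axis=0)
```

Because the output depends only on the most recent tokens, hypotheses that share the same short history can share decoder output, which simplifies both greedy and beam search.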

53 files changed: +8964, -11 lines

.github/workflows/test.yml (+8, -4)

```diff
@@ -103,8 +103,10 @@ jobs:
         cd egs/librispeech/ASR/conformer_ctc
         pytest -v -s

-        cd ..
-        pytest -v -s ./transducer
+        if [[ ${{ matrix.torchaudio }} == "0.10.0" ]]; then
+          cd ../transducer
+          pytest -v -s
+        fi

       - name: Run tests
         if: startsWith(matrix.os, 'macos')
@@ -120,5 +122,7 @@ jobs:
         cd egs/librispeech/ASR/conformer_ctc
         pytest -v -s

-        cd ..
-        pytest -v -s ./transducer
+        if [[ ${{ matrix.torchaudio }} == "0.10.0" ]]; then
+          cd ../transducer
+          pytest -v -s
+        fi
```
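The workflow change above runs the transducer tests only when the CI matrix selects torchaudio 0.10.0. The same gate can be sketched as a standalone POSIX shell function; the function name and the `TORCHAUDIO_VERSION` variable are illustrative stand-ins for the `${{ matrix.torchaudio }}` expression:

```shell
#!/bin/sh
# Hypothetical standalone version of the CI version gate.
should_run_transducer_tests() {
    # Succeed only for the torchaudio version the tests are pinned to.
    [ "$1" = "0.10.0" ]
}

if should_run_transducer_tests "${TORCHAUDIO_VERSION:-0.10.0}"; then
    echo "would run: pytest -v -s in ../transducer"
fi
```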

.gitignore (+1)

```diff
@@ -8,3 +8,4 @@ exp*/
 download
 *.bak
 *-bak
+*bak.py
```

README.md (+20, -2)

```diff
@@ -34,8 +34,11 @@ We do provide a Colab notebook for this recipe.

 ### LibriSpeech

-We provide two models for this recipe: [conformer CTC model][LibriSpeech_conformer_ctc]
-and [TDNN LSTM CTC model][LibriSpeech_tdnn_lstm_ctc].
+We provide 3 models for this recipe:
+
+- [conformer CTC model][LibriSpeech_conformer_ctc]
+- [TDNN LSTM CTC model][LibriSpeech_tdnn_lstm_ctc]
+- [RNN-T Conformer model][LibriSpeech_transducer]

 #### Conformer CTC Model

@@ -58,6 +61,20 @@ The WER for this model is:

 We provide a Colab notebook to run a pre-trained TDNN LSTM CTC model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1kNmDXNMwREi0rZGAOIAOJo93REBuOTcd?usp=sharing)

+
+#### RNN-T Conformer model
+
+Using Conformer as encoder.
+
+The best WER with greedy search is:
+
+|     | test-clean | test-other |
+|-----|------------|------------|
+| WER | 3.16       | 7.71       |
+
+We provide a Colab notebook to run a pre-trained RNN-T conformer model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1_u6yK9jDkPwG_NLrZMN2XK7Aeq4suMO2?usp=sharing)
+
+
 ### Aishell

 We provide two models for this recipe: [conformer CTC model][Aishell_conformer_ctc]
@@ -125,6 +142,7 @@ Please see: [![Open In Colab](https://colab.research.google.com/assets/colab-bad

 [LibriSpeech_tdnn_lstm_ctc]: egs/librispeech/ASR/tdnn_lstm_ctc
 [LibriSpeech_conformer_ctc]: egs/librispeech/ASR/conformer_ctc
+[LibriSpeech_transducer]: egs/librispeech/ASR/transducer
 [Aishell_tdnn_lstm_ctc]: egs/aishell/ASR/tdnn_lstm_ctc
 [Aishell_conformer_ctc]: egs/aishell/ASR/conformer_ctc
 [TIMIT_tdnn_lstm_ctc]: egs/timit/ASR/tdnn_lstm_ctc
```
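The README above reports the best WER with greedy search. A schematic version of RNN-T greedy decoding, assuming a `joiner(frame, hyp)` callable that returns per-token scores (a stand-in for the real joint network, which combines encoder and decoder outputs):

```python
def greedy_search(encoder_out, joiner, blank_id=0, max_sym_per_frame=3):
    """Sketch of RNN-T greedy search: at each encoder frame, repeatedly
    emit the highest-scoring symbol until the joiner predicts blank
    (or a per-frame cap is hit), then advance to the next frame."""
    hyp = []
    for frame in encoder_out:
        for _ in range(max_sym_per_frame):
            scores = joiner(frame, hyp)
            best = max(range(len(scores)), key=scores.__getitem__)
            if best == blank_id:
                break  # blank: move on to the next frame
            hyp.append(best)
    return hyp
```

The `max_sym_per_frame` cap is a common safeguard against a joiner that never predicts blank; the exact limit used in the recipe is not shown in this commit.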

egs/librispeech/ASR/RESULTS.md (+46)

````diff
@@ -1,5 +1,51 @@
 ## Results

+### LibriSpeech BPE training results (RNN-T)
+
+#### 2021-12-17
+
+RNN-T + Conformer encoder
+
+The best WER is
+
+|     | test-clean | test-other |
+|-----|------------|------------|
+| WER | 3.16       | 7.71       |
+
+using `--epoch 26 --avg 12` during decoding with greedy search.
+
+The training command to reproduce the above WER is:
+
+```
+export CUDA_VISIBLE_DEVICES="0,1,2,3"
+
+./transducer/train.py \
+  --world-size 4 \
+  --num-epochs 30 \
+  --start-epoch 0 \
+  --exp-dir transducer/exp-lr-2.5-full \
+  --full-libri 1 \
+  --max-duration 250 \
+  --lr-factor 2.5
+```
+
+The decoding command is:
+
+```
+epoch=26
+avg=12
+
+./transducer/decode.py \
+  --epoch $epoch \
+  --avg $avg \
+  --exp-dir transducer/exp-lr-2.5-full \
+  --bpe-model ./data/lang_bpe_500/bpe.model \
+  --max-duration 100
+```
+
+You can find the tensorboard log at: <https://tensorboard.dev/experiment/PYIbeD6zRJez1ViXaRqqeg/>
+
+
 ### LibriSpeech BPE training results (Conformer-CTC)

 #### 2021-11-09
````
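In the decoding command above, `--epoch 26 --avg 12` means decoding uses an element-wise average of the model parameters from the last 12 checkpoints ending at epoch 26. The averaging itself is just a mean over checkpoints; a sketch with plain Python lists standing in for tensors (the function name is illustrative, not the recipe's actual helper):

```python
def average_checkpoints(state_dicts):
    """Average the same-named parameter across several checkpoints.

    Each element of `state_dicts` maps a parameter name to a flat
    list of floats (a stand-in for a tensor)."""
    n = len(state_dicts)
    avg = {}
    for name in state_dicts[0]:
        # Gather the value at each position across all checkpoints.
        cols = zip(*(sd[name] for sd in state_dicts))
        avg[name] = [sum(vals) / n for vals in cols]
    return avg
```

Averaging the final checkpoints usually gives a small but consistent WER improvement over using any single epoch's weights.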

egs/librispeech/ASR/conformer_ctc/decode.py (-2)

```diff
@@ -428,8 +428,6 @@ def decode_dataset(
         The first is the reference transcript, and the second is the
         predicted result.
     """
-    results = []
-
     num_cuts = 0

     try:
```
New file (+215 lines): a helper script that prints duration statistics of the LibriSpeech manifests.

```python
#!/usr/bin/env python3
# Copyright 2021 Xiaomi Corp. (authors: Fangjun Kuang)
#
# See ../../../../LICENSE for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
This file displays duration statistics of utterances in a manifest.
You can use the displayed value to choose minimum/maximum duration
to remove short and long utterances during the training.

See the function `remove_short_and_long_utt()` in transducer/train.py
for usage.
"""


from lhotse import load_manifest


def main():
    # Each assignment overrides the previous one; only the last `path`
    # is used. Comment lines out to inspect a different manifest.
    path = "./data/fbank/cuts_train-clean-100.json.gz"
    path = "./data/fbank/cuts_train-clean-360.json.gz"
    path = "./data/fbank/cuts_train-other-500.json.gz"
    path = "./data/fbank/cuts_dev-clean.json.gz"
    path = "./data/fbank/cuts_dev-other.json.gz"
    path = "./data/fbank/cuts_test-clean.json.gz"
    path = "./data/fbank/cuts_test-other.json.gz"

    cuts = load_manifest(path)
    cuts.describe()


if __name__ == "__main__":
    main()


"""
## train-clean-100
Cuts count: 85617
Total duration (hours): 303.8
Speech duration (hours): 303.8 (100.0%)
***
Duration statistics (seconds):
mean    12.8
std     3.8
min     1.3
0.1%    1.9
0.5%    2.2
1%      2.5
5%      4.2
10%     6.4
25%     11.4
50%     13.8
75%     15.3
90%     16.7
95%     17.3
99%     18.1
99.5%   18.4
99.9%   18.8
max     27.2

## train-clean-360
Cuts count: 312042
Total duration (hours): 1098.2
Speech duration (hours): 1098.2 (100.0%)
***
Duration statistics (seconds):
mean    12.7
std     3.8
min     1.0
0.1%    1.8
0.5%    2.2
1%      2.5
5%      4.2
10%     6.2
25%     11.2
50%     13.7
75%     15.3
90%     16.6
95%     17.3
99%     18.1
99.5%   18.4
99.9%   18.8
max     33.0

## train-other-500
Cuts count: 446064
Total duration (hours): 1500.6
Speech duration (hours): 1500.6 (100.0%)
***
Duration statistics (seconds):
mean    12.1
std     4.2
min     0.8
0.1%    1.7
0.5%    2.1
1%      2.3
5%      3.5
10%     5.0
25%     9.8
50%     13.4
75%     15.1
90%     16.5
95%     17.2
99%     18.1
99.5%   18.4
99.9%   18.9
max     31.0

## dev-clean
Cuts count: 2703
Total duration (hours): 5.4
Speech duration (hours): 5.4 (100.0%)
***
Duration statistics (seconds):
mean    7.2
std     4.7
min     1.4
0.1%    1.6
0.5%    1.8
1%      1.9
5%      2.4
10%     2.7
25%     3.8
50%     5.9
75%     9.3
90%     13.3
95%     16.4
99%     23.8
99.5%   28.5
99.9%   32.3
max     32.6

## dev-other
Cuts count: 2864
Total duration (hours): 5.1
Speech duration (hours): 5.1 (100.0%)
***
Duration statistics (seconds):
mean    6.4
std     4.3
min     1.1
0.1%    1.3
0.5%    1.7
1%      1.8
5%      2.2
10%     2.6
25%     3.5
50%     5.3
75%     7.9
90%     12.0
95%     15.0
99%     22.2
99.5%   27.1
99.9%   32.4
max     35.2

## test-clean
Cuts count: 2620
Total duration (hours): 5.4
Speech duration (hours): 5.4 (100.0%)
***
Duration statistics (seconds):
mean    7.4
std     5.2
min     1.3
0.1%    1.6
0.5%    1.8
1%      2.0
5%      2.3
10%     2.7
25%     3.7
50%     5.8
75%     9.6
90%     14.6
95%     17.8
99%     25.5
99.5%   28.4
99.9%   32.8
max     35.0

## test-other
Cuts count: 2939
Total duration (hours): 5.3
Speech duration (hours): 5.3 (100.0%)
***
Duration statistics (seconds):
mean    6.5
std     4.4
min     1.2
0.1%    1.5
0.5%    1.8
1%      1.9
5%      2.3
10%     2.6
25%     3.4
50%     5.2
75%     8.2
90%     12.6
95%     15.8
99%     21.4
99.5%   23.8
99.9%   33.5
max     34.5
"""
```
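The statistics above motivate the cutoff in `remove_short_and_long_utt()`: nearly all training cuts fall between about 1 and 20 seconds, so dropping outliers loses almost no data while avoiding OOM from very long utterances. A self-contained sketch of such a filter (plain floats stand in for lhotse cuts; the bounds are illustrative, and with lhotse the same idea would be a `cuts.filter(...)` call):

```python
def remove_short_and_long_utt(durations, min_sec=1.0, max_sec=20.0):
    """Keep only utterances whose duration (in seconds) lies within
    [min_sec, max_sec]; everything else is dropped before training."""
    return [d for d in durations if min_sec <= d <= max_sec]
```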
New file (+19 lines):

## Introduction

In this folder, the encoder consists of Conformer layers. You can use the
following command to start the training:

```bash
cd egs/librispeech/ASR

export CUDA_VISIBLE_DEVICES="0,1,2,3"

./transducer/train.py \
  --world-size 4 \
  --num-epochs 30 \
  --start-epoch 0 \
  --exp-dir transducer/exp \
  --full-libri 1 \
  --max-duration 250 \
  --lr-factor 2.5
```
New file (+1 line): a symlink pointing to `../tdnn_lstm_ctc/asr_datamodule.py`.
