Gradient filter for training lstm model (#564)
* init files

* add gradient filter module

* refact getting median value

* add cutoff for grad filter

* delete comments

* apply gradient filter in LSTM module, to filter both input and params

* fix typing and refactor

* filter with soft mask

* rename lstm_transducer_stateless2 to lstm_transducer_stateless3

* fix typos, and update RESULTS.md

* minor fix

* fix return typing

* fix typo
yaozengwei authored Sep 29, 2022
1 parent 923b60a commit f3ad327
Showing 23 changed files with 5,448 additions and 28 deletions.
2 changes: 1 addition & 1 deletion .flake8
@@ -9,7 +9,7 @@ per-file-ignores =
egs/*/ASR/pruned_transducer_stateless*/*.py: E501,
egs/*/ASR/*/optim.py: E501,
egs/*/ASR/*/scaling.py: E501,
egs/librispeech/ASR/lstm_transducer_stateless/*.py: E501, E203
egs/librispeech/ASR/lstm_transducer_stateless*/*.py: E501, E203
egs/librispeech/ASR/conv_emformer_transducer_stateless*/*.py: E501, E203
egs/librispeech/ASR/conformer_ctc2/*py: E501,
egs/librispeech/ASR/RESULTS.md: E999,
134 changes: 112 additions & 22 deletions egs/librispeech/ASR/RESULTS.md
@@ -1,11 +1,99 @@
## Results

#### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T + multi-dataset)
### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T + gradient filter)

[lstm_transducer_stateless2](./lstm_transducer_stateless2)
#### [lstm_transducer_stateless3](./lstm_transducer_stateless3)

See <https://github.com/k2-fsa/icefall/pull/558> for more details.
It implements an LSTM-based model with mechanisms from the reworked model for streaming ASR.
A gradient filter is applied inside each LSTM module to stabilize the training.

See <https://github.com/k2-fsa/icefall/pull/564> for more details.
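The commit messages above mention a median-based cutoff and a soft mask; as a rough illustration of the idea (a generic sketch with hypothetical names, not the actual icefall implementation), such a filter scales down frames whose gradient norm is far above the median frame norm:

```python
import math

def soft_gradient_filter(grads, threshold=25.0, eps=1e-20):
    """Soft-mask gradient filter: scale down frames whose gradient norm
    exceeds `threshold` times the median frame norm.

    grads: list of per-frame gradient vectors (lists of floats).
    """
    # L2 norm of each frame's gradient.
    norms = [math.sqrt(sum(g * g for g in frame)) for frame in grads]
    # The median norm is a robust reference scale, insensitive to outlier frames.
    median = sorted(norms)[len(norms) // 2]
    out = []
    for frame, norm in zip(grads, norms):
        # Soft mask in (0, 1]: typical frames pass unchanged, outliers are shrunk.
        scale = min(1.0, threshold * median / (norm + eps))
        out.append([g * scale for g in frame])
    return out
```

With a threshold of 25.0 (matching the `--grad-norm-threshold 25.0` flag in the training command below), a frame whose gradient norm is 1000x the median is scaled down to 25x the median, while typical frames are left untouched.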

##### training on full librispeech

This model contains 12 encoder layers (LSTM module + Feedforward module). The number of model parameters is 84689496.

The WERs are:

| | test-clean | test-other | comment | decoding mode |
|-------------------------------------|------------|------------|----------------------|----------------------|
| greedy search (max sym per frame 1) | 3.66 | 9.51 | --epoch 40 --avg 15 | simulated streaming |
| greedy search (max sym per frame 1) | 3.66 | 9.48 | --epoch 40 --avg 15 | streaming |
| fast beam search | 3.55 | 9.33 | --epoch 40 --avg 15 | simulated streaming |
| fast beam search | 3.57 | 9.25 | --epoch 40 --avg 15 | streaming |
| modified beam search | 3.55 | 9.28 | --epoch 40 --avg 15 | simulated streaming |
| modified beam search | 3.54 | 9.25 | --epoch 40 --avg 15 | streaming |

Note: `simulated streaming` indicates feeding the full utterance during decoding, while `streaming` indicates feeding a certain number of frames at a time.
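The difference comes down to how features reach the decoder: simulated streaming passes the whole utterance at once, while streaming iterates over fixed-size chunks. A generic sketch of the chunked iteration (an illustration, not icefall's actual API):

```python
def stream_chunks(frames, chunk_size):
    """Yield successive fixed-size chunks of a feature sequence,
    as a streaming decoder would consume them (last chunk may be shorter)."""
    for start in range(0, len(frames), chunk_size):
        yield frames[start:start + chunk_size]
```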


The training command is:

```bash
./lstm_transducer_stateless3/train.py \
--world-size 4 \
--num-epochs 40 \
--start-epoch 1 \
--exp-dir lstm_transducer_stateless3/exp \
--full-libri 1 \
--max-duration 500 \
--master-port 12325 \
--num-encoder-layers 12 \
--grad-norm-threshold 25.0 \
--rnn-hidden-size 1024
```

The tensorboard log can be found at
<https://tensorboard.dev/experiment/caNPyr5lT8qAl9qKsXEeEQ/>

The simulated streaming decoding command using greedy search, fast beam search, and modified beam search is:
```bash
for decoding_method in greedy_search fast_beam_search modified_beam_search; do
./lstm_transducer_stateless3/decode.py \
--epoch 40 \
--avg 15 \
--exp-dir lstm_transducer_stateless3/exp \
--max-duration 600 \
--num-encoder-layers 12 \
--rnn-hidden-size 1024 \
--decoding-method $decoding_method \
--use-averaged-model True \
--beam 4 \
--max-contexts 4 \
--max-states 8 \
--beam-size 4
done
```

The streaming decoding command using greedy search, fast beam search, and modified beam search is:
```bash
for decoding_method in greedy_search fast_beam_search modified_beam_search; do
./lstm_transducer_stateless3/streaming_decode.py \
--epoch 40 \
--avg 15 \
--exp-dir lstm_transducer_stateless3/exp \
--max-duration 600 \
--num-encoder-layers 12 \
--rnn-hidden-size 1024 \
--decoding-method $decoding_method \
--use-averaged-model True \
--beam 4 \
--max-contexts 4 \
--max-states 8 \
--beam-size 4
done
```

Pretrained models, training logs, decoding logs, and decoding results
are available at
<https://huggingface.co/Zengwei/icefall-asr-librispeech-lstm-transducer-stateless3-2022-09-28>


### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T + multi-dataset)

#### [lstm_transducer_stateless2](./lstm_transducer_stateless2)

See <https://github.com/k2-fsa/icefall/pull/558> for more details.

The WERs are:

@@ -18,6 +106,7 @@ The WERs are:
| modified_beam_search | 2.75 | 7.08 | --iter 472000 --avg 18 |
| fast_beam_search | 2.77 | 7.29 | --iter 472000 --avg 18 |


The training command is:

```bash
@@ -70,15 +159,16 @@ Pretrained models, training logs, decoding logs, and decoding results
are available at
<https://huggingface.co/csukuangfj/icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03>

#### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T)

[lstm_transducer_stateless](./lstm_transducer_stateless)
### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T)

#### [lstm_transducer_stateless](./lstm_transducer_stateless)

It implements an LSTM-based model with mechanisms from the reworked model for streaming ASR.

See <https://github.com/k2-fsa/icefall/pull/479> for more details.

#### training on full librispeech
##### training on full librispeech

This model contains 12 encoder layers (LSTM module + Feedforward module). The number of model parameters is 84689496.

@@ -165,7 +255,7 @@ It is modified from [torchaudio](https://github.com/pytorch/audio).

See <https://github.com/k2-fsa/icefall/pull/440> for more details.

#### With lower latency setup, training on full librispeech
##### With lower latency setup, training on full librispeech

In this model, the lengths of chunk and right context are 32 frames (i.e., 0.32s) and 8 frames (i.e., 0.08s), respectively.
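Given the 10 ms frame shift implied above (32 frames = 0.32 s), the per-chunk look-ahead can be checked with a small helper (assuming, as an illustration, that the algorithmic latency is chunk length plus right context; this is a sketch, not an icefall utility):

```python
FRAME_SHIFT_S = 0.01  # 10 ms per frame, implied by 32 frames = 0.32 s

def algorithmic_latency_s(chunk_frames, right_context_frames):
    """Seconds of audio that must arrive before a chunk can be processed."""
    return (chunk_frames + right_context_frames) * FRAME_SHIFT_S
```

For this lower-latency setup that gives 0.40 s per chunk; the higher-latency setup below (64 + 16 frames) gives 0.80 s.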

@@ -316,7 +406,7 @@ Pretrained models, training logs, decoding logs, and decoding results
are available at
<https://huggingface.co/Zengwei/icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05>

#### With higher latency setup, training on full librispeech
##### With higher latency setup, training on full librispeech

In this model, the lengths of chunk and right context are 64 frames (i.e., 0.64s) and 16 frames (i.e., 0.16s), respectively.

@@ -851,14 +941,14 @@ Pre-trained models, training and decoding logs, and decoding results are availab

### LibriSpeech BPE training results (Pruned Stateless Conv-Emformer RNN-T)

[conv_emformer_transducer_stateless](./conv_emformer_transducer_stateless)
#### [conv_emformer_transducer_stateless](./conv_emformer_transducer_stateless)

It implements [Emformer](https://arxiv.org/abs/2010.10759) augmented with convolution module for streaming ASR.
It is modified from [torchaudio](https://github.com/pytorch/audio).

See <https://github.com/k2-fsa/icefall/pull/389> for more details.

#### Training on full librispeech
##### Training on full librispeech

In this model, the lengths of chunk and right context are 32 frames (i.e., 0.32s) and 8 frames (i.e., 0.08s), respectively.

@@ -1011,7 +1101,7 @@ are available at

### LibriSpeech BPE training results (Pruned Stateless Emformer RNN-T)

[pruned_stateless_emformer_rnnt2](./pruned_stateless_emformer_rnnt2)
#### [pruned_stateless_emformer_rnnt2](./pruned_stateless_emformer_rnnt2)

Use <https://github.com/k2-fsa/icefall/pull/390>.

@@ -1079,7 +1169,7 @@ results at:

### LibriSpeech BPE training results (Pruned Stateless Transducer 5)

[pruned_transducer_stateless5](./pruned_transducer_stateless5)
#### [pruned_transducer_stateless5](./pruned_transducer_stateless5)

Same as `Pruned Stateless Transducer 2` but with more layers.

@@ -1092,7 +1182,7 @@ The notations `large` and `medium` below are from the [Conformer](https://arxiv.
paper, where the large model has about 118 M parameters and the medium model
has 30.8 M parameters.

#### Large
##### Large

Number of model parameters 118129516 (i.e., 118.13 M).

@@ -1152,7 +1242,7 @@ results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless5-2022-07-07>


#### Medium
##### Medium

Number of model parameters 30896748 (i.e., 30.9 M).

@@ -1212,7 +1302,7 @@ results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless5-M-2022-07-07>


#### Baseline-2
##### Baseline-2

It has 88.98 M parameters. Compared to the model in pruned_transducer_stateless2, it has more
layers (24 vs. 12) but is narrower (1536 feedforward dim and 384 encoder dim vs. 2048 feedforward dim and 512 encoder dim).
@@ -1273,13 +1363,13 @@ results at:

### LibriSpeech BPE training results (Pruned Stateless Transducer 4)

[pruned_transducer_stateless4](./pruned_transducer_stateless4)
#### [pruned_transducer_stateless4](./pruned_transducer_stateless4)

This version saves the averaged model during training and decodes with the averaged model.

See <https://github.com/k2-fsa/icefall/issues/337> for details about the idea of model averaging.
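The idea of checkpoint averaging can be sketched generically (plain dicts of floats stand in for model state dicts; this is an illustration of the concept, not the icefall implementation):

```python
def average_checkpoints(checkpoints):
    """Element-wise average of parameters across saved checkpoints.

    checkpoints: list of dicts mapping parameter name -> value.
    """
    n = len(checkpoints)
    avg = {}
    for ckpt in checkpoints:
        for name, value in ckpt.items():
            # Accumulate each parameter's contribution to the mean.
            avg[name] = avg.get(name, 0.0) + value / n
    return avg
```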

#### Training on full librispeech
##### Training on full librispeech

See <https://github.com/k2-fsa/icefall/pull/344>

@@ -1355,7 +1445,7 @@ Pretrained models, training logs, decoding logs, and decoding results
are available at
<https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless4-2022-06-03>

#### Training on train-clean-100
##### Training on train-clean-100

See <https://github.com/k2-fsa/icefall/pull/344>

@@ -1392,7 +1482,7 @@ The tensorboard log can be found at

### LibriSpeech BPE training results (Pruned Stateless Transducer 3, 2022-04-29)

[pruned_transducer_stateless3](./pruned_transducer_stateless3)
#### [pruned_transducer_stateless3](./pruned_transducer_stateless3)
Same as `Pruned Stateless Transducer 2` but using the XL subset from
[GigaSpeech](https://github.com/SpeechColab/GigaSpeech) as extra training data.

@@ -1606,10 +1696,10 @@ can be found at

### LibriSpeech BPE training results (Pruned Transducer 2)

[pruned_transducer_stateless2](./pruned_transducer_stateless2)
#### [pruned_transducer_stateless2](./pruned_transducer_stateless2)
This is with a reworked version of the conformer encoder, with many changes.

#### Training on fulll librispeech
##### Training on full librispeech

Using commit `34aad74a2c849542dd5f6359c9e6b527e8782fd6`.
See <https://github.com/k2-fsa/icefall/pull/288>
@@ -1658,7 +1748,7 @@ can be found at
<https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless2-2022-04-29>


#### Training on train-clean-100:
##### Training on train-clean-100:

Trained with 1 job:
```
1 change: 1 addition & 0 deletions egs/librispeech/ASR/lstm_transducer_stateless3/__init__.py
