Gradient filter for training lstm model (#564)
* init files

* add gradient filter module

* refact getting median value

* add cutoff for grad filter

* delete comments

* apply gradient filter in LSTM module, to filter both input and params

* fix typing and refactor

* filter with soft mask

* rename lstm_transducer_stateless2 to lstm_transducer_stateless3

* fix typos, and update RESULTS.md

* minor fix

* fix return typing

* fix typo
yaozengwei authored Sep 29, 2022
1 parent 923b60a commit f3ad327
Showing 23 changed files with 5,448 additions and 28 deletions.
2 changes: 1 addition & 1 deletion .flake8
@@ -9,7 +9,7 @@ per-file-ignores =
egs/*/ASR/pruned_transducer_stateless*/*.py: E501,
egs/*/ASR/*/optim.py: E501,
egs/*/ASR/*/scaling.py: E501,
egs/librispeech/ASR/lstm_transducer_stateless/*.py: E501, E203
egs/librispeech/ASR/lstm_transducer_stateless*/*.py: E501, E203
egs/librispeech/ASR/conv_emformer_transducer_stateless*/*.py: E501, E203
egs/librispeech/ASR/conformer_ctc2/*py: E501,
egs/librispeech/ASR/RESULTS.md: E999,
134 changes: 112 additions & 22 deletions egs/librispeech/ASR/RESULTS.md
@@ -1,11 +1,99 @@
## Results

#### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T + multi-dataset)
### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T + gradient filter)

[lstm_transducer_stateless2](./lstm_transducer_stateless2)
#### [lstm_transducer_stateless3](./lstm_transducer_stateless3)

See <https://github.com/k2-fsa/icefall/pull/558> for more details.
It implements an LSTM-based model with mechanisms from the reworked model for streaming ASR.
A gradient filter is applied inside each LSTM module to stabilize the training.

See <https://github.com/k2-fsa/icefall/pull/564> for more details.
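The commit messages above mention a median-based cutoff and a soft mask; as a rough illustration of the idea (a generic sketch with hypothetical names, not the actual icefall implementation), such a filter scales down frames whose gradient norm is far above the median frame norm:

```python
import math

def soft_gradient_filter(grads, threshold=25.0, eps=1e-20):
    """Soft-mask gradient filter: scale down frames whose gradient norm
    exceeds `threshold` times the median frame norm.

    grads: list of per-frame gradient vectors (lists of floats).
    """
    # L2 norm of each frame's gradient.
    norms = [math.sqrt(sum(g * g for g in frame)) for frame in grads]
    # The median norm is a robust reference scale, insensitive to outlier frames.
    median = sorted(norms)[len(norms) // 2]
    out = []
    for frame, norm in zip(grads, norms):
        # Soft mask in (0, 1]: typical frames pass unchanged, outliers are shrunk.
        scale = min(1.0, threshold * median / (norm + eps))
        out.append([g * scale for g in frame])
    return out
```

With a threshold of 25.0 (matching the `--grad-norm-threshold 25.0` flag in the training command below), a frame whose gradient norm is 1000x the median is scaled down to 25x the median, while typical frames are left untouched.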

##### training on full librispeech

This model contains 12 encoder layers (LSTM module + Feedforward module). The number of model parameters is 84689496.

The WERs are:

| | test-clean | test-other | comment | decoding mode |
|-------------------------------------|------------|------------|----------------------|----------------------|
| greedy search (max sym per frame 1) | 3.66 | 9.51 | --epoch 40 --avg 15 | simulated streaming |
| greedy search (max sym per frame 1) | 3.66 | 9.48 | --epoch 40 --avg 15 | streaming |
| fast beam search | 3.55 | 9.33 | --epoch 40 --avg 15 | simulated streaming |
| fast beam search | 3.57 | 9.25 | --epoch 40 --avg 15 | streaming |
| modified beam search | 3.55 | 9.28 | --epoch 40 --avg 15 | simulated streaming |
| modified beam search | 3.54 | 9.25 | --epoch 40 --avg 15 | streaming |

Note: `simulated streaming` indicates feeding the full utterance during decoding, while `streaming` indicates feeding a certain number of frames at a time.
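The difference comes down to how features reach the decoder: simulated streaming passes the whole utterance at once, while streaming iterates over fixed-size chunks. A generic sketch of the chunked iteration (an illustration, not icefall's actual API):

```python
def stream_chunks(frames, chunk_size):
    """Yield successive fixed-size chunks of a feature sequence,
    as a streaming decoder would consume them (last chunk may be shorter)."""
    for start in range(0, len(frames), chunk_size):
        yield frames[start:start + chunk_size]
```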


The training command is:

```bash
./lstm_transducer_stateless3/train.py \
--world-size 4 \
--num-epochs 40 \
--start-epoch 1 \
--exp-dir lstm_transducer_stateless3/exp \
--full-libri 1 \
--max-duration 500 \
--master-port 12325 \
--num-encoder-layers 12 \
--grad-norm-threshold 25.0 \
--rnn-hidden-size 1024
```

The tensorboard log can be found at
<https://tensorboard.dev/experiment/caNPyr5lT8qAl9qKsXEeEQ/>

The simulated streaming decoding command using greedy search, fast beam search, and modified beam search is:
```bash
for decoding_method in greedy_search fast_beam_search modified_beam_search; do
./lstm_transducer_stateless3/decode.py \
--epoch 40 \
--avg 15 \
--exp-dir lstm_transducer_stateless3/exp \
--max-duration 600 \
--num-encoder-layers 12 \
--rnn-hidden-size 1024 \
--decoding-method $decoding_method \
--use-averaged-model True \
--beam 4 \
--max-contexts 4 \
--max-states 8 \
--beam-size 4
done
```

The streaming decoding command using greedy search, fast beam search, and modified beam search is:
```bash
for decoding_method in greedy_search fast_beam_search modified_beam_search; do
./lstm_transducer_stateless3/streaming_decode.py \
--epoch 40 \
--avg 15 \
--exp-dir lstm_transducer_stateless3/exp \
--max-duration 600 \
--num-encoder-layers 12 \
--rnn-hidden-size 1024 \
--decoding-method $decoding_method \
--use-averaged-model True \
--beam 4 \
--max-contexts 4 \
--max-states 8 \
--beam-size 4
done
```

Pretrained models, training logs, decoding logs, and decoding results
are available at
<https://huggingface.co/Zengwei/icefall-asr-librispeech-lstm-transducer-stateless3-2022-09-28>


### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T + multi-dataset)

#### [lstm_transducer_stateless2](./lstm_transducer_stateless2)

See <https://github.com/k2-fsa/icefall/pull/558> for more details.

The WERs are:

@@ -18,6 +106,7 @@ The WERs are:
| modified_beam_search | 2.75 | 7.08 | --iter 472000 --avg 18 |
| fast_beam_search | 2.77 | 7.29 | --iter 472000 --avg 18 |


The training command is:

```bash
@@ -70,15 +159,16 @@ Pretrained models, training logs, decoding logs, and decoding results
are available at
<https://huggingface.co/csukuangfj/icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03>

#### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T)

[lstm_transducer_stateless](./lstm_transducer_stateless)
### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T)

#### [lstm_transducer_stateless](./lstm_transducer_stateless)

It implements an LSTM-based model with mechanisms from the reworked model for streaming ASR.

See <https://github.com/k2-fsa/icefall/pull/479> for more details.

#### training on full librispeech
##### training on full librispeech

This model contains 12 encoder layers (LSTM module + Feedforward module). The number of model parameters is 84689496.

@@ -165,7 +255,7 @@ It is modified from [torchaudio](https://github.com/pytorch/audio).

See <https://github.com/k2-fsa/icefall/pull/440> for more details.

#### With lower latency setup, training on full librispeech
##### With lower latency setup, training on full librispeech

In this model, the lengths of chunk and right context are 32 frames (i.e., 0.32s) and 8 frames (i.e., 0.08s), respectively.
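Given the 10 ms frame shift implied above (32 frames = 0.32 s), the per-chunk look-ahead can be checked with a small helper (assuming, as an illustration, that the algorithmic latency is chunk length plus right context; this is a sketch, not an icefall utility):

```python
FRAME_SHIFT_S = 0.01  # 10 ms per frame, implied by 32 frames = 0.32 s

def algorithmic_latency_s(chunk_frames, right_context_frames):
    """Seconds of audio that must arrive before a chunk can be processed."""
    return (chunk_frames + right_context_frames) * FRAME_SHIFT_S
```

For this lower-latency setup that gives 0.40 s per chunk; the higher-latency setup below (64 + 16 frames) gives 0.80 s.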

@@ -316,7 +406,7 @@ Pretrained models, training logs, decoding logs, and decoding results
are available at
<https://huggingface.co/Zengwei/icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05>

#### With higher latency setup, training on full librispeech
##### With higher latency setup, training on full librispeech

In this model, the lengths of chunk and right context are 64 frames (i.e., 0.64s) and 16 frames (i.e., 0.16s), respectively.

@@ -851,14 +941,14 @@ Pre-trained models, training and decoding logs, and decoding results are availab

### LibriSpeech BPE training results (Pruned Stateless Conv-Emformer RNN-T)

[conv_emformer_transducer_stateless](./conv_emformer_transducer_stateless)
#### [conv_emformer_transducer_stateless](./conv_emformer_transducer_stateless)

It implements [Emformer](https://arxiv.org/abs/2010.10759) augmented with convolution module for streaming ASR.
It is modified from [torchaudio](https://github.com/pytorch/audio).

See <https://github.com/k2-fsa/icefall/pull/389> for more details.

#### Training on full librispeech
##### Training on full librispeech

In this model, the lengths of chunk and right context are 32 frames (i.e., 0.32s) and 8 frames (i.e., 0.08s), respectively.

@@ -1011,7 +1101,7 @@ are available at

### LibriSpeech BPE training results (Pruned Stateless Emformer RNN-T)

[pruned_stateless_emformer_rnnt2](./pruned_stateless_emformer_rnnt2)
#### [pruned_stateless_emformer_rnnt2](./pruned_stateless_emformer_rnnt2)

Use <https://github.com/k2-fsa/icefall/pull/390>.

@@ -1079,7 +1169,7 @@ results at:

### LibriSpeech BPE training results (Pruned Stateless Transducer 5)

[pruned_transducer_stateless5](./pruned_transducer_stateless5)
#### [pruned_transducer_stateless5](./pruned_transducer_stateless5)

Same as `Pruned Stateless Transducer 2` but with more layers.

@@ -1092,7 +1182,7 @@ The notations `large` and `medium` below are from the [Conformer](https://arxiv.
paper, where the large model has about 118 M parameters and the medium model
has 30.8 M parameters.

#### Large
##### Large

Number of model parameters 118129516 (i.e., 118.13 M).

@@ -1152,7 +1242,7 @@ results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless5-2022-07-07>


#### Medium
##### Medium

Number of model parameters 30896748 (i.e., 30.9 M).

@@ -1212,7 +1302,7 @@ results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless5-M-2022-07-07>


#### Baseline-2
##### Baseline-2

It has 88.98 M parameters. Compared to the model in pruned_transducer_stateless2, it has more
layers (24 vs. 12) but is narrower (1536 feedforward dim and 384 encoder dim vs. 2048 feedforward dim and 512 encoder dim).
@@ -1273,13 +1363,13 @@ results at:

### LibriSpeech BPE training results (Pruned Stateless Transducer 4)

[pruned_transducer_stateless4](./pruned_transducer_stateless4)
#### [pruned_transducer_stateless4](./pruned_transducer_stateless4)

This version saves the averaged model during training and decodes with the averaged model.

See <https://github.com/k2-fsa/icefall/issues/337> for details about the idea of model averaging.
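The idea of checkpoint averaging can be sketched generically (plain dicts of floats stand in for model state dicts; this is an illustration of the concept, not the icefall implementation):

```python
def average_checkpoints(checkpoints):
    """Element-wise average of parameters across saved checkpoints.

    checkpoints: list of dicts mapping parameter name -> value.
    """
    n = len(checkpoints)
    avg = {}
    for ckpt in checkpoints:
        for name, value in ckpt.items():
            # Accumulate each parameter's contribution to the mean.
            avg[name] = avg.get(name, 0.0) + value / n
    return avg
```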

#### Training on full librispeech
##### Training on full librispeech

See <https://github.com/k2-fsa/icefall/pull/344>

@@ -1355,7 +1445,7 @@ Pretrained models, training logs, decoding logs, and decoding results
are available at
<https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless4-2022-06-03>

#### Training on train-clean-100
##### Training on train-clean-100

See <https://github.com/k2-fsa/icefall/pull/344>

@@ -1392,7 +1482,7 @@ The tensorboard log can be found at

### LibriSpeech BPE training results (Pruned Stateless Transducer 3, 2022-04-29)

[pruned_transducer_stateless3](./pruned_transducer_stateless3)
#### [pruned_transducer_stateless3](./pruned_transducer_stateless3)
Same as `Pruned Stateless Transducer 2` but using the XL subset from
[GigaSpeech](https://github.com/SpeechColab/GigaSpeech) as extra training data.

@@ -1606,10 +1696,10 @@ can be found at

### LibriSpeech BPE training results (Pruned Transducer 2)

[pruned_transducer_stateless2](./pruned_transducer_stateless2)
#### [pruned_transducer_stateless2](./pruned_transducer_stateless2)
This is with a reworked version of the conformer encoder, with many changes.

#### Training on fulll librispeech
##### Training on full librispeech

Using commit `34aad74a2c849542dd5f6359c9e6b527e8782fd6`.
See <https://github.com/k2-fsa/icefall/pull/288>
@@ -1658,7 +1748,7 @@ can be found at
<https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless2-2022-04-29>


#### Training on train-clean-100:
##### Training on train-clean-100:

Trained with 1 job:
```
1 change: 1 addition & 0 deletions egs/librispeech/ASR/lstm_transducer_stateless3/__init__.py
