| 1 | +# Results |
| 2 | + |
| 3 | +## Streaming Zipformer-Transducer (Pruned Stateless Transducer + Streaming Zipformer) |
| 4 | + |
| 5 | +### [pruned_transducer_stateless7_streaming](./pruned_transducer_stateless7_streaming) |
| 6 | + |
| 7 | +See <https://github.com/k2-fsa/icefall/pull/892> for more details. |
| 8 | + |
| 9 | +You can find a pretrained model, training logs, decoding logs, and decoding results at: |
| 10 | +<https://huggingface.co/TeoWenShen/icefall-asr-csj-pruned-transducer-stateless7-streaming-230208> |
| 11 | + |
| 12 | +Number of model parameters: 75688409, i.e. 75.7M. |
| 13 | + |
| 14 | +#### Training on disfluent transcript |
| 15 | + |
| 16 | +The CERs are: |
| 17 | + |
| 18 | +| decoding method | chunk size | eval1 | eval2 | eval3 | excluded | valid | checkpoint averaging | decoding mode | |
| 19 | +| --------------- | ---------- | ----- | ----- | ----- | -------- | ----- | ------- | ------------- | |
| 20 | +| fast beam search | 320ms | 5.39 | 4.08 | 4.16 | 5.40 | 5.02 | --epoch 30 --avg 17 | simulated streaming | |
| 21 | +| fast beam search | 320ms | 5.34 | 4.10 | 4.26 | 5.61 | 4.91 | --epoch 30 --avg 17 | chunk-wise | |
| 22 | +| greedy search | 320ms | 5.43 | 4.14 | 4.31 | 5.48 | 4.88 | --epoch 30 --avg 17 | simulated streaming | |
| 23 | +| greedy search | 320ms | 5.44 | 4.14 | 4.39 | 5.70 | 4.98 | --epoch 30 --avg 17 | chunk-wise | |
| 24 | +| modified beam search | 320ms | 5.20 | 3.95 | 4.09 | 5.12 | 4.75 | --epoch 30 --avg 17 | simulated streaming | |
| 25 | +| modified beam search | 320ms | 5.18 | 4.07 | 4.12 | 5.36 | 4.77 | --epoch 30 --avg 17 | chunk-wise | |
| 26 | +| fast beam search | 640ms | 5.01 | 3.78 | 3.96 | 4.85 | 4.60 | --epoch 30 --avg 17 | simulated streaming | |
| 27 | +| fast beam search | 640ms | 4.97 | 3.88 | 3.96 | 4.91 | 4.61 | --epoch 30 --avg 17 | chunk-wise | |
| 28 | +| greedy search | 640ms | 5.02 | 3.84 | 4.14 | 5.02 | 4.59 | --epoch 30 --avg 17 | simulated streaming | |
| 29 | +| greedy search | 640ms | 5.32 | 4.22 | 4.33 | 5.39 | 4.99 | --epoch 30 --avg 17 | chunk-wise | |
| 30 | +| modified beam search | 640ms | 4.78 | 3.66 | 3.85 | 4.72 | 4.42 | --epoch 30 --avg 17 | simulated streaming | |
| 31 | +| modified beam search | 640ms | 5.77 | 4.72 | 4.73 | 5.85 | 5.36 | --epoch 30 --avg 17 | chunk-wise | |
| 32 | + |
| 33 | +Note: `simulated streaming` means the full utterance is fed to the model at once during decoding with `decode.py`, |
| 34 | +while `chunk-wise` means a fixed number of frames is fed at each step with `streaming_decode.py`. |
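
The chunk sizes in the table can be related to the `--decode-chunk-len` values passed to the decoding scripts in this section. The following is a minimal sketch, assuming a 10 ms feature frame shift. That value is an assumption not stated in this document, although it is consistent with chunk lengths of 32 and 64 frames being reported as 320ms and 640ms. The function name `chunk_len_to_ms` is hypothetical.

```python
# Assumed frame shift in milliseconds; not stated in this document.
FRAME_SHIFT_MS = 10


def chunk_len_to_ms(decode_chunk_len: int, frame_shift_ms: int = FRAME_SHIFT_MS) -> int:
    """Convert a chunk length in feature frames to a duration in milliseconds."""
    return decode_chunk_len * frame_shift_ms


# The two chunk lengths used in the decoding loops.
for frames in (64, 32):
    print(f"--decode-chunk-len {frames} -> {chunk_len_to_ms(frames)}ms")
```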
| 35 | + |
| 36 | +The training command was: |
| 37 | +```bash |
| 38 | +./pruned_transducer_stateless7_streaming/train.py \ |
| 39 | + --feedforward-dims "1024,1024,2048,2048,1024" \ |
| 40 | + --world-size 8 \ |
| 41 | + --num-epochs 30 \ |
| 42 | + --start-epoch 1 \ |
| 43 | + --use-fp16 1 \ |
| 44 | + --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \ |
| 45 | + --max-duration 375 \ |
| 46 | + --transcript-mode disfluent \ |
| 47 | + --lang data/lang_char \ |
| 48 | + --manifest-dir /mnt/host/corpus/csj/fbank \ |
| 49 | + --pad-feature 30 \ |
| 50 | + --musan-dir /mnt/host/corpus/musan/musan/fbank |
| 51 | +``` |
| 52 | + |
| 53 | +The simulated streaming decoding command was: |
| 54 | +```bash |
| 55 | +for chunk in 64 32; do |
| 56 | + for m in greedy_search fast_beam_search modified_beam_search; do |
| 57 | + python pruned_transducer_stateless7_streaming/decode.py \ |
| 58 | + --feedforward-dims "1024,1024,2048,2048,1024" \ |
| 59 | + --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \ |
| 60 | + --epoch 30 \ |
| 61 | + --avg 17 \ |
| 62 | + --max-duration 350 \ |
| 63 | + --decoding-method $m \ |
| 64 | + --manifest-dir /mnt/host/corpus/csj/fbank \ |
| 65 | + --lang data/lang_char \ |
| 66 | + --transcript-mode disfluent \ |
| 67 | + --res-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30/github/sim_"$chunk"_"$m" \ |
| 68 | + --decode-chunk-len $chunk \ |
| 69 | + --pad-feature 30 \ |
| 70 | + --gpu 0 |
| 71 | + done |
| 72 | +done |
| 73 | +``` |
| 74 | + |
| 75 | +The streaming chunk-wise decoding command was: |
| 76 | +```bash |
| 77 | +for chunk in 64 32; do |
| 78 | + for m in greedy_search fast_beam_search modified_beam_search; do |
| 79 | + python pruned_transducer_stateless7_streaming/streaming_decode.py \ |
| 80 | + --feedforward-dims "1024,1024,2048,2048,1024" \ |
| 81 | + --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \ |
| 82 | + --epoch 30 \ |
| 83 | + --avg 17 \ |
| 84 | + --max-duration 350 \ |
| 85 | + --decoding-method $m \ |
| 86 | + --manifest-dir /mnt/host/corpus/csj/fbank \ |
| 87 | + --lang data/lang_char \ |
| 88 | + --transcript-mode disfluent \ |
| 89 | + --res-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30/github/stream_"$chunk"_"$m" \ |
| 90 | + --decode-chunk-len $chunk \ |
| 91 | + --gpu 2 \ |
| 92 | + --num-decode-streams 40 |
| 93 | + done |
| 94 | +done |
| 95 | +``` |
| 96 | + |
| 97 | +#### Training on fluent transcript |
| 98 | + |
| 99 | +The CERs are: |
| 100 | + |
| 101 | +| decoding method | chunk size | eval1 | eval2 | eval3 | excluded | valid | checkpoint averaging | decoding mode | |
| 102 | +| --------------- | ---------- | ----- | ----- | ----- | -------- | ----- | ------- | ------------- | |
| 103 | +| fast beam search | 320ms | 4.19 | 3.63 | 3.77 | 4.43 | 4.09 | --epoch 30 --avg 12 | simulated streaming | |
| 104 | +| fast beam search | 320ms | 4.06 | 3.55 | 3.66 | 4.70 | 4.04 | --epoch 30 --avg 12 | chunk-wise | |
| 105 | +| greedy search | 320ms | 4.22 | 3.62 | 3.82 | 4.45 | 3.98 | --epoch 30 --avg 12 | simulated streaming | |
| 106 | +| greedy search | 320ms | 4.13 | 3.61 | 3.85 | 4.67 | 4.05 | --epoch 30 --avg 12 | chunk-wise | |
| 107 | +| modified beam search | 320ms | 4.02 | 3.43 | 3.62 | 4.43 | 3.81 | --epoch 30 --avg 12 | simulated streaming | |
| 108 | +| modified beam search | 320ms | 3.97 | 3.43 | 3.59 | 4.99 | 3.88 | --epoch 30 --avg 12 | chunk-wise | |
| 109 | +| fast beam search | 640ms | 3.80 | 3.31 | 3.55 | 4.16 | 3.90 | --epoch 30 --avg 12 | simulated streaming | |
| 110 | +| fast beam search | 640ms | 3.81 | 3.34 | 3.46 | 4.58 | 3.85 | --epoch 30 --avg 12 | chunk-wise | |
| 111 | +| greedy search | 640ms | 3.92 | 3.38 | 3.65 | 4.31 | 3.88 | --epoch 30 --avg 12 | simulated streaming | |
| 112 | +| greedy search | 640ms | 3.98 | 3.38 | 3.64 | 4.54 | 4.01 | --epoch 30 --avg 12 | chunk-wise | |
| 113 | +| modified beam search | 640ms | 3.72 | 3.26 | 3.39 | 4.10 | 3.65 | --epoch 30 --avg 12 | simulated streaming | |
| 114 | +| modified beam search | 640ms | 3.78 | 3.32 | 3.45 | 4.81 | 3.81 | --epoch 30 --avg 12 | chunk-wise | |
| 115 | + |
| 116 | +Note: `simulated streaming` means the full utterance is fed to the model at once during decoding with `decode.py`, |
| 117 | +while `chunk-wise` means a fixed number of frames is fed at each step with `streaming_decode.py`. |
| 118 | + |
| 119 | +The training command was: |
| 120 | +```bash |
| 121 | +./pruned_transducer_stateless7_streaming/train.py \ |
| 122 | + --feedforward-dims "1024,1024,2048,2048,1024" \ |
| 123 | + --world-size 8 \ |
| 124 | + --num-epochs 30 \ |
| 125 | + --start-epoch 1 \ |
| 126 | + --use-fp16 1 \ |
| 127 | + --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30 \ |
| 128 | + --max-duration 375 \ |
| 129 | + --transcript-mode fluent \ |
| 130 | + --lang data/lang_char \ |
| 131 | + --manifest-dir /mnt/host/corpus/csj/fbank \ |
| 132 | + --pad-feature 30 \ |
| 133 | + --musan-dir /mnt/host/corpus/musan/musan/fbank |
| 134 | +``` |
| 135 | + |
| 136 | +The simulated streaming decoding command was: |
| 137 | +```bash |
| 138 | +for chunk in 64 32; do |
| 139 | + for m in greedy_search fast_beam_search modified_beam_search; do |
| 140 | + python pruned_transducer_stateless7_streaming/decode.py \ |
| 141 | + --feedforward-dims "1024,1024,2048,2048,1024" \ |
| 142 | + --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30 \ |
| 143 | + --epoch 30 \ |
| 144 | + --avg 12 \ |
| 145 | + --max-duration 350 \ |
| 146 | + --decoding-method $m \ |
| 147 | + --manifest-dir /mnt/host/corpus/csj/fbank \ |
| 148 | + --lang data/lang_char \ |
| 149 | + --transcript-mode fluent \ |
| 150 | + --res-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30/github/sim_"$chunk"_"$m" \ |
| 151 | + --decode-chunk-len $chunk \ |
| 152 | + --pad-feature 30 \ |
| 153 | + --gpu 1 |
| 154 | + done |
| 155 | +done |
| 156 | +``` |
| 157 | + |
| 158 | +The streaming chunk-wise decoding command was: |
| 159 | +```bash |
| 160 | +for chunk in 64 32; do |
| 161 | + for m in greedy_search fast_beam_search modified_beam_search; do |
| 162 | + python pruned_transducer_stateless7_streaming/streaming_decode.py \ |
| 163 | + --feedforward-dims "1024,1024,2048,2048,1024" \ |
| 164 | + --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30 \ |
| 165 | + --epoch 30 \ |
| 166 | + --avg 12 \ |
| 167 | + --max-duration 350 \ |
| 168 | + --decoding-method $m \ |
| 169 | + --manifest-dir /mnt/host/corpus/csj/fbank \ |
| 170 | + --lang data/lang_char \ |
| 171 | + --transcript-mode fluent \ |
| 172 | + --res-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30/github/stream_"$chunk"_"$m" \ |
| 173 | + --decode-chunk-len $chunk \ |
| 174 | + --gpu 3 \ |
| 175 | + --num-decode-streams 40 |
| 176 | + done |
| 177 | +done |
| 178 | +``` |
| 179 | + |
| 180 | +#### Comparing disfluent to fluent |
| 181 | + |
| 182 | +$$ \texttt{CER}^{f}_d = \frac{\texttt{sub}_f + \texttt{ins} + \texttt{del}_f}{N_f} $$ |
| 183 | + |
| 184 | +This comparison evaluates the disfluent model on the fluent transcript (calculated by `disfluent_recogs_to_fluent.py`), forgiving the disfluent model's mistakes on fillers and partial words. It is meant as an illustrative metric only, so that the disfluent and fluent models can be compared. |
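
The formula above can be illustrated with a small sketch. It assumes the tables report CER as a percentage, and it uses made-up error counts purely for illustration. The function name `cer_disfluent_on_fluent` is hypothetical, and this is not the code in `disfluent_recogs_to_fluent.py`.

```python
def cer_disfluent_on_fluent(sub_f: int, ins: int, del_f: int, n_f: int) -> float:
    """CER (in percent) of the disfluent model scored against the fluent transcript.

    sub_f, del_f: substitutions and deletions counted against the fluent reference.
    ins:          insertions made by the model.
    n_f:          number of characters in the fluent reference.
    """
    if n_f <= 0:
        raise ValueError("n_f must be positive")
    return 100.0 * (sub_f + ins + del_f) / n_f


# Illustrative counts only; these are not taken from the experiments.
print(cer_disfluent_on_fluent(sub_f=30, ins=10, del_f=10, n_f=1000))
```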
| 185 | + |
| 186 | +| decoding method | chunk size | eval1 (d vs f) | eval2 (d vs f) | eval3 (d vs f) | excluded (d vs f) | valid (d vs f) | decoding mode | |
| 187 | +| --------------- | ---------- | -------------- | --------------- | -------------- | -------------------- | --------------- | ----------- | |
| 188 | +| fast beam search | 320ms | 4.54 vs 4.19 | 3.44 vs 3.63 | 3.56 vs 3.77 | 4.22 vs 4.43 | 4.22 vs 4.09 | simulated streaming | |
| 189 | +| fast beam search | 320ms | 4.48 vs 4.06 | 3.41 vs 3.55 | 3.65 vs 3.66 | 4.26 vs 4.70 | 4.08 vs 4.04 | chunk-wise | |
| 190 | +| greedy search | 320ms | 4.53 vs 4.22 | 3.48 vs 3.62 | 3.69 vs 3.82 | 4.38 vs 4.45 | 4.05 vs 3.98 | simulated streaming | |
| 191 | +| greedy search | 320ms | 4.53 vs 4.13 | 3.46 vs 3.61 | 3.71 vs 3.85 | 4.48 vs 4.67 | 4.12 vs 4.05 | chunk-wise | |
| 192 | +| modified beam search | 320ms | 4.45 vs 4.02 | 3.38 vs 3.43 | 3.57 vs 3.62 | 4.19 vs 4.43 | 4.04 vs 3.81 | simulated streaming | |
| 193 | +| modified beam search | 320ms | 4.44 vs 3.97 | 3.47 vs 3.43 | 3.56 vs 3.59 | 4.28 vs 4.99 | 4.04 vs 3.88 | chunk-wise | |
| 194 | +| fast beam search | 640ms | 4.14 vs 3.80 | 3.12 vs 3.31 | 3.38 vs 3.55 | 3.72 vs 4.16 | 3.81 vs 3.90 | simulated streaming | |
| 195 | +| fast beam search | 640ms | 4.05 vs 3.81 | 3.23 vs 3.34 | 3.36 vs 3.46 | 3.65 vs 4.58 | 3.78 vs 3.85 | chunk-wise | |
| 196 | +| greedy search | 640ms | 4.10 vs 3.92 | 3.17 vs 3.38 | 3.50 vs 3.65 | 3.87 vs 4.31 | 3.77 vs 3.88 | simulated streaming | |
| 197 | +| greedy search | 640ms | 4.41 vs 3.98 | 3.56 vs 3.38 | 3.69 vs 3.64 | 4.26 vs 4.54 | 4.16 vs 4.01 | chunk-wise | |
| 198 | +| modified beam search | 640ms | 4.00 vs 3.72 | 3.08 vs 3.26 | 3.33 vs 3.39 | 3.75 vs 4.10 | 3.71 vs 3.65 | simulated streaming | |
| 199 | +| modified beam search | 640ms | 5.05 vs 3.78 | 4.22 vs 3.32 | 4.26 vs 3.45 | 5.02 vs 4.81 | 4.73 vs 3.81 | chunk-wise | |
| 200 | +| average (d - f) | | 0.43 | -0.02 | -0.02 | -0.34 | 0.13 | | |
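
The `average (d - f)` row appears to be the mean of the per-row differences in each column. As a check, the sketch below recomputes it for the eval1 column using the twelve `d vs f` pairs listed in the table. The variable names are hypothetical.

```python
# (disfluent, fluent) CER pairs for eval1, in table order.
eval1_pairs = [
    (4.54, 4.19), (4.48, 4.06), (4.53, 4.22), (4.53, 4.13),
    (4.45, 4.02), (4.44, 3.97), (4.14, 3.80), (4.05, 3.81),
    (4.10, 3.92), (4.41, 3.98), (4.00, 3.72), (5.05, 3.78),
]

# Mean of (d - f) over all decoding configurations.
mean_diff = sum(d - f for d, f in eval1_pairs) / len(eval1_pairs)
print(f"eval1 average (d - f): {mean_diff:.2f}")
```

A positive value means the disfluent model has a higher CER than the fluent model on that set when averaged over configurations.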