Commit e63a8c2

CSJ pruned_transducer_stateless7_streaming (#892)

* update manifest stats
* update transcript configs
* lang_char and compute_fbanks
* save cuts in fbank_dir
* add core codes
* update decode.py
* Create local/utils
* tidy up
* parse raw in prepare_lang_char.py
* working train
* Add compare_cer_transcript.py
* fix tokenizer decode, allow d2f only
* comment cleanup
* add export files and READMEs
* reword average column
* fix comments
* Update new results

1 parent 25ee50e commit e63a8c2

37 files changed (+5849 −1242 lines)

egs/csj/ASR/README.md (+11)

# Introduction

[./RESULTS.md](./RESULTS.md) contains the latest results.

# Transducers

These are the types of architectures currently available.

|                                           | Encoder             | Decoder            | Comment                                                         |
|-------------------------------------------|---------------------|--------------------|-----------------------------------------------------------------|
| `pruned_transducer_stateless7_streaming`  | Streaming Zipformer | Embedding + Conv1d | Adapted from librispeech `pruned_transducer_stateless7_streaming` |

egs/csj/ASR/RESULTS.md (+200)
# Results

## Streaming Zipformer-Transducer (Pruned Stateless Transducer + Streaming Zipformer)

### [pruned_transducer_stateless7_streaming](./pruned_transducer_stateless7_streaming)

See <https://github.com/k2-fsa/icefall/pull/892> for more details.

You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/TeoWenShen/icefall-asr-csj-pruned-transducer-stateless7-streaming-230208>

Number of model parameters: 75688409, i.e. 75.7M.

#### training on the disfluent transcript

The CERs are:

| decoding method | chunk size | eval1 | eval2 | eval3 | excluded | valid | average | decoding mode |
| --------------- | ---------- | ----- | ----- | ----- | -------- | ----- | ------- | ------------- |
| fast beam search | 320ms | 5.39 | 4.08 | 4.16 | 5.40 | 5.02 | --epoch 30 --avg 17 | simulated streaming |
| fast beam search | 320ms | 5.34 | 4.10 | 4.26 | 5.61 | 4.91 | --epoch 30 --avg 17 | chunk-wise |
| greedy search | 320ms | 5.43 | 4.14 | 4.31 | 5.48 | 4.88 | --epoch 30 --avg 17 | simulated streaming |
| greedy search | 320ms | 5.44 | 4.14 | 4.39 | 5.70 | 4.98 | --epoch 30 --avg 17 | chunk-wise |
| modified beam search | 320ms | 5.20 | 3.95 | 4.09 | 5.12 | 4.75 | --epoch 30 --avg 17 | simulated streaming |
| modified beam search | 320ms | 5.18 | 4.07 | 4.12 | 5.36 | 4.77 | --epoch 30 --avg 17 | chunk-wise |
| fast beam search | 640ms | 5.01 | 3.78 | 3.96 | 4.85 | 4.60 | --epoch 30 --avg 17 | simulated streaming |
| fast beam search | 640ms | 4.97 | 3.88 | 3.96 | 4.91 | 4.61 | --epoch 30 --avg 17 | chunk-wise |
| greedy search | 640ms | 5.02 | 3.84 | 4.14 | 5.02 | 4.59 | --epoch 30 --avg 17 | simulated streaming |
| greedy search | 640ms | 5.32 | 4.22 | 4.33 | 5.39 | 4.99 | --epoch 30 --avg 17 | chunk-wise |
| modified beam search | 640ms | 4.78 | 3.66 | 3.85 | 4.72 | 4.42 | --epoch 30 --avg 17 | simulated streaming |
| modified beam search | 640ms | 5.77 | 4.72 | 4.73 | 5.85 | 5.36 | --epoch 30 --avg 17 | chunk-wise |

Note: `simulated streaming` means the full utterance is fed at once during decoding with `decode.py`,
while `chunk-wise` means a fixed number of frames is fed at a time with `streaming_decode.py`.
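The distinction above can be illustrated with a minimal sketch. Assuming a 10 ms frame shift (so `--decode-chunk-len 32` corresponds to the 320 ms chunk size in the table), chunk-wise decoding slices the feature frames into fixed-size pieces instead of consuming the whole utterance at once; the helper below is illustrative only, not code from `streaming_decode.py`:

```python
# Illustrative sketch: chunk-wise decoding feeds a fixed number of frames
# at a time; simulated streaming feeds the full utterance in one call.
# The frame arithmetic assumes a 10 ms frame shift (an assumption, not
# something stated by streaming_decode.py itself).

def split_into_chunks(frames, decode_chunk_len):
    """Split a sequence of feature frames into consecutive fixed-size chunks."""
    return [
        frames[i : i + decode_chunk_len]
        for i in range(0, len(frames), decode_chunk_len)
    ]


# 320 ms at a 10 ms frame shift corresponds to a 32-frame chunk.
frames = list(range(100))  # stand-in for 100 feature frames (1 second)
chunks = split_into_chunks(frames, 32)
print(len(chunks))  # 4 chunks: 32 + 32 + 32 + 4 frames
```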

The training command was:

```bash
./pruned_transducer_stateless7_streaming/train.py \
  --feedforward-dims "1024,1024,2048,2048,1024" \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \
  --max-duration 375 \
  --transcript-mode disfluent \
  --lang data/lang_char \
  --manifest-dir /mnt/host/corpus/csj/fbank \
  --pad-feature 30 \
  --musan-dir /mnt/host/corpus/musan/musan/fbank
```
The simulated streaming decoding command was:

```bash
for chunk in 64 32; do
  for m in greedy_search fast_beam_search modified_beam_search; do
    python pruned_transducer_stateless7_streaming/decode.py \
      --feedforward-dims "1024,1024,2048,2048,1024" \
      --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \
      --epoch 30 \
      --avg 17 \
      --max-duration 350 \
      --decoding-method $m \
      --manifest-dir /mnt/host/corpus/csj/fbank \
      --lang data/lang_char \
      --transcript-mode disfluent \
      --res-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30/github/sim_"$chunk"_"$m" \
      --decode-chunk-len $chunk \
      --pad-feature 30 \
      --gpu 0
  done
done
```
The streaming chunk-wise decoding command was:

```bash
for chunk in 64 32; do
  for m in greedy_search fast_beam_search modified_beam_search; do
    python pruned_transducer_stateless7_streaming/streaming_decode.py \
      --feedforward-dims "1024,1024,2048,2048,1024" \
      --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \
      --epoch 30 \
      --avg 17 \
      --max-duration 350 \
      --decoding-method $m \
      --manifest-dir /mnt/host/corpus/csj/fbank \
      --lang data/lang_char \
      --transcript-mode disfluent \
      --res-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30/github/stream_"$chunk"_"$m" \
      --decode-chunk-len $chunk \
      --gpu 2 \
      --num-decode-streams 40
  done
done
```

#### training on the fluent transcript

The CERs are:

| decoding method | chunk size | eval1 | eval2 | eval3 | excluded | valid | average | decoding mode |
| --------------- | ---------- | ----- | ----- | ----- | -------- | ----- | ------- | ------------- |
| fast beam search | 320ms | 4.19 | 3.63 | 3.77 | 4.43 | 4.09 | --epoch 30 --avg 12 | simulated streaming |
| fast beam search | 320ms | 4.06 | 3.55 | 3.66 | 4.70 | 4.04 | --epoch 30 --avg 12 | chunk-wise |
| greedy search | 320ms | 4.22 | 3.62 | 3.82 | 4.45 | 3.98 | --epoch 30 --avg 12 | simulated streaming |
| greedy search | 320ms | 4.13 | 3.61 | 3.85 | 4.67 | 4.05 | --epoch 30 --avg 12 | chunk-wise |
| modified beam search | 320ms | 4.02 | 3.43 | 3.62 | 4.43 | 3.81 | --epoch 30 --avg 12 | simulated streaming |
| modified beam search | 320ms | 3.97 | 3.43 | 3.59 | 4.99 | 3.88 | --epoch 30 --avg 12 | chunk-wise |
| fast beam search | 640ms | 3.80 | 3.31 | 3.55 | 4.16 | 3.90 | --epoch 30 --avg 12 | simulated streaming |
| fast beam search | 640ms | 3.81 | 3.34 | 3.46 | 4.58 | 3.85 | --epoch 30 --avg 12 | chunk-wise |
| greedy search | 640ms | 3.92 | 3.38 | 3.65 | 4.31 | 3.88 | --epoch 30 --avg 12 | simulated streaming |
| greedy search | 640ms | 3.98 | 3.38 | 3.64 | 4.54 | 4.01 | --epoch 30 --avg 12 | chunk-wise |
| modified beam search | 640ms | 3.72 | 3.26 | 3.39 | 4.10 | 3.65 | --epoch 30 --avg 12 | simulated streaming |
| modified beam search | 640ms | 3.78 | 3.32 | 3.45 | 4.81 | 3.81 | --epoch 30 --avg 12 | chunk-wise |

Note: `simulated streaming` means the full utterance is fed at once during decoding with `decode.py`,
while `chunk-wise` means a fixed number of frames is fed at a time with `streaming_decode.py`.

The training command was:

```bash
./pruned_transducer_stateless7_streaming/train.py \
  --feedforward-dims "1024,1024,2048,2048,1024" \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30 \
  --max-duration 375 \
  --transcript-mode fluent \
  --lang data/lang_char \
  --manifest-dir /mnt/host/corpus/csj/fbank \
  --pad-feature 30 \
  --musan-dir /mnt/host/corpus/musan/musan/fbank
```
The simulated streaming decoding command was:

```bash
for chunk in 64 32; do
  for m in greedy_search fast_beam_search modified_beam_search; do
    python pruned_transducer_stateless7_streaming/decode.py \
      --feedforward-dims "1024,1024,2048,2048,1024" \
      --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30 \
      --epoch 30 \
      --avg 12 \
      --max-duration 350 \
      --decoding-method $m \
      --manifest-dir /mnt/host/corpus/csj/fbank \
      --lang data/lang_char \
      --transcript-mode fluent \
      --res-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30/github/sim_"$chunk"_"$m" \
      --decode-chunk-len $chunk \
      --pad-feature 30 \
      --gpu 1
  done
done
```
The streaming chunk-wise decoding command was:

```bash
for chunk in 64 32; do
  for m in greedy_search fast_beam_search modified_beam_search; do
    python pruned_transducer_stateless7_streaming/streaming_decode.py \
      --feedforward-dims "1024,1024,2048,2048,1024" \
      --exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30 \
      --epoch 30 \
      --avg 12 \
      --max-duration 350 \
      --decoding-method $m \
      --manifest-dir /mnt/host/corpus/csj/fbank \
      --lang data/lang_char \
      --transcript-mode fluent \
      --res-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30/github/stream_"$chunk"_"$m" \
      --decode-chunk-len $chunk \
      --gpu 3 \
      --num-decode-streams 40
  done
done
```

#### Comparing disfluent to fluent

$$ \texttt{CER}^{f}_d = \frac{\texttt{sub}_f + \texttt{ins} + \texttt{del}_f}{N_f} $$

This comparison evaluates the disfluent model against the fluent transcript (calculated by `disfluent_recogs_to_fluent.py`), forgiving the disfluent model's mistakes on fillers and partial words. It is meant as an illustrative metric only, so that the disfluent and fluent models can be compared.

| decoding method | chunk size | eval1 (d vs f) | eval2 (d vs f) | eval3 (d vs f) | excluded (d vs f) | valid (d vs f) | decoding mode |
| --------------- | ---------- | -------------- | -------------- | -------------- | ----------------- | -------------- | ------------- |
| fast beam search | 320ms | 4.54 vs 4.19 | 3.44 vs 3.63 | 3.56 vs 3.77 | 4.22 vs 4.43 | 4.22 vs 4.09 | simulated streaming |
| fast beam search | 320ms | 4.48 vs 4.06 | 3.41 vs 3.55 | 3.65 vs 3.66 | 4.26 vs 4.70 | 4.08 vs 4.04 | chunk-wise |
| greedy search | 320ms | 4.53 vs 4.22 | 3.48 vs 3.62 | 3.69 vs 3.82 | 4.38 vs 4.45 | 4.05 vs 3.98 | simulated streaming |
| greedy search | 320ms | 4.53 vs 4.13 | 3.46 vs 3.61 | 3.71 vs 3.85 | 4.48 vs 4.67 | 4.12 vs 4.05 | chunk-wise |
| modified beam search | 320ms | 4.45 vs 4.02 | 3.38 vs 3.43 | 3.57 vs 3.62 | 4.19 vs 4.43 | 4.04 vs 3.81 | simulated streaming |
| modified beam search | 320ms | 4.44 vs 3.97 | 3.47 vs 3.43 | 3.56 vs 3.59 | 4.28 vs 4.99 | 4.04 vs 3.88 | chunk-wise |
| fast beam search | 640ms | 4.14 vs 3.80 | 3.12 vs 3.31 | 3.38 vs 3.55 | 3.72 vs 4.16 | 3.81 vs 3.90 | simulated streaming |
| fast beam search | 640ms | 4.05 vs 3.81 | 3.23 vs 3.34 | 3.36 vs 3.46 | 3.65 vs 4.58 | 3.78 vs 3.85 | chunk-wise |
| greedy search | 640ms | 4.10 vs 3.92 | 3.17 vs 3.38 | 3.50 vs 3.65 | 3.87 vs 4.31 | 3.77 vs 3.88 | simulated streaming |
| greedy search | 640ms | 4.41 vs 3.98 | 3.56 vs 3.38 | 3.69 vs 3.64 | 4.26 vs 4.54 | 4.16 vs 4.01 | chunk-wise |
| modified beam search | 640ms | 4.00 vs 3.72 | 3.08 vs 3.26 | 3.33 vs 3.39 | 3.75 vs 4.10 | 3.71 vs 3.65 | simulated streaming |
| modified beam search | 640ms | 5.05 vs 3.78 | 4.22 vs 3.32 | 4.26 vs 3.45 | 5.02 vs 4.81 | 4.73 vs 3.81 | chunk-wise |
| average (d - f) | | 0.43 | -0.02 | -0.02 | -0.34 | 0.13 | |
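The metric above can be sketched as a one-line computation. The counts below are made up for illustration; the real substitution, insertion, and deletion counts against the fluent reference come from `disfluent_recogs_to_fluent.py`:

```python
# Minimal sketch of CER^f_d: the disfluent model's hypotheses scored
# against the fluent reference, where sub_f, del_f and N_f are counted
# on the fluent transcript. Example counts are hypothetical.

def cer_d_vs_f(sub_f: int, ins: int, del_f: int, n_f: int) -> float:
    """CER^f_d = (sub_f + ins + del_f) / N_f, expressed as a percentage."""
    return 100 * (sub_f + ins + del_f) / n_f


print(cer_d_vs_f(sub_f=30, ins=10, del_f=5, n_f=1000))  # 4.5
```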
+94
import argparse
import logging
from configparser import ConfigParser
from pathlib import Path
from typing import List, Tuple

from lhotse import CutSet, SupervisionSet
from lhotse.recipes.csj import CSJSDBParser

ARGPARSE_DESCRIPTION = """
This script adds transcript modes to an existing CutSet or SupervisionSet.
"""


def get_args():
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
        description=ARGPARSE_DESCRIPTION,
    )
    parser.add_argument(
        "-f",
        "--fbank-dir",
        type=Path,
        help="Path to directory where manifests are stored.",
    )
    parser.add_argument(
        "-c",
        "--config",
        type=Path,
        nargs="+",
        help="Path to config file for transcript parsing.",
    )
    return parser.parse_args()


def get_CSJParsers(config_files: List[Path]) -> List[Tuple[str, CSJSDBParser]]:
    """Build one (mode name, parser) pair per config file."""
    parsers = []
    for config_file in config_files:
        config = ConfigParser()
        # Preserve the case of option names; ConfigParser lowercases them by default.
        config.optionxform = str
        assert config.read(config_file), f"{config_file} could not be found."
        decisions = {}
        for k, v in config["DECISIONS"].items():
            try:
                decisions[k] = int(v)
            except ValueError:
                decisions[k] = v
        parsers.append(
            (config["CONSTANTS"].get("MODE"), CSJSDBParser(decisions=decisions))
        )
    return parsers


def main():
    args = get_args()
    logging.basicConfig(
        format=("%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"),
        level=logging.INFO,
    )
    parsers = get_CSJParsers(args.config)
    logging.info(f"Adding {', '.join(x[0] for x in parsers)} transcript mode.")

    # Materialize the glob so the emptiness assert actually checks something
    # (a generator would always be truthy).
    manifests = list(args.fbank_dir.glob("csj_cuts_*.jsonl.gz"))
    assert manifests, f"No cuts to be found in {args.fbank_dir}"

    for manifest in manifests:
        results = []
        logging.info(f"Adding transcript modes to {manifest.name} now.")
        cutset = CutSet.from_file(manifest)
        for cut in cutset:
            for name, parser in parsers:
                cut.supervisions[0].custom[name] = parser.parse(
                    cut.supervisions[0].custom["raw"]
                )
            cut.supervisions[0].text = ""
            results.append(cut)
        results = CutSet.from_items(results)
        # Back up the original manifest before writing the new one in its place.
        res_file = manifest.as_posix()
        manifest.replace(manifest.parent / ("bak." + manifest.name))
        results.to_file(res_file)


if __name__ == "__main__":
    main()
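The config handling in `get_CSJParsers` can be exercised in isolation. The `[CONSTANTS]`/`[DECISIONS]` section names and the `MODE` key mirror the script above; the specific option names and values in this sample config are hypothetical:

```python
# Sketch of the config parsing in get_CSJParsers: option names keep their
# case (optionxform = str) and integer-looking DECISIONS values become
# ints. The example keys KeepFiller/TagSymbol are made up for illustration.
from configparser import ConfigParser

sample = """
[CONSTANTS]
MODE = disfluent

[DECISIONS]
KeepFiller = 1
TagSymbol = <sos>
"""

config = ConfigParser()
config.optionxform = str  # preserve key case, as in the script above
config.read_string(sample)

decisions = {}
for k, v in config["DECISIONS"].items():
    try:
        decisions[k] = int(v)
    except ValueError:
        decisions[k] = v

print(config["CONSTANTS"].get("MODE"))  # disfluent
print(decisions)  # {'KeepFiller': 1, 'TagSymbol': '<sos>'}
```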
