CSJ pruned_transducer_stateless7_streaming #892

Merged 27 commits on Feb 13, 2023.
11 changes: 11 additions & 0 deletions egs/csj/ASR/README.md
# Introduction

[./RESULTS.md](./RESULTS.md) contains the latest results.

# Transducers

These are the types of architectures currently available.

| | Encoder | Decoder | Comment |
|---------------------------------------|---------------------|--------------------|---------------------------------------------------|
| `pruned_transducer_stateless7_streaming` | Streaming Zipformer | Embedding + Conv1d | Adapted from librispeech pruned_transducer_stateless7_streaming |
200 changes: 200 additions & 0 deletions egs/csj/ASR/RESULTS.md
# Results

## Streaming Zipformer-Transducer (Pruned Stateless Transducer + Streaming Zipformer)

### [pruned_transducer_stateless7_streaming](./pruned_transducer_stateless7_streaming)

See <https://github.com/k2-fsa/icefall/pull/892> for more details.

You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/TeoWenShen/icefall-asr-csj-pruned-transducer-stateless7-streaming-230208>

Number of model parameters: 75688409, i.e. 75.7M.

#### training on disfluent transcript

The CERs are:

| decoding method | chunk size | eval1 | eval2 | eval3 | excluded | valid | average | decoding mode |
| --------------- | ---------- | ----- | ----- | ----- | -------- | ----- | ------- | ------------- |
| fast beam search | 320ms | 5.39 | 4.08 | 4.16 | 5.4 | 5.02 | --epoch 30 --avg 17 | simulated streaming |
| fast beam search | 320ms | 5.34 | 4.1 | 4.26 | 5.61 | 4.91 | --epoch 30 --avg 17 | chunk-wise |
| greedy search | 320ms | 5.43 | 4.14 | 4.31 | 5.48 | 4.88 | --epoch 30 --avg 17 | simulated streaming |
| greedy search | 320ms | 5.44 | 4.14 | 4.39 | 5.7 | 4.98 | --epoch 30 --avg 17 | chunk-wise |
| modified beam search | 320ms | 5.2 | 3.95 | 4.09 | 5.12 | 4.75 | --epoch 30 --avg 17 | simulated streaming |
| modified beam search | 320ms | 5.18 | 4.07 | 4.12 | 5.36 | 4.77 | --epoch 30 --avg 17 | chunk-wise |
| fast beam search | 640ms | 5.01 | 3.78 | 3.96 | 4.85 | 4.6 | --epoch 30 --avg 17 | simulated streaming |
| fast beam search | 640ms | 4.97 | 3.88 | 3.96 | 4.91 | 4.61 | --epoch 30 --avg 17 | chunk-wise |
| greedy search | 640ms | 5.02 | 3.84 | 4.14 | 5.02 | 4.59 | --epoch 30 --avg 17 | simulated streaming |
| greedy search | 640ms | 5.32 | 4.22 | 4.33 | 5.39 | 4.99 | --epoch 30 --avg 17 | chunk-wise |
| modified beam search | 640ms | 4.78 | 3.66 | 3.85 | 4.72 | 4.42 | --epoch 30 --avg 17 | simulated streaming |
| modified beam search | 640ms | 5.77 | 4.72 | 4.73 | 5.85 | 5.36 | --epoch 30 --avg 17 | chunk-wise |

Note: `simulated streaming` means the full utterance is fed at once during decoding with `decode.py`,
while `chunk-wise` means a fixed number of frames is fed at a time with `streaming_decode.py`.
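
The chunk sizes in the table (320ms / 640ms) correspond to the `--decode-chunk-len` values (32 / 64) used in the commands below. A minimal sketch of that conversion, assuming the standard 10 ms fbank frame shift:

```python
# Relate --decode-chunk-len (in feature frames) to the chunk sizes
# reported in the table above, assuming a 10 ms fbank frame shift.
FRAME_SHIFT_MS = 10

def chunk_ms(decode_chunk_len: int) -> int:
    """Convert a chunk length in feature frames to milliseconds."""
    return decode_chunk_len * FRAME_SHIFT_MS

for frames in (32, 64):
    print(f"--decode-chunk-len {frames} -> {chunk_ms(frames)} ms")
```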

The training command was:
```bash
./pruned_transducer_stateless7_streaming/train.py \
--feedforward-dims "1024,1024,2048,2048,1024" \
--world-size 8 \
--num-epochs 30 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \
--max-duration 375 \
--transcript-mode disfluent \
--lang data/lang_char \
--manifest-dir /mnt/host/corpus/csj/fbank \
--pad-feature 30 \
--musan-dir /mnt/host/corpus/musan/musan/fbank
```

The simulated streaming decoding command was:
```bash
for chunk in 64 32; do
for m in greedy_search fast_beam_search modified_beam_search; do
python pruned_transducer_stateless7_streaming/decode.py \
--feedforward-dims "1024,1024,2048,2048,1024" \
--exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \
--epoch 30 \
--avg 17 \
--max-duration 350 \
--decoding-method $m \
--manifest-dir /mnt/host/corpus/csj/fbank \
--lang data/lang_char \
--transcript-mode disfluent \
--res-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30/github/sim_"$chunk"_"$m" \
--decode-chunk-len $chunk \
--pad-feature 30 \
--gpu 0
done
done
```

The streaming chunk-wise decoding command was:
```bash
for chunk in 64 32; do
for m in greedy_search fast_beam_search modified_beam_search; do
python pruned_transducer_stateless7_streaming/streaming_decode.py \
--feedforward-dims "1024,1024,2048,2048,1024" \
--exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \
--epoch 30 \
--avg 17 \
--max-duration 350 \
--decoding-method $m \
--manifest-dir /mnt/host/corpus/csj/fbank \
--lang data/lang_char \
--transcript-mode disfluent \
--res-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30/github/stream_"$chunk"_"$m" \
--decode-chunk-len $chunk \
--gpu 2 \
--num-decode-streams 40
done
done
```

#### training on fluent transcript

The CERs are:

| decoding method | chunk size | eval1 | eval2 | eval3 | excluded | valid | average | decoding mode |
| --------------- | ---------- | ----- | ----- | ----- | -------- | ----- | ------- | ------------- |
| fast beam search | 320ms | 4.19 | 3.63 | 3.77 | 4.43 | 4.09 | --epoch 30 --avg 12 | simulated streaming |
| fast beam search | 320ms | 4.06 | 3.55 | 3.66 | 4.70 | 4.04 | --epoch 30 --avg 12 | chunk-wise |
| greedy search | 320ms | 4.22 | 3.62 | 3.82 | 4.45 | 3.98 | --epoch 30 --avg 12 | simulated streaming |
| greedy search | 320ms | 4.13 | 3.61 | 3.85 | 4.67 | 4.05 | --epoch 30 --avg 12 | chunk-wise |
| modified beam search | 320ms | 4.02 | 3.43 | 3.62 | 4.43 | 3.81 | --epoch 30 --avg 12 | simulated streaming |
| modified beam search | 320ms | 3.97 | 3.43 | 3.59 | 4.99 | 3.88 | --epoch 30 --avg 12 | chunk-wise |
| fast beam search | 640ms | 3.80 | 3.31 | 3.55 | 4.16 | 3.90 | --epoch 30 --avg 12 | simulated streaming |
| fast beam search | 640ms | 3.81 | 3.34 | 3.46 | 4.58 | 3.85 | --epoch 30 --avg 12 | chunk-wise |
| greedy search | 640ms | 3.92 | 3.38 | 3.65 | 4.31 | 3.88 | --epoch 30 --avg 12 | simulated streaming |
| greedy search | 640ms | 3.98 | 3.38 | 3.64 | 4.54 | 4.01 | --epoch 30 --avg 12 | chunk-wise |
| modified beam search | 640ms | 3.72 | 3.26 | 3.39 | 4.10 | 3.65 | --epoch 30 --avg 12 | simulated streaming |
| modified beam search | 640ms | 3.78 | 3.32 | 3.45 | 4.81 | 3.81 | --epoch 30 --avg 12 | chunk-wise |

Note: `simulated streaming` means the full utterance is fed at once during decoding with `decode.py`,
while `chunk-wise` means a fixed number of frames is fed at a time with `streaming_decode.py`.

The training command was:
```bash
./pruned_transducer_stateless7_streaming/train.py \
--feedforward-dims "1024,1024,2048,2048,1024" \
--world-size 8 \
--num-epochs 30 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30 \
--max-duration 375 \
--transcript-mode fluent \
--lang data/lang_char \
--manifest-dir /mnt/host/corpus/csj/fbank \
--pad-feature 30 \
--musan-dir /mnt/host/corpus/musan/musan/fbank
```

The simulated streaming decoding command was:
```bash
for chunk in 64 32; do
for m in greedy_search fast_beam_search modified_beam_search; do
python pruned_transducer_stateless7_streaming/decode.py \
--feedforward-dims "1024,1024,2048,2048,1024" \
--exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30 \
--epoch 30 \
--avg 12 \
--max-duration 350 \
--decoding-method $m \
--manifest-dir /mnt/host/corpus/csj/fbank \
--lang data/lang_char \
--transcript-mode fluent \
--res-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30/github/sim_"$chunk"_"$m" \
--decode-chunk-len $chunk \
--pad-feature 30 \
--gpu 1
done
done
```

The streaming chunk-wise decoding command was:
```bash
for chunk in 64 32; do
for m in greedy_search fast_beam_search modified_beam_search; do
python pruned_transducer_stateless7_streaming/streaming_decode.py \
--feedforward-dims "1024,1024,2048,2048,1024" \
--exp-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30 \
--epoch 30 \
--avg 12 \
--max-duration 350 \
--decoding-method $m \
--manifest-dir /mnt/host/corpus/csj/fbank \
--lang data/lang_char \
--transcript-mode fluent \
--res-dir pruned_transducer_stateless7_streaming/exp_fluent_2_pad30/github/stream_"$chunk"_"$m" \
--decode-chunk-len $chunk \
--gpu 3 \
--num-decode-streams 40
done
done
```

#### Comparing disfluent to fluent

$$ \texttt{CER}^{f}_d = \frac{\texttt{sub}_f + \texttt{ins} + \texttt{del}_f}{N_f} $$

This comparison evaluates the disfluent model against the fluent transcript (computed by `disfluent_recogs_to_fluent.py`), i.e. the disfluent model's mistakes on fillers and partial words are forgiven. Here $\texttt{sub}_f$ and $\texttt{del}_f$ are substitution and deletion counts against the fluent reference, $\texttt{ins}$ is the insertion count, and $N_f$ is the length of the fluent reference. It is meant only as an illustrative metric for comparing the disfluent and fluent models.
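
For reference, plain character-level CER follows the same `(sub + ins + del) / N` shape. The sketch below is a generic edit-distance CER, not the forgiving alignment performed by `disfluent_recogs_to_fluent.py`:

```python
# Illustrative character-level CER: edit distance between reference and
# hypothesis divided by the reference length. This is a generic sketch,
# not the forgiving alignment used by disfluent_recogs_to_fluent.py.
def cer(ref: str, hyp: str) -> float:
    m, n = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution / match
            )
    return dp[m][n] / max(m, 1)
```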

| decoding method | chunk size | eval1 (d vs f) | eval2 (d vs f) | eval3 (d vs f) | excluded (d vs f) | valid (d vs f) | decoding mode |
| --------------- | ---------- | -------------- | --------------- | -------------- | -------------------- | --------------- | ----------- |
| fast beam search | 320ms | 4.54 vs 4.19 | 3.44 vs 3.63 | 3.56 vs 3.77 | 4.22 vs 4.43 | 4.22 vs 4.09 | simulated streaming |
| fast beam search | 320ms | 4.48 vs 4.06 | 3.41 vs 3.55 | 3.65 vs 3.66 | 4.26 vs 4.7 | 4.08 vs 4.04 | chunk-wise |
| greedy search | 320ms | 4.53 vs 4.22 | 3.48 vs 3.62 | 3.69 vs 3.82 | 4.38 vs 4.45 | 4.05 vs 3.98 | simulated streaming |
| greedy search | 320ms | 4.53 vs 4.13 | 3.46 vs 3.61 | 3.71 vs 3.85 | 4.48 vs 4.67 | 4.12 vs 4.05 | chunk-wise |
| modified beam search | 320ms | 4.45 vs 4.02 | 3.38 vs 3.43 | 3.57 vs 3.62 | 4.19 vs 4.43 | 4.04 vs 3.81 | simulated streaming |
| modified beam search | 320ms | 4.44 vs 3.97 | 3.47 vs 3.43 | 3.56 vs 3.59 | 4.28 vs 4.99 | 4.04 vs 3.88 | chunk-wise |
| fast beam search | 640ms | 4.14 vs 3.8 | 3.12 vs 3.31 | 3.38 vs 3.55 | 3.72 vs 4.16 | 3.81 vs 3.9 | simulated streaming |
| fast beam search | 640ms | 4.05 vs 3.81 | 3.23 vs 3.34 | 3.36 vs 3.46 | 3.65 vs 4.58 | 3.78 vs 3.85 | chunk-wise |
| greedy search | 640ms | 4.1 vs 3.92 | 3.17 vs 3.38 | 3.5 vs 3.65 | 3.87 vs 4.31 | 3.77 vs 3.88 | simulated streaming |
| greedy search | 640ms | 4.41 vs 3.98 | 3.56 vs 3.38 | 3.69 vs 3.64 | 4.26 vs 4.54 | 4.16 vs 4.01 | chunk-wise |
| modified beam search | 640ms | 4 vs 3.72 | 3.08 vs 3.26 | 3.33 vs 3.39 | 3.75 vs 4.1 | 3.71 vs 3.65 | simulated streaming |
| modified beam search | 640ms | 5.05 vs 3.78 | 4.22 vs 3.32 | 4.26 vs 3.45 | 5.02 vs 4.81 | 4.73 vs 3.81 | chunk-wise |
| average (d - f) | | 0.43 | -0.02 | -0.02 | -0.34 | 0.13 | |
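
The `average (d - f)` row is the mean difference between the disfluent and fluent CERs in each column. Reproducing the eval1 entry from the pairs in the table above:

```python
# Mean (disfluent - fluent) CER difference for the eval1 column of the
# comparison table above.
eval1_pairs = [
    (4.54, 4.19), (4.48, 4.06), (4.53, 4.22), (4.53, 4.13),
    (4.45, 4.02), (4.44, 3.97), (4.14, 3.80), (4.05, 3.81),
    (4.10, 3.92), (4.41, 3.98), (4.00, 3.72), (5.05, 3.78),
]
avg_diff = sum(d - f for d, f in eval1_pairs) / len(eval1_pairs)
print(round(avg_diff, 2))  # matches the 0.43 reported for eval1
```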
94 changes: 94 additions & 0 deletions egs/csj/ASR/local/add_transcript_mode.py
import argparse
import logging
from configparser import ConfigParser
from pathlib import Path
from typing import List, Tuple

from lhotse import CutSet, SupervisionSet
from lhotse.recipes.csj import CSJSDBParser

ARGPARSE_DESCRIPTION = """
This script adds transcript modes to an existing CutSet or SupervisionSet.
"""


def get_args():
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
        description=ARGPARSE_DESCRIPTION,
    )
    parser.add_argument(
        "-f",
        "--fbank-dir",
        type=Path,
        help="Path to directory where manifests are stored.",
    )
    parser.add_argument(
        "-c",
        "--config",
        type=Path,
        nargs="+",
        help="Path to config file(s) for transcript parsing.",
    )
    return parser.parse_args()


def get_CSJParsers(config_files: List[Path]) -> List[Tuple[str, CSJSDBParser]]:
    """Build a (mode name, parser) pair from each config file."""
    parsers = []
    for config_file in config_files:
        config = ConfigParser()
        config.optionxform = str
        assert config.read(config_file), f"{config_file} could not be found."
        decisions = {}
        for k, v in config["DECISIONS"].items():
            try:
                decisions[k] = int(v)
            except ValueError:
                decisions[k] = v
        parsers.append(
            (config["CONSTANTS"].get("MODE"), CSJSDBParser(decisions=decisions))
        )
    return parsers


def main():
    args = get_args()
    logging.basicConfig(
        format=("%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"),
        level=logging.INFO,
    )
    parsers = get_CSJParsers(args.config)

    logging.info(f"Adding {', '.join(x[0] for x in parsers)} transcript mode.")

    # Materialize the glob so the emptiness check below actually works.
    manifests = list(args.fbank_dir.glob("csj_cuts_*.jsonl.gz"))
    assert manifests, f"No cuts found in {args.fbank_dir}"

    for manifest in manifests:
        results = []
        logging.info(f"Adding transcript modes to {manifest.name} now.")
        cutset = CutSet.from_file(manifest)
        for cut in cutset:
            for name, parser in parsers:
                cut.supervisions[0].custom[name] = parser.parse(
                    cut.supervisions[0].custom["raw"]
                )
            cut.supervisions[0].text = ""
            results.append(cut)
        results = CutSet.from_items(results)
        # Back up the original manifest before writing the updated one.
        res_file = manifest.as_posix()
        manifest.replace(manifest.parent / ("bak." + manifest.name))
        results.to_file(res_file)


if __name__ == "__main__":
    main()
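
`get_CSJParsers` expects each config file to have a `CONSTANTS` section with a `MODE` key and a `DECISIONS` section whose values are coerced to `int` when possible. A minimal sketch of that layout and the coercion loop; the keys and values below are hypothetical, not taken from an actual CSJ config:

```python
# Sketch of the config layout add_transcript_mode.py expects. The
# DECISIONS keys/values here are hypothetical placeholders.
from configparser import ConfigParser

EXAMPLE_INI = """
[CONSTANTS]
MODE = fluent

[DECISIONS]
MAX_LEN = 3
FALLBACK = keep
"""

config = ConfigParser()
config.optionxform = str  # preserve key case, as the script does
config.read_string(EXAMPLE_INI)

decisions = {}
for k, v in config["DECISIONS"].items():
    try:
        decisions[k] = int(v)  # numeric decision
    except ValueError:
        decisions[k] = v  # string decision
```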