Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP]Streaming conformer pruned transducer #242

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
## wer with various right context

related model and decoding result/log fils could be found:
https://huggingface.co/GuoLiyong/icefall_streaming_prunned_transducer_stateless/tree/main/streaming_pruned_transducer_stateless/exp

decoding with ctc greedy search:

right_context|1|8|16|32|64|full
--|--|--|--|--|--|--
latency|0.07s|0.35s|0.67s|1.31s|2.59s|*
test_clean|5.60|4.00|3.76|3.75|3.65|3.28|
+20 tailing dummy frames|5.52|3.98|3.75|3.75|3.65|3.28
simulate streaming with chunk_by_chunk decoding|5.52|3.98|3.75|3.75|3.65|3.28
test_other|14.07|10.69|9.80|9.48|9.01|8.05|
+20 tailing dummy frames|14.00|10.69|9.80|9.48|9.0|8.04
simulate streaming with chunk_by_chunk decoding|14.00|10.69|9.80|9.48|9.0|8.04



## How latency is computed?

latency = (subsampling factor * right_context + initialize_frames_need_by_subsampling_convs) * 10ms

During which: subsmapling factor = 4
initialize_frames_need_by_subsampling_convs = 3

To decode the first frame encoder out: 7 frams fbanks = subsampling_factor + initialize_frames_need_by_subsampling_convs are needed.
Once the deocding started, 4 frames fbank are needed per encoder_out frame.


## Why does tailing dummy frames help?

As 4 frames fbank are needed per encoder_out frame, suppose only 3(or 2,1) frames left, after a decoding process.
There will no encoder out frames corresponding to these 3 frames.
This may results in some "substitution/deletion errors" at the end.
By padding some dummy frames to the right, this problem could be solved to some extent.

### Some Examples results:
padding 0 frame|padding 20 frames
--|--
WITH ONE JUMP (ANDERS->ANDREWS) GOT OUT OF HIS (CHAIR->CHA)|WITH ONE JUMP (ANDERS->ANDREWS) GOT OUT OF HIS CHAIR
COME WE'LL HAVE OUR COFFEE IN THE OTHER ROOM AND YOU CAN (SMOKE->SMO)|COME WE'LL HAVE OUR COFFEE IN THE OTHER ROOM AND YOU CAN SMOKE
THINKING OF ALL THIS I WENT TO (SLEEP->SLEE)|THINKING OF ALL THIS I WENT TO SLEEP
STEAM UP AND CANVAS SPREAD THE SCHOONER STARTED (EASTWARDS->EASTWARD)|STEAM UP AND CANVAS SPREAD THE SCHOONER STARTED EASTWARDS

### final Wers and detail error counts :
*|wer|ins|del|sub
--|--|--|--|--
padding 0|5.60|329|283|2332
padding 20 frames|5.52|329|282|2291

Raw log files of previous table:
```
padding 0 frames:
%WER = 5.60
Errors: 329 insertions, 283 deletions, 2332 substitutions, over 52576 reference words (49961 correct)

padding 20 frames:
%WER = 5.52
Errors: 329 insertions, 282 deletions, 2291 substitutions, over 52576 reference words (50003 correct)
```


Empty file.
Loading