Support getting word IDs for CTC HLG decoding. #978

Merged (2 commits) on Jun 6, 2024


csukuangfj
Collaborator

Fixes #805

@w11wo Could you help test this PR with your phone-based CTC model?

Note that the result contains only word IDs (the "words" field). You need to convert the IDs to the actual word symbols yourself.
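One way to do that conversion is to load the word symbol table that accompanies the HLG graph and look up each ID. The following is a minimal sketch, not part of this PR. It assumes a k2/OpenFst-style `words.txt` with one `<symbol> <id>` pair per line. The helper names and the sample table (including its IDs) are hypothetical and are not taken from the real file.

```python
import io


def load_word_table(f):
    """Read '<symbol> <id>' lines into an id -> symbol dict (assumed format)."""
    id2word = {}
    for line in f:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        symbol, idx = fields
        id2word[int(idx)] = symbol
    return id2word


def ids_to_text(word_ids, id2word, unk="<UNK>"):
    """Map word IDs to symbols; unknown IDs become the `unk` placeholder."""
    return " ".join(id2word.get(i, unk) for i in word_ids)


# Hypothetical symbol table; a real words.txt has different entries and IDs.
sample = io.StringIO("<eps> 0\nA 1\nHELLO 2\nWORLD 3\n")
id2word = load_word_table(sample)
print(ids_to_text([2, 3], id2word))
```

With a real model directory, the table would be read from the file instead, e.g. `with open("words.txt", encoding="utf-8") as f: id2word = load_word_table(f)` (the path is an assumption).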


Usage

Non-streaming CTC HLG decoding

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-zipformer-ctc-en-2023-10-02.tar.bz2
tar xvf sherpa-onnx-zipformer-ctc-en-2023-10-02.tar.bz2
rm sherpa-onnx-zipformer-ctc-en-2023-10-02.tar.bz2
./build/bin/sherpa-onnx-offline \
  --model-type=zipformer2_ctc \
  --ctc.graph=./sherpa-onnx-zipformer-ctc-en-2023-10-02/HLG.fst \
  --zipformer-ctc-model=./sherpa-onnx-zipformer-ctc-en-2023-10-02/model.int8.onnx \
  --tokens=./sherpa-onnx-zipformer-ctc-en-2023-10-02/tokens.txt \
  ./sherpa-onnx-zipformer-ctc-en-2023-10-02/test_wavs/0.wav \
  ./sherpa-onnx-zipformer-ctc-en-2023-10-02/test_wavs/1.wav

The output is

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx-offline --model-type=zipformer2_ctc --ctc.graph=./sherpa-onnx-zipformer-ctc-en-2023-10-02/HLG.fst --zipformer-ctc-model=./sherpa-onnx-zipformer-ctc-en-2023-10-02/model.int8.onnx --tokens=./sherpa-onnx-zipformer-ctc-en-2023-10-02/tokens.txt ./sherpa-onnx-zipformer-ctc-en-2023-10-02/test_wavs/0.wav ./sherpa-onnx-zipformer-ctc-en-2023-10-02/test_wavs/1.wav 

OfflineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OfflineModelConfig(transducer=OfflineTransducerModelConfig(encoder_filename="", decoder_filename="", joiner_filename=""), paraformer=OfflineParaformerModelConfig(model=""), nemo_ctc=OfflineNemoEncDecCtcModelConfig(model=""), whisper=OfflineWhisperModelConfig(encoder="", decoder="", language="", task="transcribe", tail_paddings=-1), tdnn=OfflineTdnnModelConfig(model=""), zipformer_ctc=OfflineZipformerCtcModelConfig(model="./sherpa-onnx-zipformer-ctc-en-2023-10-02/model.int8.onnx"), wenet_ctc=OfflineWenetCtcModelConfig(model=""), telespeech_ctc="", tokens="./sherpa-onnx-zipformer-ctc-en-2023-10-02/tokens.txt", num_threads=2, debug=False, provider="cpu", model_type="zipformer2_ctc", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OfflineLMConfig(model="", scale=0.5), ctc_fst_decoder_config=OfflineCtcFstDecoderConfig(graph="./sherpa-onnx-zipformer-ctc-en-2023-10-02/HLG.fst", max_active=3000), decoding_method="greedy_search", max_active_paths=4, hotwords_file="", hotwords_score=1.5, blank_penalty=0)
Creating recognizer ...
Started
Done!

./sherpa-onnx-zipformer-ctc-en-2023-10-02/test_wavs/0.wav
{"text": " AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS", "timestamps": [0.28, 0.64, 0.76, 0.92, 1.12, 1.36, 1.44, 1.56, 1.72, 1.84, 1.96, 2.08, 2.20, 2.32, 2.40, 2.48, 2.60, 2.80, 3.04, 3.28, 3.40, 3.52, 3.72, 4.08, 4.28, 4.36, 4.52, 4.64, 4.80, 4.84, 4.96, 5.08, 5.28, 5.40, 5.56, 5.60, 5.76, 5.92, 6.04], "tokens":[" AFTER", " E", "AR", "LY", " NIGHT", "F", "A", "LL", " THE", " YE", "LL", "OW", " LA", "M", "P", "S", " WOULD", " LIGHT", " UP", " HE", "RE", " AND", " THERE", " THE", " S", "QUA", "LI", "D", " ", "QUA", "R", "TER", " OF", " THE", " B", "RO", "TH", "EL", "S"], "words": [2191, 52770, 121894, 175861, 198295, 98505, 197114, 101964, 186768, 80190, 5411, 176144, 175861, 166824, 142471, 124782, 175861, 22700]}
----
./sherpa-onnx-zipformer-ctc-en-2023-10-02/test_wavs/1.wav
{"text": " GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOR EVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN", "timestamps": [0.24, 0.36, 0.64, 0.80, 0.96, 1.08, 1.16, 1.20, 1.36, 1.56, 1.68, 1.76, 1.88, 2.04, 2.20, 2.32, 2.44, 2.64, 2.88, 3.16, 3.28, 3.48, 3.56, 3.72, 3.88, 4.20, 4.44, 4.60, 4.76, 4.96, 5.16, 5.40, 5.64, 6.20, 6.32, 6.56, 6.92, 7.16, 7.36, 7.60, 7.96, 8.20, 8.28, 8.40, 8.48, 8.64, 8.76, 8.88, 9.08, 9.28, 9.44, 9.52, 9.60, 9.72, 9.88, 10.00, 10.12, 10.56, 10.72, 10.88, 11.08, 11.24, 11.40, 11.56, 11.76, 12.00, 12.08, 12.16, 12.28, 12.52, 12.72, 12.84, 12.92, 13.04, 13.16, 13.48, 13.72, 13.88, 14.04, 14.16, 14.28, 14.40, 14.56, 14.72, 14.80, 15.00, 15.28, 15.44, 15.68, 15.92, 16.04, 16.12, 16.24], "tokens":[" GO", "D", " AS", " A", " DI", "RE", "C", "T", " CON", "SE", "QUE", "N", "CE", " OF", " THE", " S", "IN", " WHICH", " MAN", " TH", "US", " P", "UN", "ISH", "ED", " HAD", " GIVE", "N", " HER", " A", " LOVE", "LY", " CHILD", " WHO", "SE", " PLACE", " WAS", " ON", " THAT", " SAME", " DIS", "HO", "N", "OR", "ED", " BO", "S", "OM", " TO", " CON", "NE", "C", "T", " HER", " P", "AR", "ENT", " FOR", " E", "VER", " WITH", " THE", " RA", "CE", " AND", " DE", "S", "C", "ENT", " OF", " MO", "R", "T", "AL", "S", " AND", " TO", " BE", " FI", "N", "AL", "LY", " A", " B", "LESS", "ED", " SO", "UL", " IN", " HE", "A", "VE", "N"], "words": [71551, 8537, 4, 47736, 36616, 124782, 175861, 161552, 194419, 107839, 177270, 141612, 76098, 70667, 80081, 4, 104488, 31039, 194902, 135523, 192633, 125555, 175810, 153596, 48289, 20159, 178313, 36439, 80081, 129836, 64424, 58111, 196099, 175861, 143255, 5411, 45974, 124782, 117231, 5411, 178313, 13845, 62319, 4, 18136, 165122, 86219, 79148]}
----
num threads: 2
decoding method: greedy_search
Elapsed seconds: 1.066 s
Real time factor (RTF): 1.066 / 23.340 = 0.046
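Each result line above is a JSON object, so the new "words" field can be read with a standard JSON parser. The sketch below uses an abbreviated copy of the 0.wav result (timestamps and tokens omitted) and pairs each word ID with the word at the same position in "text". Pairing by position assumes one ID per whitespace-separated word, which holds for this output but is an observation rather than a documented guarantee.

```python
import json

# Abbreviated copy of the 0.wav result line ("timestamps" and "tokens" omitted).
result_line = (
    '{"text": " AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE '
    'AND THERE THE SQUALID QUARTER OF THE BROTHELS", '
    '"words": [2191, 52770, 121894, 175861, 198295, 98505, 197114, 101964, '
    '186768, 80190, 5411, 176144, 175861, 166824, 142471, 124782, 175861, 22700]}'
)

result = json.loads(result_line)
words = result["text"].split()
word_ids = result["words"]

# Sanity check: the number of IDs matches the number of words in the text.
if len(words) != len(word_ids):
    raise ValueError("word count and word ID count differ")

pairs = list(zip(words, word_ids))
for word, word_id in pairs[:4]:
    print(word_id, word)
```

The same ID shows up wherever the same word repeats (175861 at every "THE"), which is a quick way to check that IDs and text line up.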

Streaming CTC HLG decoding

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-ctc-small-2024-03-18.tar.bz2
tar xvf sherpa-onnx-streaming-zipformer-ctc-small-2024-03-18.tar.bz2
rm sherpa-onnx-streaming-zipformer-ctc-small-2024-03-18.tar.bz2
./build/bin/sherpa-onnx \
  --zipformer2-ctc-model=./sherpa-onnx-streaming-zipformer-ctc-small-2024-03-18/ctc-epoch-30-avg-3-chunk-16-left-128.int8.onnx \
  --ctc-graph=./sherpa-onnx-streaming-zipformer-ctc-small-2024-03-18/HLG.fst \
  --tokens=./sherpa-onnx-streaming-zipformer-ctc-small-2024-03-18/tokens.txt \
  ./sherpa-onnx-streaming-zipformer-ctc-small-2024-03-18/test_wavs/0.wav \
  ./sherpa-onnx-streaming-zipformer-ctc-small-2024-03-18/test_wavs/1.wav

The output is

/Users/fangjun/open-source/sherpa-onnx/sherpa-onnx/csrc/parse-options.cc:Read:361 ./build/bin/sherpa-onnx --zipformer2-ctc-model=./sherpa-onnx-streaming-zipformer-ctc-small-2024-03-18/ctc-epoch-30-avg-3-chunk-16-left-128.int8.onnx --ctc-graph=./sherpa-onnx-streaming-zipformer-ctc-small-2024-03-18/HLG.fst --tokens=./sherpa-onnx-streaming-zipformer-ctc-small-2024-03-18/tokens.txt ./sherpa-onnx-streaming-zipformer-ctc-small-2024-03-18/test_wavs/0.wav ./sherpa-onnx-streaming-zipformer-ctc-small-2024-03-18/test_wavs/1.wav 

OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder="", decoder="", joiner=""), paraformer=OnlineParaformerModelConfig(encoder="", decoder=""), wenet_ctc=OnlineWenetCtcModelConfig(model="", chunk_size=16, num_left_chunks=4), zipformer2_ctc=OnlineZipformer2CtcModelConfig(model="./sherpa-onnx-streaming-zipformer-ctc-small-2024-03-18/ctc-epoch-30-avg-3-chunk-16-left-128.int8.onnx"), nemo_ctc=OnlineNeMoCtcModelConfig(model=""), tokens="./sherpa-onnx-streaming-zipformer-ctc-small-2024-03-18/tokens.txt", num_threads=1, warm_up=0, debug=False, provider="cpu", model_type="", modeling_unit="cjkchar", bpe_vocab=""), lm_config=OnlineLMConfig(model="", scale=0.5), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=2.4, min_utterance_length=0), rule2=EndpointRule(must_contain_nonsilence=True, min_trailing_silence=1.2, min_utterance_length=0), rule3=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=0, min_utterance_length=20)), ctc_fst_decoder_config=OnlineCtcFstDecoderConfig(graph="./sherpa-onnx-streaming-zipformer-ctc-small-2024-03-18/HLG.fst", max_active=3000), enable_endpoint=True, max_active_paths=4, hotwords_score=1.5, hotwords_file="", decoding_method="greedy_search", blank_penalty=0, temperature_scale=2)
./sherpa-onnx-streaming-zipformer-ctc-small-2024-03-18/test_wavs/0.wav
Elapsed seconds: 0.78, Real time factor (RTF): 0.12
 AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
{ "text": " AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS", "tokens": [" AFTER", " E", "AR", "LY", " NIGHT", "F", "A", "LL", " THE", " YE", "LL", "OW", " LA", "M", "P", "S", " WOULD", " LIGHT", " UP", " HE", "RE", " AND", " THERE", " THE", " S", "QUA", "LI", "D", " ", "QUA", "R", "TER", " OF", " THE", " B", "RO", "TH", "EL", "S"], "timestamps": [0.68, 1.04, 1.12, 1.24, 1.48, 1.72, 1.80, 1.88, 2.04, 2.20, 2.28, 2.36, 2.52, 2.60, 2.68, 2.76, 2.88, 3.08, 3.36, 3.60, 3.68, 3.84, 4.04, 4.40, 4.64, 4.68, 4.80, 4.92, 5.12, 5.16, 5.28, 5.40, 5.56, 5.68, 5.84, 5.92, 6.00, 6.16, 6.32], "ys_probs": [], "lm_probs": [], "context_scores": [], "segment": 0, "words": [2191, 52770, 121894, 175861, 198295, 98505, 197114, 101964, 186768, 80190, 5411, 176144, 175861, 166824, 142471, 124782, 175861, 22700], "start_time": 0.00, "is_final": false}

./sherpa-onnx-streaming-zipformer-ctc-small-2024-03-18/test_wavs/1.wav
Elapsed seconds: 1.4, Real time factor (RTF): 0.081
 GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
{ "text": " GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONOURED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN", "tokens": [" GO", "D", " AS", " A", " DI", "RE", "C", "T", " CON", "SE", "QUE", "N", "CE", " OF", " THE", " S", "IN", " WHICH", " MAN", " TH", "US", " P", "UN", "ISH", "ED", " HAD", " GIVE", "N", " HER", " A", " LOVE", "LY", " CHILD", " WHO", "SE", " PLACE", " WAS", " ON", " THAT", " SAME", " DIS", "HO", "N", "OUR", "ED", " BO", "S", "OM", " TO", " CON", "NE", "C", "T", " HER", " P", "AR", "ENT", " FOR", "E", "VER", " WITH", " THE", " RA", "CE", " AND", " DE", "S", "C", "ENT", " OF", " MO", "R", "T", "AL", "S", " AND", " TO", " BE", " FI", "N", "AL", "LY", " A", " B", "LESS", "ED", " SO", "UL", " IN", " HE", "A", "VE", "N"], "timestamps": [0.68, 0.76, 0.96, 1.16, 1.32, 1.40, 1.48, 1.52, 1.72, 1.88, 2.00, 2.08, 2.20, 2.36, 2.56, 2.68, 2.76, 3.00, 3.24, 3.52, 3.60, 3.80, 3.88, 4.04, 4.16, 4.56, 4.76, 4.92, 5.12, 5.28, 5.48, 5.72, 5.96, 6.56, 6.72, 6.92, 7.24, 7.52, 7.72, 8.04, 8.36, 8.56, 8.64, 8.72, 8.80, 8.96, 9.08, 9.20, 9.40, 9.64, 9.76, 9.80, 9.88, 10.04, 10.24, 10.32, 10.48, 10.92, 11.12, 11.24, 11.40, 11.56, 11.76, 11.96, 12.16, 12.36, 12.44, 12.56, 12.64, 12.84, 13.08, 13.16, 13.24, 13.32, 13.48, 13.88, 14.08, 14.24, 14.40, 14.48, 14.60, 14.72, 14.92, 15.08, 15.16, 15.36, 15.68, 15.84, 16.04, 16.24, 16.32, 16.40, 16.48], "ys_probs": [], "lm_probs": [], "context_scores": [], "segment": 0, "words": [71551, 8537, 4, 47736, 36616, 124782, 175861, 161552, 194419, 107839, 177270, 141612, 76098, 70667, 80081, 4, 104488, 31039, 194902, 135523, 192633, 125555, 175810, 153596, 48296, 20159, 178313, 36439, 80081, 129836, 64736, 196099, 175861, 143255, 5411, 45974, 124782, 117231, 5411, 178313, 13845, 62319, 4, 18136, 165122, 86219, 79148], "start_time": 0.00, "is_final": false}
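For 1.wav, the streaming transcript differs from the non-streaming one in two places (DISHONOURED vs DISHONORED, FOREVER vs FOR EVER), and the word IDs differ in the same positions. A sketch using Python's `difflib` on the slice of both "words" lists that covers "THAT SAME ... WITH THE" makes this visible. The variable names are illustrative only.

```python
from difflib import SequenceMatcher

# Word IDs for "THAT SAME ... WITH THE", copied from the two 1.wav results above.
offline_ids = [175810, 153596, 48289, 20159, 178313, 36439,
               80081, 129836, 64424, 58111, 196099, 175861]
streaming_ids = [175810, 153596, 48296, 20159, 178313, 36439,
                 80081, 129836, 64736, 196099, 175861]

matcher = SequenceMatcher(a=offline_ids, b=streaming_ids, autojunk=False)

# Keep only the spans where the two ID sequences disagree.
changes = [op for op in matcher.get_opcodes() if op[0] != "equal"]
for tag, i1, i2, j1, j2 in changes:
    print(tag, offline_ids[i1:i2], "->", streaming_ids[j1:j2])
```

The first change replaces one ID with another (DISHONORED vs DISHONOURED); the second replaces two IDs with one (FOR EVER vs FOREVER), which is why the streaming list is one entry shorter.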

@w11wo
Contributor

w11wo commented Jun 6, 2024

Hi @csukuangfj, thank you so much for this PR!

I have tested it with my phoneme-based CTC model, and it is working as expected! The following are my sample results:

icefall

isymbols: həɹɛdʌmbɹɛləɪzdʒʌstθbɛst
osymbols: her red umbrella is just the best

sherpa-onnx

isymbols: həɹɛdʌmbɹɛləɪzdʒʌstθbɛst
osymbols: her red umbrella is just the best

Like you said, I had to parse words.txt to detokenize the words, but it's easy to do.

Thanks again!

@csukuangfj
Collaborator Author

Thanks for testing!

@csukuangfj csukuangfj merged commit 1a43d1e into k2-fsa:master Jun 6, 2024
162 of 207 checks passed
@csukuangfj csukuangfj deleted the fix-ctc-hlg-words branch June 6, 2024 06:22
XiaYucca pushed a commit to XiaYucca/sherpa-onnx that referenced this pull request Jan 9, 2025
Development

Successfully merging this pull request may close these issues.

Zipformer CTC HLG Decode to Words