A minimal Seq2Seq example of Automatic Speech Recognition (ASR) based on Transformer

- [x] (2024.2.10) release training and inference code
- [ ] (2024.4.7) release training log and checkpoints of LRS2
- [ ] (2024.4.10) release training log and checkpoints of LibriSpeech

It aims to serve as a thorough tutorial for beginners who are interested in training ASR models or other sequence-to-sequence models, following the (Chinese-language) blog post [包教包会!从零实现基于Transformer的语音识别(ASR)模型😘](https://zhuanlan.zhihu.com/p/648133707).

It contains almost everything you need to build a simple ASR model from scratch: training code, inference code, checkpoints, training logs, and inference logs.

## Data preprocessing

We use the audio part of [LRS2](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html) as our dataset.

Before launching training, you should download the train and test subsets of LRS2 and prepare `./data/LRS2/train.paths`, `./data/LRS2/train.text`, and `./data/LRS2/train.lengths` in the format that `train.py` requires.
Each line in `train.paths` is the local path of an audio file, and the corresponding lines in `train.text` and `train.lengths` hold its transcription and its length.

The following table suggests a minimal example of the above three files.

| train.paths | train.text | train.lengths |
| --- | --- | --- |
| 1.wav | good morning | 1.6 |
| 2.wav | good afternoon | 2 |
| 3.wav | nice to meet you | 3.1 |

> 💡 Use `torchaudio`, `ffmpeg`, or any other tool to get the length information of the audio files (see the sketch below).
>
> If you are experiencing convergence issues, try subword-based tokenizers ([ref](https://github.com/google/sentencepiece)) or more sophisticated feature extractors (e.g. [1D ResNet](https://github.com/mpc001/Lipreading_using_Temporal_Convolutional_Networks/blob/0cc99fab046bd959578f2baac52e654746da4825/lipreading/models/resnet1D.py#L75)).
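
As a rough sketch of that step (the `samples` list and file locations below are illustrative, and it is assumed that `train.lengths` stores durations in seconds), the three files could be produced with `torchaudio` like this:

```python
# Illustrative sketch: write train.paths / train.text / train.lengths from
# (audio path, transcript) pairs. Paths and the seconds unit are assumptions.
import torchaudio

samples = [
    ("./data/LRS2/audio/1.wav", "good morning"),
    ("./data/LRS2/audio/2.wav", "good afternoon"),
    ("./data/LRS2/audio/3.wav", "nice to meet you"),
]

with open("./data/LRS2/train.paths", "w") as f_paths, \
     open("./data/LRS2/train.text", "w") as f_text, \
     open("./data/LRS2/train.lengths", "w") as f_len:
    for path, text in samples:
        info = torchaudio.info(path)                  # header-only read, no decoding
        seconds = info.num_frames / info.sample_rate  # duration in seconds
        f_paths.write(path + "\n")
        f_text.write(text + "\n")
        f_len.write(f"{seconds:.1f}\n")
```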
For convenience, we have prepared the three files above; the only thing you need to do is place the audio files at the locations listed in `./data/LRS2/train.paths`. Then you are ready to go.
> 💡 If you have difficulty accessing the LRS2 dataset, you may use other ASR datasets instead, such as [LibriSpeech](https://www.openslr.org/12) or [TEDLIUM-v3](https://www.openslr.org/51/).
> However, in our preliminary experiments on LibriSpeech, the model failed to converge under the default settings; you may need to adjust the training or model hyper-parameters.
## Build tokenizers
Before training, you also need to prepare tokenizers.
In the [blog](https://zhuanlan.zhihu.com/p/648133707), we use char-based tokenizers.
However, considering that subword-based tokenizers are more commonly used for ASR, we use a subword-based tokenizer here instead.
Run `build_spm_tokenizer.sh` to build your subword-based tokenizer. You should replace the script's arguments `save_prefix` and `txt_file_path` to fit your own data.
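
For reference, the same kind of tokenizer can be built with the `sentencepiece` Python API directly; the sketch below only mirrors what such a script typically does, and the `vocab_size`, `model_type`, and output prefix are assumptions rather than the script's actual settings.

```python
# Illustrative sketch of building a subword tokenizer with SentencePiece.
# vocab_size, model_type, and the output prefix are assumed values.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="./data/LRS2/train.text",  # txt_file_path: one transcript per line
    model_prefix="./spm/lrs2/lrs2",  # save_prefix: writes lrs2.model and lrs2.vocab
    vocab_size=1000,
    model_type="unigram",
    character_coverage=1.0,
)
```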
We have already provided tokenizers, located in the directory `spm/lrs2`. You could use them directly.
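
A minimal usage sketch (the exact `.model` file name under `spm/lrs2` is an assumption; adjust it to whatever the directory actually contains):

```python
# Illustrative sketch: encode and decode text with a trained SentencePiece model.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="./spm/lrs2/lrs2.model")  # assumed file name
ids = sp.encode("good morning", out_type=int)  # text -> subword ids
print(ids)
print(sp.decode(ids))                          # ids -> text
```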

## Training

Launch training with `python3 ./train.py`.