# Automatic-Speech-Recognition-from-Scratch

## Description
A minimal Seq2Seq example of Automatic Speech Recognition (ASR) based on the Transformer.

It aims to serve as a thorough tutorial for beginners interested in training ASR models or other sequence-to-sequence models, accompanying the blog post [Guaranteed to Learn! Implementing a Transformer-Based Speech Recognition (ASR) Model from Scratch 😘](https://zhuanlan.zhihu.com/p/648133707) (in Chinese).

It contains almost everything you need to build a simple ASR model from scratch: training code, inference code, checkpoints, training logs, and inference logs.

## Data preprocessing
We use the audio part of [LRS2](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html) as our dataset.

Before launching training, you should download the train and test subsets of LRS2
and prepare `./data/LRS2/train.paths`, `./data/LRS2/train.text`, and `./data/LRS2/train.lengths` in the format that `train.py` requires.

Each line in `train.paths` represents the local path of an audio file.
Each line in `train.text` gives the corresponding transcription, and each line in `train.lengths` gives the duration of the audio in seconds.

The following table suggests a minimal example of the above three files.
| train.paths | train.text | train.lengths |
| --- | --- | --- |
| 1.wav | good morning | 1.6 |
| 2.wav | good afternoon | 2 |
| 3.wav | nice to meet you | 3.1 |
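
For reference, here is a minimal sketch of how the three files could be generated (the `samples` list and file locations are illustrative; the repo does not ship this script):

```python
# Sketch: write train.paths / train.text / train.lengths from
# (audio_path, transcript) pairs. Assumes torchaudio is installed.
import torchaudio

samples = [
    ("1.wav", "good morning"),
    ("2.wav", "good afternoon"),
    ("3.wav", "nice to meet you"),
]

with open("data/LRS2/train.paths", "w") as f_paths, \
     open("data/LRS2/train.text", "w") as f_text, \
     open("data/LRS2/train.lengths", "w") as f_len:
    for path, text in samples:
        info = torchaudio.info(path)                  # header-only read
        seconds = info.num_frames / info.sample_rate  # duration in seconds
        f_paths.write(path + "\n")
        f_text.write(text + "\n")
        f_len.write(f"{seconds:g}\n")
```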

For convenience, we have already prepared these three files; the only thing you need to do is place the audio files at the paths listed in `./data/LRS2/train.paths`. Then you are ready to go.

> 💡 If you have difficulty accessing the LRS2 dataset, you may use other ASR datasets, such as [LibriSpeech](https://www.openslr.org/12) or [TEDLIUM-v3](https://www.openslr.org/51/).
> However, in our preliminary experiments on LibriSpeech, we found that the model fails to converge under the default settings; you may need to adjust the training or model hyper-parameters.
## Build tokenizers
Before training, you also need to prepare tokenizers.
In the [blog](https://zhuanlan.zhihu.com/p/648133707), we use char-based tokenizers.
However, considering that subword-based tokenizers are more commonly used for ASR, we use subword-based tokenizers here instead.

Run `build_spm_tokenizer.sh` to build your subword-based tokenizer. You should replace the script's arguments `save_prefix` and `txt_file_path` to fit your own data.
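
Under the hood, building the tokenizer amounts to training a SentencePiece model. A minimal Python equivalent is sketched below (the vocabulary size and model type are assumptions; check `build_spm_tokenizer.sh` for the actual settings):

```python
# Sketch: train and try out a SentencePiece subword tokenizer.
# Assumes the sentencepiece package is installed.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/LRS2/train.text",   # one transcript per line (txt_file_path)
    model_prefix="spm/lrs2/lrs2",   # produces lrs2.model / lrs2.vocab (save_prefix)
    vocab_size=1000,                # assumed; tune for your corpus
    model_type="bpe",               # subword units via byte-pair encoding
)

sp = spm.SentencePieceProcessor(model_file="spm/lrs2/lrs2.model")
print(sp.encode("good morning", out_type=str))  # prints subword pieces
```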

We have already provided tokenizers, located in the directory `spm/lrs2`. You can use them directly.

## Training
Usage: `python train.py <feature_extractor_type> <dataset_type>`

We support two types of feature extractors: a linear layer and a 1D-ResNet18.
> The 1D-ResNet18 is based on the implementation of this [repo](https://github.com/mpc001/Lipreading_using_Temporal_Convolutional_Networks).
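
To make the distinction concrete, here is an illustrative sketch of what a "linear" feature extractor can look like (the frame size and model dimension are assumptions, not the repo's exact values):

```python
# Sketch: a linear feature extractor that chops raw audio into fixed-size
# frames and projects each frame to the Transformer's model dimension.
import torch
import torch.nn as nn

class LinearFeatureExtractor(nn.Module):
    def __init__(self, frame_size: int = 320, d_model: int = 256):
        super().__init__()
        self.frame_size = frame_size
        self.proj = nn.Linear(frame_size, d_model)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, num_samples) raw waveform
        b, n = wav.shape
        n = n - n % self.frame_size                    # drop the ragged tail
        frames = wav[:, :n].reshape(b, -1, self.frame_size)
        return self.proj(frames)                       # (batch, frames, d_model)
```

A 1D-ResNet extractor instead applies a stack of strided 1D convolutions, which usually yields a richer local representation at extra compute cost.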

We support two types of datasets: LRS2 and LibriSpeech.

For example, `python3 ./train.py resnet lrs2`.

The training logs are located in the `log` directory, containing the loss history and model details.

We highly encourage users to read the code thoroughly if they want to customize their own datasets or understand the details of the training process.

## Inference
Usage: `python test.py <feature_extractor_type> <dataset_type> <checkpoint_path>`

For example, `python3 ./test.py resnet lrs2 ./ckpts/resnet_lrs2_epoch050.pt`.

The checkpoints are located in the `ckpts` directory; we provide checkpoints for both the linear and 1D-ResNet feature extractors.

The inference logs are located in the `log` directory, containing the predictions for each sample.

## Warning
This repository differs slightly from the [blog](https://zhuanlan.zhihu.com/p/648133707) mentioned above in the following aspects:
- We use pre-norm instead of post-norm (see the sketch below);
- We use subword-based tokenizers instead of char-based tokenizers;
- We add support for a 1D-ResNet feature extractor.
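
For readers unfamiliar with the pre-norm/post-norm distinction, the following snippet contrasts the two residual-block layouts (this is a generic sketch, not the repo's actual layer code):

```python
# Sketch: pre-norm vs. post-norm residual blocks around a generic sublayer
# (self-attention or feed-forward).
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        # Pre-norm: normalize before the sublayer; the residual path stays
        # identity, which tends to make deep Transformers easier to train.
        return x + self.sublayer(self.norm(x))

class PostNormBlock(nn.Module):
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        # Post-norm (original Transformer): normalize after the residual add.
        return self.norm(x + self.sublayer(x))
```

In PyTorch's built-in `nn.TransformerEncoderLayer`, the same switch is exposed as the `norm_first` flag.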

## Contact
bingquanxia AT qq.com