This repository was archived by the owner on Oct 10, 2022. It is now read-only.

Commit 6f5b0fe (1 parent: 41db59d)

**OPUS torrent micro release**

- Dataset conversion to OPUS
- OPUS torrent
- OPUS helpers and build instructions
- Coming soon - new unlimited direct links
- Further reading links

3 files changed: +206 −90 lines

README.md (+129 −47)
@@ -6,51 +6,50 @@

[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
[![Mailing list : test](https://img.shields.io/badge/Contact-Authors-blue.svg)](mailto:open_stt@googlegroups.com)

# **Russian Open Speech To Text (STT/ASR) Dataset**

Arguably the largest public Russian STT dataset to date:

- ~16m utterances (1-2m with less perfect annotation, see [#7](https://github.com/snakers4/open_stt/issues/7));
- ~20 000 hours;
- 2.3 TB (in `.wav` format in `int16`), 356G in `.opus`;
- (**new!**) A new domain - public speech;
- (**new!**) A huge Radio dataset update with **10 000+ hours**;
- (**new!**) Utils for working with OPUS;
- (**Coming soon!**) A new OPUS torrent and **unlimited direct links**.

Prove [us](mailto:open_stt@googlegroups.com) wrong!
Open issues, collaborate, submit a PR, contribute, share your datasets!
Let's make STT in Russian (and more) as open and available as CV models.

2324
**Planned releases:**
2425

25-
- Refine and publish speaker labels, probably add speakers for old datasets;
26-
- Improve / re-upload some of the existing datasets, refine the STT labels;
27-
- Probably add new languages;
28-
- Add pre-trained models;
26+
- Working on a new project with 3 more languages, stay tuned!
2927

3028
# **Table of contents**
3129

- [Dataset composition](https://github.com/snakers4/open_stt/#dataset-composition)
- [Downloads](https://github.com/snakers4/open_stt/#downloads)
- [Via torrent](https://github.com/snakers4/open_stt/#via-torrent)
- [Links](https://github.com/snakers4/open_stt/#links)
- [Download-instructions](https://github.com/snakers4/open_stt/#download-instructions)
- [End-to-end download scripts](https://github.com/snakers4/open_stt/#end-to-end-download-scripts)
- [Annotation methodology](https://github.com/snakers4/open_stt/#annotation-methodology)
- [Audio normalization](https://github.com/snakers4/open_stt/#audio-normalization)
- [Disk db methodology](https://github.com/snakers4/open_stt/#on-disk-db-methodology)
- [Helper functions](https://github.com/snakers4/open_stt/#helper-functions)
- [How to open opus](https://github.com/snakers4/open_stt/#how-to-open-opus)
- [Contacts](https://github.com/snakers4/open_stt/#contacts)
- [Acknowledgements](https://github.com/snakers4/open_stt/#acknowledgements)
- [FAQ](https://github.com/snakers4/open_stt/#faq)
- [License](https://github.com/snakers4/open_stt/#license)
- [Donations](https://github.com/snakers4/open_stt/#donations)
- [Further reading](https://github.com/snakers4/open_stt/#further-reading)

# **Dataset composition**

| Dataset | Utterances | Hours | GB | Av s/chars | Comment | Annotation | Quality/noise |
|---------------------------|------------|--------|-------|------------|------------------|---------------|---------------------|
| radio_v4 | 7,603,192 | 10,430 | 1,195 | 4.94s / 68 | Radio | Alignment (*) | 95% / crisp |
| public_speech | 1,700,060 | 2,709 | 301 | 5.73s / 79 | Public speech | Alignment (*) | 95% / crisp |
| audiobook_2 | 1,149,404 | 1,511 | 162 | 4.7s / 56 | Books | Alignment (*) | 95% / crisp |
| radio_2 | 651,645 | 1,439 | 154 | 7.95s / 110 | Radio | Alignment (*) | TBC, should be high |
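As a sanity check, the Hours column is consistent with Utterances × average utterance length from the `Av s/chars` column; a quick back-of-the-envelope sketch for the radio_v4 row:

```python
# Back-of-the-envelope check: utterances * average seconds should match the listed hours.
utterances = 7_603_192   # radio_v4 "Utterances" column
avg_seconds = 4.94       # radio_v4 "Av s/chars" column (seconds part)
hours = utterances * avg_seconds / 3600
print(f"{hours:,.0f} hours")  # close to the listed 10,430 hours
```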
@@ -77,6 +76,15 @@ This alignment was performed using Yuri's alignment tool.

# **Updates**

## **_Update 2020-05-04_**

**Migration to OPUS**

- Conversion of the whole dataset to OPUS
- New OPUS torrent
- Added OPUS helpers and build instructions
- Coming soon - **new unlimited direct downloads**
## **_Update 2020-02-07_**

**Temporarily Deprecated Direct MP3 Links:**

@@ -87,10 +95,10 @@ This alignment was performed using Yuri's alignment tool.

**New train datasets added:**

- 10,430 hours radio_v4;
- 2,709 hours public_speech;
- 154 hours radio_v4_add;
- 5% sample of all new datasets with annotation.

<details>
<summary>Click to expand</summary>
@@ -144,16 +152,16 @@ This alignment was performed using Yuri's alignment tool.

## **Via torrent**

- ~~An **MP3** [version](http://academictorrents.com/details/4a2656878dc819354ba59cd29b1c01182ca0e162) of the dataset (v3)~~ DEPRECATED;
- ~~A **WAV** [version](https://academictorrents.com/details/a7929f1d8108a2a6ba2785f67d722423f088e6ba) of the dataset (v5)~~ DEPRECATED;
- An **OPUS** [version](https://academictorrents.com/details/95b4cab0f99850e119114c8b6df00193ab5fa34f) of the dataset (v1.01).

You can download separate files via torrent.
It looks like, due to the large chunk size, most conventional torrent clients just fail silently.
No problem (re-calculating the torrent takes much time, and some people have downloaded it already), use `aria2c`:

```bash
apt update
apt install aria2
# list the torrent files
```

@@ -165,11 +173,16 @@

```bash
aria2c --select-file=4 ru_open_stt_wav_v10.torrent
# https://aria2.github.io/manual/en/html/aria2c.html#bittorrent-metalink-options
# https://aria2.github.io/manual/en/html/aria2c.html#bittorrent-specific-options
```

If you are using Windows, you may use the **Windows Subsystem for Linux** to run these commands.

## **Links**

**Coming soon** - new direct OPUS links!

All WAV and MP3 files / links / torrents will be superseded by OPUS.

Total size of OPUS files is about 356G, so OPUS is ~10% smaller than MP3.
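Relative to the original 2.3 TB of int16 WAV, the OPUS release is much smaller still; a quick sketch using the sizes quoted in this README:

```python
# Rough compression factor of the OPUS release vs. the int16 WAV original.
wav_gb = 2.3 * 1024   # ~2.3 TB of .wav, per the dataset description
opus_gb = 356         # total size of the .opus files
print(f"{wav_gb / opus_gb:.1f}x smaller")  # roughly 6-7x
```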

| Dataset | GB, wav | GB, mp3 | Mp3 | Source | Manifest |
|---------------------------------------|------|----------------|-----| -------| ----------|
@@ -198,25 +211,30 @@ All **WAV** files can be downloaded ONLY via [torrent](https://academictorrents.

### End to end

`download.sh`

or

`download.py` with this config [file](https://github.com/snakers4/open_stt/blob/master/md5sum.lst). Please check the config first.
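Since the config is a checksum list, it is worth verifying each finished download. A minimal sketch, assuming the common `<md5>  <filename>` line format (the actual layout of `md5sum.lst` may differ):

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Stream a file through md5 without loading it into RAM."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# Compare against the checksum listed in md5sum.lst for that file, e.g.:
# assert file_md5('some_file.tar.gz') == expected_md5
```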

### Manually

1. Download each dataset separately:

Via `wget`:

```bash
wget https://ru-open-stt.ams3.digitaloceanspaces.com/some_file
```

For multi-threaded downloads, use aria2 with the `-x` flag, e.g.:

```bash
aria2c -c -x5 https://ru-open-stt.ams3.digitaloceanspaces.com/some_file
```

If necessary, merge chunks like this:

```bash
cat ru_open_stt_v01.tar.gz_* > ru_open_stt_v01.tar.gz
```
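On Windows without WSL (or anywhere `cat` is unavailable), the same merge can be done with a few lines of Python. A minimal sketch; the chunk suffix pattern is assumed from the `cat` glob above:

```python
import glob
import shutil

def merge_chunks(prefix, out_path):
    """Concatenate sorted chunks (prefix_aa, prefix_ab, ...) into one file,
    mirroring: cat ru_open_stt_v01.tar.gz_* > ru_open_stt_v01.tar.gz"""
    with open(out_path, 'wb') as out:
        for chunk in sorted(glob.glob(prefix + '_*')):
            with open(chunk, 'rb') as part:
                shutil.copyfileobj(part, out)

# merge_chunks('ru_open_stt_v01.tar.gz', 'ru_open_stt_v01.tar.gz')
```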
@@ -276,7 +294,7 @@ manifest_df = read_manifest('path/to/manifest.csv')

<details><summary>See example</summary>
<p>

```python
from utils.open_stt_utils import (plain_merge_manifests,
                                  check_files,
                                  save_manifest)
```

@@ -295,6 +313,45 @@ save_manifest(train_manifest,

</p>
</details>
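Conceptually, the merge step above boils down to concatenating manifests and dropping rows whose audio files are missing. A stdlib sketch of that idea, illustrative only: the real `open_stt_utils` signatures may differ, and the two-column `wav_path,text` layout is an assumption:

```python
import csv
import os

def merge_manifest_rows(paths):
    """Concatenate several manifest CSVs into one list of rows."""
    rows = []
    for path in paths:
        with open(path, newline='', encoding='utf-8') as f:
            rows.extend(csv.reader(f))
    return rows

def drop_missing(rows):
    """Keep only rows whose audio file actually exists on disk."""
    return [row for row in rows if os.path.isfile(row[0])]

def write_manifest(rows, path):
    """Write the cleaned rows back out as a CSV manifest."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows(rows)
```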

# **How to open opus**

The most efficient way we know of to read OPUS files in Python without incurring significant overhead (i.e. launching subprocesses, or daisy-chaining libraries with sox, FFmpeg, etc.) is `pysoundfile` (a Python CFFI wrapper around libsndfile).

When this solution was being researched, the community had been waiting for a major libsndfile release for years. OPUS support was implemented upstream some time ago, but it has not been properly released. Therefore we opted for a custom build + monkey patching.

By the time you read / use this, there will probably be decent / proper builds of libsndfile.

## **Building libsndfile**

```bash
apt-get update
apt-get install cmake autoconf autogen automake build-essential libasound2-dev \
  libflac-dev libogg-dev libtool libvorbis-dev libopus-dev pkg-config -y

cd /usr/local/lib
git clone https://github.com/erikd/libsndfile.git
cd libsndfile
git reset --hard 49b7d61
mkdir -p build && cd build

cmake .. -DBUILD_SHARED_LIBS=ON
make && make install
cmake --build .
```

## **Patched pysoundfile wrapper**

```python
import utils.soundfile_opus as sf

path = 'path/to/file.opus'
audio, sr = sf.read(path, dtype='int16')
```

## **Known issues**

There is an upstream bug in libsndfile that prevents writing large files (90-120 s) in `opus` / `vorbis` format. It will most likely be fixed in an upcoming major libsndfile release.
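Until that lands, a practical workaround is to split long recordings before writing. A minimal sketch over an int16 sample array; the 60 s cap is an assumption, chosen safely below the reported 90-120 s failure range:

```python
def split_for_writing(samples, sr=16000, max_seconds=60):
    """Split a long utterance into chunks short enough to write safely."""
    max_len = sr * max_seconds
    return [samples[i:i + max_len] for i in range(0, len(samples), max_len)]

# Each chunk can then be written as its own .opus file, e.g.:
# for n, chunk in enumerate(split_for_writing(audio, sr)):
#     sf.write(f'clip_{n:03d}.opus', chunk, sr)
```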

# **Contacts**

Please contact us [here](mailto:open_stt@googlegroups.com) or just create a GitHub issue!

@@ -310,16 +367,17 @@ Please contact us [here](mailto:open_stt@googlegroups.com) or just create a GitH

# **Acknowledgements**

This repo would not be possible without these people:

- Many thanks to [akreal](https://nuget.pkg.github.com/akreal) for helping encode the initial bulk of the data into MP3;
- 18 hours of ground truth annotation datasets for validation are a courtesy of [activebc](https://activebc.ru/).

Kudos!
# **FAQ**

## **0. ~~Why not MP3?~~ MP3 encoding / decoding** - DEPRECATED

### **Encoding**

Mostly we used `pydub` (via ffmpeg) or `sox` (a much, much faster way) to convert to MP3.
We omitted blank files (YouTube mostly).
@@ -367,8 +425,7 @@ if res != 0:

</p>
</details>

### **Decoding**

It is up to you, but to save space and spare CPU during training, I would suggest the following pipeline to extract the files:

@@ -432,15 +489,15 @@ wav_path = save_wav_diskdb(wav,

</p>
</details>

#### **Why not OGG / Opus** - DEPRECATED

Even though OGG / Opus is considered better for speech, with higher compression, we opted for a more conventional, well-known format.

The LPCNet codec also boasts ultra-low-bitrate speech compression. But we decided to opt for a more familiar format, to avoid worrying about actually losing signal in compression.
## **1. Issues with reading files**

### **Maybe try this approach:**

<details><summary>See example</summary>
<p>
@@ -461,28 +518,53 @@ if abs_max>0:

## **2. Why share such a dataset?**

We are not altruists; life just is **not a zero-sum game**.

Consider the progress in computer vision that was made possible by:

- Public datasets;
- Public pre-trained models;
- Open source frameworks;
- Open research.

STT does not enjoy the same attention from the ML community because it is data-hungry and public datasets are lacking, especially for languages other than English.
Ultimately this leads to a worse-off situation for the general community.

## **3. Known issues with the dataset to be fixed**

- Speaker labels coming soon;
- Validation sets for new domains: Radio / Public Speech will be added in next releases.

## **4. Why migrate to OPUS?**

After extensive testing, both during training and validation, we confirmed that converting 16 kHz int16 data to OPUS at the very least does not degrade quality.

Being designed for speech, OPUS even at default compression rates takes less space than MP3 and does not introduce artefacts.

Some people even reported quality improvements when training using OPUS.
# **License**

![cc-nc-by-license](https://static.wixstatic.com/media/342407_05e016f9f44240429203c35dfc8df63b~mv2.png/v1/fill/w_563,h_200,al_c,lg_1,q_80/342407_05e016f9f44240429203c35dfc8df63b~mv2.webp)

CC BY-NC; commercial usage is available after agreement with the dataset authors.

# **Donations**

[Donate](https://buymeacoff.ee/8oneCIN) (each coffee pays for several full downloads), or give via [open_collective](https://opencollective.com/open_stt), or just use our DO referral [link](https://sohabr.net/habr/post/357748/) to help.

# **Further reading**

## **English**

- https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/
- https://thegradient.pub/a-speech-to-text-practitioners-criticisms-of-industry-and-academia/

## **Chinese**

- https://www.infoq.cn/article/4u58WcFCs0RdpoXev1E2

## **Russian**

- https://habr.com/ru/post/494006/
- https://habr.com/ru/post/474462/