This repository is an unofficial PyTorch implementation of the Voxseg VAD model. You can check out the original Voxseg code here. All credit goes to the original authors.
Voxseg is a Python package for voice activity detection (VAD), for speech/non-speech audio segmentation. It provides a full VAD pipeline, including a pretrained VAD model, and it is based on work presented here.
Use of this VAD may be cited as follows:
title = {A hybrid {CNN-BiLSTM} voice activity detector},
author = {Wilkinson, N. and Niesler, T.},
booktitle = {Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
year = {2021},
address = {Toronto, Canada},
The code was built using the AVA Speech dataset downloaded from this repository. If you haven't yet downloaded the dataset, it is recommended to download it the same way we did, so that the code finds the data in the layout it expects.
To install this package, clone the repository from GitHub to a directory of your choice and install using pip:
git clone
pip install ./voxseg
You need to create a conda environment and install the requirements:
conda create -n venv python=3.8.10
conda activate venv
pip install -r requirements.txt
To test the installation run:
cd voxseg-pytorch
Before using the VAD, a number of files need to be created to specify the audio that one wishes to process. These files are the same as those used by the Kaldi toolkit. Extensive documentation on the data preparation process for Kaldi may be found here. Only the files required by the Voxseg toolkit are described here.
- this file provides the paths to the audio files one wishes to process, and assigns each a unique recording-id. It is structured as follows: <recording-id> <extended-filename>. Each entry should appear on a new line, for example:
rec_000 wavs/some_raw_audio.wav
rec_001 wavs/some_more_raw_audio.wav
rec_002 wavs/yet_more_raw_audio.wav
Note that the <extended-filename> may be an absolute path or relative path, except when using Docker or Singularity, where paths relative to the mount point must be used.
- this file is optional and specifies segments within the audio file to be processed by the VAD (useful if one only wants to run the VAD on a subset of the full audio files). If this file is not present the full audio files will be processed. This file is structured as follows: <utterance-id> <recording-id> <segment-begin> <segment-end>, where <segment-begin> and <segment-end> are in seconds. Each entry should appear on a new line, for example:
rec_000_part_1 rec_000 20.5 142.6
rec_000_part_2 rec_000 362.1 421.0
rec_001_part_1 rec_001 45.9 89.4
rec_001_part_2 rec_001 97.7 130.0
rec_001_part_3 rec_001 186.9 241.8
rec_002_full rec_002 0.0 350.0
These two files should be placed in the same directory, usually named data, although you may give it any name. This is the directory that is provided as input to voxseg's feature extraction.
If you haven't created the files above, the code will create them for you automatically.
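For those who prefer to prepare the data directory by hand, the following is a minimal sketch of how the two files could be written. It assumes the recording list uses Kaldi's conventional file name wav.scp and that the segment list is named segments; the helper write_data_dir is hypothetical and not part of this package.

```python
from pathlib import Path


def write_data_dir(data_dir, recordings, segments=None):
    """Write a Kaldi-style data directory (hypothetical helper).

    recordings: dict mapping recording-id -> audio path
    segments:   optional list of (utterance-id, recording-id, begin_s, end_s)
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)

    # One "<recording-id> <extended-filename>" entry per line.
    # "wav.scp" is Kaldi's conventional name for this file (assumed here).
    wav_lines = [f"{rec_id} {path}" for rec_id, path in sorted(recordings.items())]
    (data_dir / "wav.scp").write_text("\n".join(wav_lines) + "\n")

    if segments:
        seg_lines = []
        for utt_id, rec_id, begin, end in segments:
            if rec_id not in recordings:
                raise ValueError(f"unknown recording-id: {rec_id}")
            if not 0 <= begin < end:
                raise ValueError(f"bad segment bounds for {utt_id}")
            # One "<utterance-id> <recording-id> <begin> <end>" entry per line.
            seg_lines.append(f"{utt_id} {rec_id} {begin} {end}")
        (data_dir / "segments").write_text("\n".join(seg_lines) + "\n")
    return data_dir
```

Calling write_data_dir("data", {"rec_000": "wavs/some_raw_audio.wav"}) writes only the recording list, in which case the full audio files would be processed.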
The package may be used in a number of ways:
- The full VAD can be run with a single script.
- Smaller scripts may be called to run different parts of the pipeline separately, for example feature extraction, then VAD. Useful if one is tuning the parameters of the VAD, and would like to avoid recomputing the features for every experiment.
- As a module within python, useful if one would like to integrate parts of the system into one's own python code.
This package may be used through a basic command-line interface. To run the full VAD pipeline with default settings, navigate to the voxseg-pytorch directory and call:
# data_directory is Kaldi-style data directory and output_directory is destination for segments file
# if the model was trained with binary classification, --binary_classification must also be passed at inference
python3 data_directory output_directory --binary_classification
To explore the available flags for changing settings, navigate to the voxseg directory and call:
python3 -h
The most commonly used flags are:
- -s: sets the speech vs non-speech decision threshold (accepts a float between 0 and 1, default is 0.5)
- -f: adds median filtering to smooth the output (accepts an odd integer for the kernel size, default is 1)
- -e: allows a reference directory to be given, against which the VAD output is scored (accepts a path to a Kaldi-style directory containing a ground truth segments file)
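To illustrate what the threshold and median-filter settings do conceptually, here is a small pure-Python sketch operating on per-frame speech probabilities. It is not the package's implementation: the function names, the strict ">" comparison at the threshold, and the edge padding are all assumptions made for this example.

```python
from statistics import median


def threshold(speech_probs, thresh=0.5):
    """Return 1 (speech) where the probability exceeds thresh, else 0."""
    return [1 if p > thresh else 0 for p in speech_probs]


def median_filter(labels, kernel_size=1):
    """Median-filter a 0/1 label sequence; kernel_size must be odd."""
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    if not labels:
        return []
    half = kernel_size // 2
    # Assumption: edges are padded by repeating the first and last label.
    padded = [labels[0]] * half + list(labels) + [labels[-1]] * half
    return [int(median(padded[i:i + kernel_size])) for i in range(len(labels))]


probs = [0.1, 0.2, 0.9, 0.3, 0.8, 0.9, 0.7, 0.2]
raw = threshold(probs)               # [0, 0, 1, 0, 1, 1, 1, 0]
smooth = median_filter(raw, 3)       # [0, 0, 0, 1, 1, 1, 1, 0]
```

With a kernel size of 1 the labels pass through unchanged, which matches the default of no smoothing. A larger kernel removes isolated one-frame decisions, such as the lone speech frame at index 2 above.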
To run the smaller, individual scripts, navigate to the voxseg-pytorch directory and call:
# reads Kaldi-style data directory and extracts features to .h5 file in output directory
python3 voxseg/ data_directory output_directory
# runs VAD and saves output segments file in output directory
python3 voxseg/ -m model_path features_directory output_directory
# reads Kaldi-style data directory used as VAD input, the VAD output directory and a directory
# containing a ground truth segments file reference.
python3 voxseg/ vad_input_directory vad_out_directory ground_truth_directory
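As a rough illustration of what scoring VAD output against a ground truth segments file can involve, the sketch below computes a frame-level agreement rate between two lists of speech segments. The metric, the frame step, and the function names are assumptions for this example and may differ from what the evaluation script reports.

```python
def to_frames(segments, total_s, step_s=0.01):
    """Convert (begin_s, end_s) speech segments to a list of per-frame booleans."""
    n_frames = int(round(total_s / step_s))
    frames = [False] * n_frames
    for begin, end in segments:
        first = max(0, int(round(begin / step_s)))
        last = min(n_frames, int(round(end / step_s)))
        for i in range(first, last):
            frames[i] = True
    return frames


def frame_accuracy(reference, hypothesis, total_s, step_s=0.01):
    """Fraction of frames on which reference and hypothesis agree."""
    ref = to_frames(reference, total_s, step_s)
    hyp = to_frames(hypothesis, total_s, step_s)
    if not ref:
        raise ValueError("total_s must cover at least one frame")
    agree = sum(r == h for r, h in zip(ref, hyp))
    return agree / len(ref)


# Reference speech from 0.0 to 1.0 s, hypothesis only finds 0.0 to 0.5 s,
# in a 2-second recording: the two disagree on a quarter of the frames.
score = frame_accuracy([(0.0, 1.0)], [(0.0, 0.5)], total_s=2.0, step_s=0.1)
```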
To import the module and use it within custom Python scripts/modules:
import torch
import voxseg
from voxseg import extract_feats, run_cnnlstm # assumes these submodules are part of the voxseg package
# feature extraction
data = extract_feats.prep_data('path/to/input/data') # prepares audio from Kaldi-style data directory
feats = extract_feats.extract(data) # extracts log-mel filterbank spectrogram features
normalized_feats = extract_feats.normalize(feats) # normalizes the features
# model execution
model = voxseg.Voxseg(2) # number of classes
model.load_state_dict(torch.load('path/to/model.pth')['model_state_dict']) # loads a pretrained VAD model
predicted_labels = run_cnnlstm.predict_labels(model, normalized_feats) # runs the VAD model on features
# the predicted labels can then be saved to an .h5 file, e.g. 'path/for/output/labels.h5'
A basic training script is provided in the root directory of the project.
To use this script the following files are required in a Kaldi-style data directory:
- this file provides the paths to the audio files one wishes to use for training, and assigns each a unique recording-id. It is structured as follows: <recording-id> <extended-filename>. Each entry should appear on a new line, for example:
rec_000 wavs/some_raw_audio.wav
rec_001 wavs/some_more_raw_audio.wav
Note that the <extended-filename> may be an absolute path or relative path, except when using Docker or Singularity, where paths relative to the mount point must be used.
- this file specifies the start and end points of each labelled segment within the audio file. Note that this differs from the way this file is used when provided for decoding. This file is structured as follows: <utterance-id> <recording-id> <segment-begin> <segment-end>, where <segment-begin> and <segment-end> are in seconds. Each entry should appear on a new line, for example:
rec_000_00 rec_000 0.0 4.3
rec_000_01 rec_000 4.3 7.2
rec_000_02 rec_000 7.2 14.8
rec_000_03 rec_000 14.8 19.5
rec_001_00 rec_001 0.0 8.5
rec_001_01 rec_001 8.5 12.2
rec_001_02 rec_001 12.2 16.1
rec_001_03 rec_001 16.1 18.9
rec_001_04 rec_001 18.9 22.0
- this file specifies the label attached to each segment defined within the segments file. This file is structured as follows: <utterance-id> <label>. Each entry should appear on a new line, for example:
rec_000_00 speech
rec_000_01 non_speech
rec_000_02 speech
rec_000_03 non_speech
rec_001_00 non_speech
rec_001_01 speech
rec_001_02 non_speech
rec_001_03 speech
rec_001_04 non_speech
If you haven't created the files above, the code will create them for you automatically.
Note that the model may be trained with 2 classes ('speech', 'non_speech') as shown in the above example, or with the 4 classes from the AVA-Speech dataset ('clean_speech', 'no_speech', 'speech_with_music', 'speech_with_noise') (the default model used by the toolkit).
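Before training, it can be useful to check that every segment has a label and to see how the labelled time is distributed across classes. The sketch below joins segment entries with label entries, both given as lists of text lines in the formats described above. The function class_durations is a hypothetical helper, not part of this package.

```python
from collections import defaultdict


def class_durations(segment_lines, label_lines):
    """Return total labelled duration in seconds per class label."""
    # Each label line is "<utterance-id> <label>".
    labels = dict(line.split() for line in label_lines if line.strip())
    totals = defaultdict(float)
    for line in segment_lines:
        if not line.strip():
            continue
        # Each segment line is "<utterance-id> <recording-id> <begin> <end>".
        utt_id, _rec_id, begin, end = line.split()
        if utt_id not in labels:
            raise KeyError(f"no label for utterance {utt_id}")
        totals[labels[utt_id]] += float(end) - float(begin)
    return dict(totals)


segment_lines = [
    "rec_000_00 rec_000 0.0 4.3",
    "rec_000_01 rec_000 4.3 7.2",
    "rec_000_02 rec_000 7.2 14.8",
    "rec_000_03 rec_000 14.8 19.5",
]
label_lines = [
    "rec_000_00 speech",
    "rec_000_01 non_speech",
    "rec_000_02 speech",
    "rec_000_03 non_speech",
]
durations = class_durations(segment_lines, label_lines)
```

For the four rec_000 entries above, this gives roughly 11.9 seconds of speech and 7.6 seconds of non_speech.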
To use the training script with a specific validation set run:
# use --validation_dir to specify a Kaldi style data directory to be used as validation set
python3 --validation_dir val_dir train_dir model_name out_dir
To use the training script with a percentage of the training data as a validation set run:
# use --validation_split to specify a percentage of the training data to be used as a validation set
python3 --validation_split 0.1 train_dir model_name out_dir
To use the training script with binary classification run:
# use --binary_classification to use only two classes ('speech' and 'non_speech')
python3 --validation_dir val_dir train_dir model_name out_dir --binary_classification
The training script may also be used without any flags; however, this is not recommended, as it makes it difficult to tell whether the model is starting to overfit. When a validation set is provided, the model with the best validation accuracy is saved. When no validation set is provided, the model is saved after the final training epoch.
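The saving rule described above can be sketched as a small selection function. This only illustrates the logic and is not the training script's code: the function name is hypothetical, and keeping the earliest epoch when accuracies tie is an assumption.

```python
def select_checkpoint(num_epochs, val_accuracies=None):
    """Return the 0-based index of the epoch whose model would be kept."""
    if num_epochs < 1:
        raise ValueError("num_epochs must be at least 1")
    if val_accuracies:
        # With a validation set: keep the epoch with the best validation accuracy.
        # Assumption: on ties, the earliest such epoch is kept.
        best = max(val_accuracies)
        return val_accuracies.index(best)
    # Without a validation set: keep the model from the final epoch.
    return num_epochs - 1


with_val = select_checkpoint(5, [0.80, 0.86, 0.91, 0.89, 0.90])
without_val = select_checkpoint(5)
```

In the first call the model from the third epoch (index 2) is kept even though training continued, which is what protects against saving an overfitted model. In the second call there is nothing to compare against, so the last epoch (index 4) is kept.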