Python implementations of some clustering-based diarization systems.
- End-to-end recipes (from unsegmented audio to evaluation) for LibriCSS, AMI, AISHELL-4, and AliMeeting.
- Using Lhotse for data preparation.
- Using Pyannote 2.0 models for VAD and overlap detection.
- Scripts for fine-tuning Pyannote models on AMI, AISHELL-4, and AliMeeting (fine-tuned models also provided).
- VBx and x-vector extraction from BUT's implementation.
- Kaldi implementation of overlap-aware spectral clustering.
We recommend installation within a virtual environment (such as Conda) so that the package versions are consistent. To create a new Conda environment:
> conda create -n diar python=3.8
> conda activate diar
Once the environment is activated, clone and install the package as:
> git clone
> cd diarizer
> pip install -e .
The run scripts additionally use some Kaldi utilities, since we submit multiple jobs (usually one job per audio file). You may need to modify these if you are running in a different environment. Alternatively, if you have Kaldi installed somewhere, you can make a symbolic link to the utils folder as:
> KALDI_ROOT=/path/to/kaldi
> ln -s $KALDI_ROOT/egs/wsj/s5/utils .
NOTE: To run overlap-aware spectral clustering, please install scikit-learn from my fork.
End-to-end runnable recipes are provided in the scripts directory. The scripts must be invoked from the root directory, for example: scripts/ami/
Each recipe (LibriCSS, AMI, AISHELL-4) is broken down into stage scripts, such as data preparation, VAD, x-vector extraction, overlap detection, and clustering, numbered in order as 010, 020, etc. These scripts must be run in that order.
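As a rough illustration of the numbered-stage convention, the sketch below lists the stage scripts in a recipe directory in the order they would be run. The file names (such as `010_prepare_data.sh`) and the helper `ordered_stage_scripts` are hypothetical and only assume that each stage script starts with its stage number followed by an underscore; the actual recipes may name their files differently.

```python
import re
import tempfile
from pathlib import Path

# Assumed (hypothetical) naming convention: "<stage number>_<description>",
# for example "010_prepare_data.sh".
STAGE_PATTERN = re.compile(r"^(\d+)_")


def ordered_stage_scripts(recipe_dir):
    """Return the stage scripts in recipe_dir, sorted by their numeric prefix."""
    stages = []
    for path in Path(recipe_dir).iterdir():
        match = STAGE_PATTERN.match(path.name)
        # Skip sub-directories and files without a numeric stage prefix.
        if path.is_file() and match:
            stages.append((int(match.group(1)), path))
    stages.sort(key=lambda item: item[0])
    return [path for _, path in stages]


if __name__ == "__main__":
    # Build a throwaway recipe directory with hypothetical stage names.
    with tempfile.TemporaryDirectory() as recipe_dir:
        for name in ["030_extract_xvectors.sh", "010_prepare_data.sh",
                     "020_vad.sh", "notes.txt"]:
            (Path(recipe_dir) / name).touch()
        for script in ordered_stage_scripts(recipe_dir):
            print(script.name)
```

Sorting on the integer value of the prefix, rather than on the file name as a string, keeps the order correct even if a recipe mixes prefixes of different widths.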
By default, the scripts submit commands through a job submission script (see the utils folder). To change this behaviour, please modify the corresponding file according to your job submission system.
- Voice activity detection (VAD) using Pyannote
Dataset | MS (%) | FA (%) | Total (%) |
LibriCSS | 0.9 | 1.2 | 2.1 |
AMI | 3.5 | 2.8 | 6.3 |
AISHELL-4 | 3.3 | 2.3 | 5.6 |
AliMeeting | 1.9 | 1.8 | 3.7 |
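To make the MS, FA, and Total columns concrete, here is a minimal sketch of how missed speech and false alarm rates can be computed from frame-level speech/non-speech labels. It assumes both rates are normalized by the total amount of reference speech, which is a common convention for detection error rates but is not confirmed to be the exact scoring used for the table above. The function name `vad_error_rates` and the toy labels are illustrative only.

```python
def vad_error_rates(reference, hypothesis):
    """Return (MS, FA, Total) in percent for 0/1 frame labels (1 = speech).

    Assumption: both rates are normalized by the number of reference
    speech frames.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    ref_speech = sum(1 for r in reference if r)
    if ref_speech == 0:
        raise ValueError("reference contains no speech frames")

    # Missed speech: reference says speech, hypothesis says non-speech.
    missed = sum(1 for r, h in zip(reference, hypothesis) if r and not h)
    # False alarm: hypothesis says speech, reference says non-speech.
    false_alarm = sum(1 for r, h in zip(reference, hypothesis) if h and not r)

    ms = 100.0 * missed / ref_speech
    fa = 100.0 * false_alarm / ref_speech
    return ms, fa, ms + fa


if __name__ == "__main__":
    # Toy example: 10 speech frames followed by 10 non-speech frames.
    reference = [1] * 10 + [0] * 10
    # The hypothesis misses 1 speech frame and adds 2 false speech frames.
    hypothesis = [1] * 9 + [0] + [1] * 2 + [0] * 8
    ms, fa, total = vad_error_rates(reference, hypothesis)
    print(f"MS={ms:.1f}  FA={fa:.1f}  Total={total:.1f}")
```

In each row of the table, Total is simply the sum of MS and FA.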
- Speaker diarization (using above VAD)
The following results are evaluated using the spyder package, without ignoring overlaps and with a 0.0 collar unless noted otherwise. All values are percentages.
- LibriCSS
Method | MS | FA | Conf. | DER | RTTM |
VBx | 10.37 | 1.19 | 2.96 | 14.52 | link |
VBx + OVL | 3.39 | 2.31 | 5.55 | 11.25 | link |
Spectral | 10.37 | 1.19 | 3.37 | 14.93 | link |
Spectral + OVL | 3.79 | 2.22 | 5.33 | 11.34 | link |
- AMI
Method | MS | FA | Conf. | DER | RTTM |
VBx | 18.15 | 3.24 | 4.83 | 26.22 | link |
VBx + OVL | 9.04 | 7.70 | 8.31 | 25.05 | link |
Spectral | 18.15 | 3.24 | 4.14 | 25.53 | link |
Spectral + OVL | 9.63 | 7.39 | 6.67 | 23.69 | link |
- AISHELL-4
Method | MS | FA | Conf. | DER | RTTM |
VBx | 8.27 | 2.80 | 6.94 | 18.01 | link |
VBx + OVL | 5.78 | 7.88 | 7.96 | 21.62 | link |
Spectral | 8.27 | 2.80 | 5.06 | 16.13 | link |
Spectral + OVL | 5.90 | 7.85 | 5.94 | 19.69 | link |
- AliMeeting (collar=0.25)
Method | MS | FA | Conf. | DER | RTTM |
VBx | 13.60 | 0.17 | 3.01 | 16.78 | link |
VBx + OVL | 5.87 | 2.82 | 5.36 | 14.05 | link |
Spectral | 13.60 | 0.17 | 2.60 | 16.37 | link |
Spectral + OVL | 5.96 | 2.83 | 5.64 | 14.43 | link |
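In every diarization table above, the DER column equals the sum of the missed speech (MS), false alarm (FA), and speaker confusion (Conf.) columns. The short check below confirms this for the AliMeeting rows; the helper name `der_from_components` is illustrative, and the numbers are copied from the table.

```python
import math

# (MS, FA, Conf., DER) copied from the AliMeeting table above.
ALIMEETING = {
    "VBx": (13.60, 0.17, 3.01, 16.78),
    "VBx + OVL": (5.87, 2.82, 5.36, 14.05),
    "Spectral": (13.60, 0.17, 2.60, 16.37),
    "Spectral + OVL": (5.96, 2.83, 5.64, 14.43),
}


def der_from_components(missed_speech, false_alarm, confusion):
    """DER as the sum of its three error components (all in percent)."""
    return missed_speech + false_alarm + confusion


for method, (ms, fa, conf, der) in ALIMEETING.items():
    total = der_from_components(ms, fa, conf)
    # Allow for floating-point rounding in the two-decimal table values.
    assert math.isclose(total, der, abs_tol=0.005), method
    print(f"{method}: {ms:.2f} + {fa:.2f} + {conf:.2f} = {total:.2f}")
```

This also shows the trade-off in the "+ OVL" rows: overlap handling lowers missed speech but raises false alarms and, in most rows, confusion, so the net DER change depends on the dataset.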
- Datasets
Chen, Zhuo et al. “Continuous Speech Separation: Dataset and Analysis.” ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020): 7284-7288.
McCowan, Iain et al. “The AMI meeting corpus.” (2005).
Fu, Yihui et al. “AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario.” ArXiv abs/2104.03603 (2021).
Yu, Fan et al. “M2MeT: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge.” ArXiv abs/2110.07393 (2021).
- VAD and Overlap detection
Bredin, Hervé et al. “Pyannote.Audio: Neural Building Blocks for Speaker Diarization.” ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020): 7124-7128.
Bredin, Hervé and Antoine Laurent. “End-to-end speaker segmentation for overlap-aware resegmentation.” ArXiv abs/2104.04045 (2021).
- VBx
Landini, Federico et al. “Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks.” ArXiv abs/2012.14952 (2020).
- VBx with overlaps
Bullock, Latané et al. “Overlap-Aware Diarization: Resegmentation Using Neural End-to-End Overlapped Speech Detection.” ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020): 7114-7118.
- Spectral clustering
Park, Tae Jin et al. “Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap.” IEEE Signal Processing Letters 27 (2020): 381-385.
- Overlap-aware spectral clustering
Raj, Desh et al. “Multi-Class Spectral Clustering with Overlaps for Speaker Diarization.” 2021 IEEE Spoken Language Technology Workshop (SLT) (2021): 582-589.
Raj, Desh et al. “GPU-accelerated Guided Source Separation for Meeting Transcription.” Interspeech (2023).