This package is intended to provide a relatively simple baseline for the MediaEval 2018 AcousticBrainz Genre Task. Participants are asked to automatically classify tracks into genres based on pre-computed content-based features. Multiple ground-truths from several sources, with different but overlapping label namespaces, are provided.
The task consists of two subtasks:
- Single-source Classification
- Multi-source Classification
For more details please see this page.
Instead of trying to learn genres from all features available in AcousticBrainz JSON files (sample), we only pay attention to the low-level melband features. Each of the 9 different melband features consists of 40 values, one for each Mel frequency band. Mel-based features have a long history in automatic genre recognition and are usually associated with timbre. For example, G. Tzanetakis used MFCCs as one of the timbral features in [1].
We treat the 9 melband features as a one-dimensional image with 9 channels, i.e. `mean` with its 40 values is one channel, `dmean` another, and so forth. This preserves both the spatial relationship between bands and the distinction between the different kinds of features. We present these features to our neural network in the shape of a `(40, 9)`-tensor (size, channel).
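For illustration, the following sketch shows how such a tensor could be assembled from an AcousticBrainz low-level JSON file. The statistic names in `MELBAND_STATS` and the JSON layout are assumptions, not taken from this repository's code:

```python
import json
import numpy as np

# Assumed names of the 9 summary statistics of the melbands descriptor
# in the AcousticBrainz low-level JSON; adjust if your files differ.
MELBAND_STATS = ["mean", "dmean", "dmean2", "var", "dvar", "dvar2",
                 "median", "min", "max"]

def melband_tensor(json_file):
    """Stack the 9 melband statistics (40 values each) into a (40, 9) array."""
    with open(json_file) as f:
        features = json.load(f)
    melbands = features["lowlevel"]["melbands"]
    # Each statistic becomes one channel of the one-dimensional "image".
    channels = [np.asarray(melbands[stat], dtype=np.float32) for stat in MELBAND_STATS]
    return np.stack(channels, axis=-1)  # shape: (40, 9)
```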
Because our feature tensor has spatial relationships, it makes sense to use a convolutional neural network (CNN) architecture. In fact, we are using a fully convolutional network (FCN), i.e. we do not employ fully connected layers for classification, but global pooling.
The network we use is specified in `network.py`. Because the genre classification task is multi-class and multi-label, we use sigmoids as final activation functions and binary crossentropy as the loss function. For regularization we use both dropout and early stopping. For optimization during learning we use Adam.
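As a rough illustration of this setup, here is a minimal, hypothetical FCN in the Keras API. It is not the architecture from `network.py`; layer counts and sizes are made up:

```python
from tensorflow.keras import callbacks, layers, models, optimizers

def build_fcn(num_labels, input_shape=(40, 9)):
    """Illustrative fully convolutional classifier for (40, 9) melband tensors."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv1D(64, 3, padding="same", activation="relu")(inputs)
    x = layers.Dropout(0.3)(x)
    x = layers.Conv1D(128, 3, padding="same", activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    # No fully connected classification layers: a 1x1 convolution produces one
    # feature map per label, which is then globally pooled.
    x = layers.Conv1D(num_labels, 1, padding="same")(x)
    x = layers.GlobalAveragePooling1D()(x)
    # Sigmoid outputs: multi-label, one independent probability per genre label.
    outputs = layers.Activation("sigmoid")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer=optimizers.Adam(), loss="binary_crossentropy")
    return model

# Early stopping on the validation loss complements dropout for regularization.
early_stopping = callbacks.EarlyStopping(patience=10, restore_best_weights=True)
```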
After training, the model is capable of predicting real values between 0 and 1 for each label. In order to map each of these values to `true` or `false`, we have to define thresholds. To ensure a performance that emphasizes neither precision nor recall, we calculate the F1 score (the harmonic mean of precision and recall) for each possible threshold for predictions on the validation set and pick the threshold for `max(F1)`. This is done for each label individually. The found thresholds are then used when predicting labels for the test set. This F1 maximization procedure is also known as the plug-in rule approach (as opposed to structured loss minimization).
Note that should no class-prediction be greater than its threshold, we normalize predictions using their thresholds and then simply pick the largest one. This is done to ensure at least one prediction per track, as required by the provided `check.R` evaluation script.
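A sketch of this thresholding step, assuming the sigmoid outputs and the binary validation ground-truth are available as NumPy matrices (names and shapes are illustrative, not taken from this repository):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_thresholds(y_true, y_scores):
    """Pick, per label, the threshold that maximizes F1 on the validation set.
    y_true: binary matrix (n_samples, n_labels); y_scores: sigmoid outputs."""
    thresholds = np.zeros(y_scores.shape[1])
    for label in range(y_scores.shape[1]):
        candidates = np.unique(y_scores[:, label])
        f1s = [f1_score(y_true[:, label], y_scores[:, label] >= t) for t in candidates]
        thresholds[label] = candidates[int(np.argmax(f1s))]
    return thresholds

def apply_thresholds(y_scores, thresholds):
    """Binarize predictions; if no label clears its threshold, fall back to the
    label with the largest threshold-normalized score (at least one per track)."""
    y_pred = (y_scores >= thresholds).astype(int)
    empty = y_pred.sum(axis=1) == 0
    if empty.any():
        normalized = y_scores[empty] / np.maximum(thresholds, 1e-8)
        y_pred[empty, np.argmax(normalized, axis=1)] = 1
    return y_pred
```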
The system described above works nicely when the training ground-truth, the validation dataset and the test dataset all share the same label space, i.e. all datasets refer to genres in the same way.
Subtask 2 allows training on multiple datasets (each with its own label space). At prediction time, it is required to "speak the language" of only one of the training datasets (the target ground-truth), i.e. to predict for a specific label space (set of genre/label names).
In order to exploit overlaps between these label spaces, we normalize all genre names before creating the combined training ground-truth. The applied normalization is very simple: it merely removes any non-letter characters, including whitespace, from the genre names.
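A minimal sketch of such a normalization function (the regular expression is an assumption about how non-letters and whitespace are removed):

```python
import re

def normalize_genre(name):
    """Remove everything that is not a letter; this also drops whitespace,
    so e.g. 'hip-hop' and 'hip hop' both become 'hiphop'."""
    return re.sub(r"[^a-zA-Z]", "", name)
```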
Because we calculate thresholds based on validation data, we require validation datasets for each of the training datasets.
At prediction time we simply drop any labels that do not occur in the desired target ground-truth and undo the normalization.
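Sketched with the hypothetical `normalize_genre` helper from above, mapping predictions back into the target label space could look like this:

```python
def to_target_labels(predicted_normalized, target_genres):
    """Keep only predictions that exist in the target ground-truth and
    restore the original (un-normalized) genre spelling."""
    denormalize = {normalize_genre(genre): genre for genre in target_genres}
    return [denormalize[label] for label in predicted_normalized if label in denormalize]
```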
Predictions and trained models can be found in the `results` folder.
- Create a clean Python 3.5 environment (e.g. using miniconda)
- (install dependencies)
- Clone this repo and run `setup.py install`:

```bash
git clone https://github.com/hendriks73/melbaseline.git
cd melbaseline
python setup.py install
```
After installation, you should be able to run `extractmelfeatures` from the command line in order to extract the melband features from your JSON files. Replace `INPUT_BASE_FOLDER` with the folder into which you extracted the AcousticBrainz JSON feature files.
Once you have created `mel_features.joblib` files for all dataset parts, you can start training on them. Again, the filenames in the sample above are just placeholders. The `...tsv.bz2` files represent genre annotation files as made available by the task organizers here. The `...mel_features.joblib` files are the files created by the `extractmelfeatures` script based on the data also made available here.
Note that for subtask 2 (train on multiple datasets, predict for a single dataset), you can specify multiple datasets for training, validation and test-prediction like this:
Note that when providing multiple ground-truth datasets, you should specify them in the same order for all parameters, so that predictions are calibrated on the correct validation ground-truth and for the right label namespace.
Instead of executing the training and prediction process locally, you can also use Google Cloud Machine Learning Engine. To do so, you have to:

- Sign up and install the Google Cloud SDK.
- Upload the task ground-truths and `mel_features.joblib` files to Google Storage.
- Edit the provided script `trainandpredict_ml_engine.sh` to reflect your naming choices.
- Run `trainandpredict_ml_engine.sh`.
Source code and models can be licensed under the GNU AFFERO GENERAL PUBLIC LICENSE v3. For details, please see the LICENSE file.
If you use this project in your work, please consider citing this publication:
@inproceedings{
Title = {Media{E}val 2018 Acoustic{B}rainz Genre Task: A {CNN} Baseline Relying on Mel-Features},
Author = {Schreiber, Hendrik},
Booktitle = {Proceedings of the Media{E}val 2018 Multimedia Benchmark Workshop},
Month = {10},
Year = {2018},
Address = {Sophia Antipolis, France}
}
[1] George Tzanetakis and Perry Cook, "Musical Genre Classification of Audio Signals," IEEE Transactions on Speech and Audio Processing, 10(5):293-302, 2002.