2018:Music/Speech Classification and Detection


Description

The need for music-speech classification is evident in many audio processing tasks which relate to real-life materials such as archives of field recordings, broadcasts and any other contexts which are likely to involve speech and music, concurrent or alternating. Segregating the signal into speech and music segments is an obvious first step before applying speech-specific or music-specific algorithms.

Indeed, speech-music classification has received considerable attention from the research community (for a partial list, see references below) but many of the published algorithms are dataset-specific and are not directly comparable due to non-standardised evaluation.

This MIREX task aims to fill that gap and consists of classification (a binary decision per clip) and detection (identifying potentially overlapping segments of different classes) as two separate sub-tasks, with submissions welcomed to either or both.

Music/Speech Classification

Speech-music classification is here defined as the binary problem of classifying pre-segmented audio data as either speech or music. We impose the restriction that each segment is 30 s long and contains either speech or music; mixed (speech over music) segments are not allowed.


Dataset

All submitted methods will be trained and tested on a dataset consisting of 50 hours of recordings, whose detailed description will be made available to all contestants after the competition has been completed. The following is a non-exclusive list of audio data sources to be used: commercial CDs (both compressed and uncompressed), YouTube audio data and radio broadcasts recorded over the Internet. Each audio file will be 30 s long. The dataset will be augmented with spectrally distorted and/or time-stretched versions of randomly selected recordings (mild distortions).
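For orientation only, the following sketch illustrates the kind of mild time-stretch distortion mentioned above, assuming the librosa and soundfile Python packages; the actual augmentation pipeline used to build the evaluation dataset is not published and may differ, and the file paths shown are hypothetical.

import librosa
import soundfile as sf

def time_stretch_copy(in_path, out_path, rate=1.05):
    # Load at the original sample rate, downmixed to mono
    y, sr = librosa.load(in_path, sr=None, mono=True)
    # Mildly time-stretch the signal (rate > 1 shortens the clip, rate < 1 lengthens it)
    y_stretched = librosa.effects.time_stretch(y, rate=rate)
    sf.write(out_path, y_stretched, sr)

# Hypothetical usage: a ~5% speed-up of one recording
time_stretch_copy("/path/to/track1.wav", "/path/to/track1_stretched.wav", rate=1.05)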

As an example dataset for development purposes we recommend the GTZAN Music/Speech Dataset (G. Tzanetakis and P. Cook, “MARSYAS: A framework for audio analysis,” Organised Sound, vol. 4(3), 2000).

(http://marsyasweb.appspot.com/download/data_sets/)

Evaluation

Participating algorithms will be evaluated with 3-fold cross-validation. The raw classification (identification) accuracy, its standard deviation and a confusion matrix will be computed for each algorithm. Classification accuracies will be tested for statistically significant differences using Friedman's ANOVA with Tukey-Kramer honestly significant difference (HSD) tests for multiple comparisons. This test will be used to rank the algorithms and to group them into sets of equivalent performance. In addition, computation times for feature extraction and training/classification will be measured.
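As a rough illustration (not the official evaluation code), per-fold accuracy and confusion matrices over 3-fold cross-validation could be computed along these lines with scikit-learn; the feature representation and the classifier factory are placeholders supplied by the caller.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate_3fold(features, labels, make_classifier):
    # features: (n_clips, n_features) array; labels: array of 'm'/'s' strings;
    # make_classifier: factory returning an unfitted scikit-learn estimator.
    accuracies = []
    folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(features, labels):
        clf = make_classifier()
        clf.fit(features[train_idx], labels[train_idx])
        predicted = clf.predict(features[test_idx])
        accuracies.append(accuracy_score(labels[test_idx], predicted))
        print(confusion_matrix(labels[test_idx], predicted, labels=['m', 's']))
    # Mean accuracy and its standard deviation across the three folds
    return float(np.mean(accuracies)), float(np.std(accuracies))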

Submission Format

Audio Format

The audio files for both tasks are encoded as 44.1kHz / 16 bit WAV files.

Command line calling format

Please follow instructions on train/test tasks (see “Submission calling formats” at https://www.music-ir.org/mirex/wiki/2015:Audio_Classification_%28Train/Test%29_Tasks )

I/O format

The audio files to be used in these tasks will be specified in a simple ASCII list file. The formats for the list files are specified below:

Feature extraction list file

The list file passed for feature extraction will be a simple ASCII list file. This file will contain one path per line with no header line. I.e.

<example path and filename>


E.g. :

/path/to/track1.wav 
/path/to/track2.wav 
... 

Training list file

The list file passed for model training will be a simple ASCII list file. This file will contain one path per line, followed by a tab character and the class (music or speech) label, again with no header line.

I.e.

<example path and filename>\t<class label>


E.g. :

/path/to/track1.wav m 
/path/to/track2.wav s 
... 
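For illustration, a submission might parse the training list (and, with the label column absent, the feature extraction and test lists) roughly as follows; the function name is hypothetical.

def read_training_list(list_path):
    # Parse a tab-separated <path>\t<label> list file with no header line.
    paths, labels = [], []
    with open(list_path, 'r', encoding='ascii') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue
            path, label = line.split('\t')
            paths.append(path)
            labels.append(label)  # 'm' (music) or 's' (speech)
    return paths, labels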

Test (classification) list file

The list file passed for testing classification will be a simple ASCII list file identical in format to the Feature extraction list file. This file will contain one path per line with no header line.

I.e.

<example path and filename>

E.g. :

/path/to/track1.wav 
/path/to/track2.wav 
... 

Classification output file

Participating algorithms should produce a simple ASCII list file identical in format to the Training list file. This file will contain one path per line, followed by a tab character and the class label, again with no header line.

I.e.

<example path and filename>\t<class label>

E.g. :

/path/to/track1.wav	m 
/path/to/track2.wav	s 
... 
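A minimal sketch of writing the classification output in this format, assuming a Python dict that maps each test file path to its predicted 'm'/'s' label; the helper name and paths are hypothetical.

def write_classification_output(predictions, out_path):
    # Write <path>\t<label> lines, one per test file, with no header line.
    with open(out_path, 'w', encoding='ascii') as f:
        for path, label in predictions.items():
            f.write(f"{path}\t{label}\n")

write_classification_output({"/path/to/track1.wav": "m",
                             "/path/to/track2.wav": "s"},
                            "/path/to/output.txt")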

Packaging submissions

Please follow instructions on train/test tasks (see “Packaging submissions” at https://www.music-ir.org/mirex/wiki/2015:Audio_Classification_%28Train/Test%29_Tasks )

Potential Participants

To all potential participants: please add your name/email

Thomas Lidy, lidy (AT) ifs.tuwien.ac.at

Jimena Royo-Letelier, jroyo (AT) deezer.com

Jan Schlüter, jan.schlueter ... ofai.at


Music/Speech Detection

The detection task consists of finding segments of music and speech in a signal, i.e. finding segment boundaries and classifying each segment as music or speech. This task applies to complete recordings from archives, which are typically at least several minutes long and contain multiple segments. No assumptions are made about the relationships between different segments within a recording, i.e. speech and music segments may overlap.


Dataset

Example training material is available at the following download link:

http://mirg.city.ac.uk/datasets/muspeak/muspeak-mirex2015-detection-examples.zip

(NB the improved example annotations for two of the files - theconcert2.csv and UTMA-26.csv - are now included in the zip file)

Submissions will be evaluated by us on a dataset of 70 recordings taken from the British Library’s World & Traditional Music collections, which contain both music and speech segments. The recordings vary significantly in length, from a minimum of 21 seconds to a maximum of 1563 seconds. Due to copyright restrictions, evaluation will take place on a server located within the British Library.

Evaluation

Ground truth for the detection dataset was obtained by manual annotation carried out by the MuSpeak team at City University London. The annotation guidelines allowed a 500 ms tolerance, and each annotation was checked by at least one person other than the annotator.

Frame level evaluation:

Frame-based evaluation will be carried out on 10 ms frames. Precision (the proportion of detected frames of a class that are correct), Recall (the proportion of ground-truth frames of a class that are detected), and F-measure will be reported.
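As a sketch only (not the official scoring script), frame-level precision, recall and F-measure could be computed per class along the following lines, assuming segment lists of (onset, duration, class) tuples in seconds; the function names are hypothetical.

import numpy as np

def frame_activity(segments, cls, total_dur, hop=0.01):
    # Boolean activity vector for one class over 10 ms frames.
    n = int(np.ceil(total_dur / hop))
    active = np.zeros(n, dtype=bool)
    for onset, duration, c in segments:
        if c == cls:
            start = int(round(onset / hop))
            end = int(round((onset + duration) / hop))
            active[start:end] = True
    return active

def frame_prf(reference, estimate, cls, total_dur):
    # Frame-level precision, recall and F-measure for one class.
    ref = frame_activity(reference, cls, total_dur)
    est = frame_activity(estimate, cls, total_dur)
    tp = np.sum(ref & est)
    precision = tp / max(np.sum(est), 1)
    recall = tp / max(np.sum(ref), 1)
    f_measure = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f_measure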

Event level evaluation:

Events will be evaluated on an onset-only basis as well as an onset-offset basis, again using Precision, Recall, and F-measure. In the former, a ground truth segment is considered correctly detected if the system identifies the right class (music/speech) AND the detected segment's onset is within a 1000 ms range (+/- 500 ms) of the onset of the ground truth segment. In the latter (onset-offset), a ground truth segment is considered correctly detected if the system identifies the right class (music/speech), AND the detected segment's onset is within +/- 500 ms of the onset of the ground truth segment, AND the detected segment's offset is EITHER within +/- 500 ms of the offset of the ground truth segment OR within 20% of the ground truth segment's length.


Results will also be reported using smaller onset/offset tolerances (+/- 100 ms and +/- 200 ms). Different statistics can also be reported if agreed by the participants.
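The onset-offset matching criterion described above might be implemented roughly as follows; this is an illustrative sketch with a hypothetical function name, and the official procedure (e.g. how one-to-one correspondences between detected and ground-truth segments are established) may differ.

def segment_matches(ref, est, onset_tol=0.5):
    # ref, est: (onset, duration, class) tuples, times in seconds.
    # A detection counts if the class matches, the onset is within +/- onset_tol,
    # and the offset is within +/- onset_tol OR within 20% of the ground-truth length.
    ref_on, ref_dur, ref_cls = ref
    est_on, est_dur, est_cls = est
    if ref_cls != est_cls:
        return False
    if abs(est_on - ref_on) > onset_tol:
        return False
    offset_tol = max(onset_tol, 0.2 * ref_dur)
    return abs((est_on + est_dur) - (ref_on + ref_dur)) <= offset_tol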

Submission Format

Audio Format

The audio files for both tasks are encoded as 44.1kHz / 16 bit WAV files.


Command line calling format

Submissions must conform to the format specified below:

doMusicSpeechDetection path/to/file.wav  path/to/output/file.musp 


where:

  • path/to/file.wav: Path to the input audio file.
  • path/to/output/file.musp: The output file.

Programs can use their working directory if they need to keep temporary cache files or internal debugging info. Stdout and stderr will be logged.

I/O format

For each row, the file should contain the onset (seconds), the duration (seconds), and the class (music or speech, represented by lower-case 'm' or 's') of each detected segment, separated by tabs and ordered by onset time:

onset duration CLASS 
onset duration CLASS 
...	... ...

which might look like:

0.68	56.436	m 
0.72	100.2	s 
165.95	510.24	m 
...	... ...

(note that segments may overlap)
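A minimal sketch of writing the detection output in the format above; the helper name is hypothetical, and the segment list is assumed to hold (onset, duration, class) tuples in seconds.

def write_detection_output(segments, out_path):
    # Write tab-separated "onset<TAB>duration<TAB>class" lines, ordered by onset.
    with open(out_path, 'w', encoding='ascii') as f:
        for onset, duration, cls in sorted(segments, key=lambda s: s[0]):
            f.write(f"{onset:.2f}\t{duration:.2f}\t{cls}\n")

write_detection_output([(0.68, 56.436, 'm'), (0.72, 100.2, 's')],
                       "path/to/output/file.musp")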

Packaging submissions

All submissions should be statically linked to all libraries (the presence of dynamically linked libraries cannot be guaranteed). All submissions should include a README file with the following information:

  • Command line calling format for all executables and an example formatted set of commands
  • Number of threads/cores used or whether this should be specified on the command line
  • Expected memory footprint
  • Expected runtime
  • Any required environments (and versions), e.g. python, java, bash, matlab.

Potential Participants

name/email

Jan Schlüter, jan.schlueter ... ofai.at


Time and hardware limits (for both classification and detection)

Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions are specified. A hard limit of 72 hours will be imposed on runs. Submissions that exceed this runtime may not receive a result.

Submission closing date (for both classification and detection)

Task specific mailing list (for both classification and detection)

All discussions on this task will take place on the MIREX "EvalFest" list. If you have a question or comment, simply include the task name in the subject heading.