2018:Music and/or Speech Detection

From MIREX Wiki

Revision as of 01:15, 14 July 2018

Description

The need for music and/or speech detection is evident in many audio processing tasks which relate to real-life materials such as archives of field recordings, broadcasts and any other contexts which are likely to involve speech and music, concurrent or alternating. Segregating the signal into speech and music segments is an obvious first step before applying speech-specific or music-specific algorithms.

Indeed, music and/or speech detection has received considerable attention from the research community (for a partial list, see references below) but many of the published algorithms are dataset-specific and are not directly comparable due to non-standardised evaluation.

This MIREX task aims to fill that gap and consists of four sub-tasks: Music Detection, Speech Detection, Music and Speech Detection, and Music/Background-music/No-music Segmentation. Submissions are welcome to one or more of them.

Tasks

Music Detection

The music detection task consists in finding segments of music in a signal. This task applies to complete recordings from archives. No assumptions are made about the number of segments present in each archive or about their duration.

classes: music

Speech Detection

The speech detection task consists in finding segments of speech in a signal. This task applies to complete recordings from archives. No assumptions are made about the number of segments present in each archive or about their duration.

classes: speech

Music and Speech Detection

The music and speech detection sub-task is a combination of the previous two sub-tasks, i.e., the submitted algorithms will have to find segments of music and speech. No assumptions are made about the number of segments present in each archive or about their duration. Moreover, they might overlap in time.

classes: music, speech

Music/Background-music/No-music Segmentation

The music/background-music/no-music segmentation task consists in dividing the whole audio into segments, each belonging to one of three classes: foreground music, background music or no music. This task applies to complete recordings from archives. No assumptions are made about the number of segments present in each archive or about their duration.

classes: fg-music, bg-music, no-music

Datasets

Available Training Datasets

These resources may be a good starting point for participants.

GTZAN Speech and Music Dataset http://opihi.cs.uvic.ca/sound/music_speech.tar.gz

Scheirer & Slaney Music Speech Corpus http://www.ee.columbia.edu/~dpwe/sounds/musp/scheislan.html

MUSAN Corpus http://www.openslr.org/17/

Muspeak Speech and Music Detection Dataset http://mirg.city.ac.uk/datasets/muspeak/muspeak-mirex2015-detection-examples.zip

Music detection dataset: www.seyerlehner.info/download/music_detection_dataset_dafx_07.zip (Ask the author for the password)


Evaluation Dataset

Content

Evaluation dataset 1: it consists of 27 hours of audio from 8 different TV program types from France, Germany, Spain and the United Kingdom. It includes 1647 one-minute files sampled at 22,050 or 44,100 Hz with 16 bits per sample. We split this dataset into a training dataset (30 %) and a testing dataset (70 %). To keep the two from being too similar, audio files from the same country and program type appear in only one of them.
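The split constraint above can be illustrated with a short Python sketch. It is not the organisers' procedure, which the page does not describe. It assumes each file comes with hypothetical (name, country, program type) metadata, and it assigns whole (country, program type) groups to the training side until roughly 30 % of the files are covered, so the final proportion is only approximate.

```python
import random
from collections import defaultdict


def group_split(files, train_fraction=0.3, seed=0):
    """Split files so that no (country, program_type) group is shared.

    `files` is a list of (name, country, program_type) tuples; this
    metadata layout is an assumption made for the example.
    """
    groups = defaultdict(list)
    for name, country, program_type in files:
        groups[(country, program_type)].append(name)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)  # reproducible group order

    target = train_fraction * len(files)
    train, test = [], []
    for key in keys:
        # Whole groups go to one side only, so the fraction is approximate.
        if len(train) < target:
            train.extend(groups[key])
        else:
            test.extend(groups[key])
    return train, test
```

Because groups are never divided, a file from a given country and program type can never appear on both sides, at the cost of not hitting the 30/70 proportion exactly.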

Evaluation dataset 2: it consists of 10 hours of audio from French TV and radio programs, provided by INA (French National Institute of Audiovisual). It includes archives collected from 1950 up to now. The files were sampled at 16,000 Hz with 16 bits per sample. The whole dataset will be used for evaluation. It is intended for use with pretrained models only.

Annotation

Evaluation dataset 1: it was annotated by a single annotator using BAT. A percentage of the annotations has been manually reviewed. The classes included in the ground truth are: foreground music, background music and no music. We defined foreground music as music that is louder than all the other existing simultaneous sounds.

Evaluation dataset 2: it has been manually annotated using Transcriber. The corpus annotation was carried out in the framework of the European-funded project MeMAD. It contains two possibly overlapping classes: speech and music.

Evaluation

In the literature we find two ways of measuring the performance of an algorithm depending on the way we compare the ground truth with an algorithm's estimation: the frame-level evaluation and the event-level evaluation. We will report the statistics for each of these evaluations by file and for the whole dataset. We will do that for each algorithm and dataset.

Frame-level evaluation:

In the frame-level evaluation, we compare the estimation (est) produced by the algorithms with the ground truth (gt) in frames of 10 ms. We first compute the intermediate statistics for each class C, which include:

  • True Positives (TPc): gt frame’s class = C & est frame’s class = C
  • False Positives (FPc): gt frame’s class != C & est frame’s class = C
  • True Negatives (TNc): gt frame’s class != C & est frame’s class != C
  • False Negatives (FNc): gt frame’s class = C & est frame’s class != C

Then we report class-wise Precision, Recall and F-measure.

  • Precision (Pc) = TPc / (TPc + FPc)
  • Recall (Rc) = TPc / (TPc + FNc)
  • F-measure = 2 * Pc * Rc / (Pc + Rc)

As well as the overall Accuracy:

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)

Where:

  • TP = sum(TPc), for every class c
  • FP = sum(FPc), for every class c
  • TN = sum(TNc), for every class c
  • FN = sum(FNc), for every class c
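The frame-level statistics above can be sketched in Python as follows. This is a minimal illustration, not the official evaluation code. It assumes segments are (onset, duration, class) tuples in seconds, that frame boundaries are obtained by rounding to the nearest 10 ms, and that a ratio with a zero denominator is reported as 0.0. None of these conventions is specified on this page, and the function names are illustrative.

```python
FRAME = 0.01  # frame length in seconds (10 ms, as stated above)


def to_frames(segments, label, n_frames):
    """Return one boolean per frame: True where `label` is active."""
    active = [False] * n_frames
    for onset, duration, cls in segments:
        if cls != label:
            continue
        start = int(round(onset / FRAME))
        end = int(round((onset + duration) / FRAME))
        for i in range(max(start, 0), min(end, n_frames)):
            active[i] = True
    return active


def frame_counts(gt, est, label, total_duration):
    """Return (TPc, FPc, TNc, FNc) for one class over one file."""
    n_frames = int(round(total_duration / FRAME))
    g = to_frames(gt, label, n_frames)
    e = to_frames(est, label, n_frames)
    tp = sum(a and b for a, b in zip(g, e))
    fp = sum((not a) and b for a, b in zip(g, e))
    tn = sum((not a) and (not b) for a, b in zip(g, e))
    fn = sum(a and (not b) for a, b in zip(g, e))
    return tp, fp, tn, fn


def prf(tp, fp, fn):
    """Class-wise Precision, Recall and F-measure (0.0 if undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def accuracy(per_class_counts):
    """Overall accuracy from a list of (TPc, FPc, TNc, FNc) tuples."""
    tp = sum(c[0] for c in per_class_counts)
    fp = sum(c[1] for c in per_class_counts)
    tn = sum(c[2] for c in per_class_counts)
    fn = sum(c[3] for c in per_class_counts)
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


# Example: a 2 s file whose single music segment is estimated 0.5 s late.
gt = [(0.0, 1.0, "music")]
est = [(0.5, 1.0, "music")]
counts = frame_counts(gt, est, "music", 2.0)  # (50, 50, 50, 50)
print(prf(counts[0], counts[1], counts[3]), accuracy([counts]))
```

Treating each class independently, as done here, also covers the music and speech sub-task, where the two classes may be active in the same frame.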

Event-level evaluation:

In the event-level evaluation, we compare the estimation (est) produced by the algorithms with the ground truth (gt) in terms of onsets and offsets. We first compute the intermediate statistics for the onsets and offsets of each class C, which include:

  • True Positives (TPc): an est onset (offset) of class = C appears in a tolerance time-window around a gt onset (offset) of class = C.
  • False Positives (FPc): an est onset (offset) of class = C appears outside of the tolerance time-window of any gt onset (offset) of class = C.
  • False Negatives (FNc): no est onset (offset) of class = C appears inside of the tolerance time-window of a gt onset (offset) of class = C.

Then we report class-wise Precision, Recall and F-measure.

  • Precision (Pc) = TPc / (TPc + FPc)
  • Recall (Rc) = TPc / (TPc + FNc)
  • F-measure = 2 * Pc * Rc / (Pc + Rc)

As well as the overall Accuracy:

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)

Where:

  • TP = sum(TPc), for every class c
  • FP = sum(FPc), for every class c
  • TN = sum(TNc), for every class c
  • FN = sum(FNc), for every class c

Different tolerance time-windows will be used: +/- 500 ms, +/- 200 ms and +/- 100 ms.
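The onset (or offset) matching described in this section can be sketched as below. The page does not say how to handle several estimates falling inside the same window, so this sketch assumes a one-to-one matching in time order: each ground-truth event can be matched by at most one estimated event. That rule, the function names and the example times are all assumptions. The definitions above give no true negatives for events, so the sketch stops at Precision, Recall and F-measure.

```python
def match_events(gt_times, est_times, tolerance):
    """Count (TP, FP, FN) for event times of a single class.

    One-to-one matching in time order within +/- `tolerance` seconds
    (an assumed rule; the task page does not specify one).
    """
    gt_times = sorted(gt_times)
    est_times = sorted(est_times)
    tp = 0
    i = j = 0
    while i < len(gt_times) and j < len(est_times):
        diff = est_times[j] - gt_times[i]
        if abs(diff) <= tolerance:
            tp += 1          # est event falls inside the gt window
            i += 1
            j += 1
        elif diff < 0:
            j += 1           # est event too early: left unmatched (FP)
        else:
            i += 1           # gt event has no est event nearby (FN)
    fp = len(est_times) - tp
    fn = len(gt_times) - tp
    return tp, fp, fn


def prf(tp, fp, fn):
    """Precision, Recall and F-measure (0.0 if a denominator is zero)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Onsets of one class, in seconds; offsets would be onset + duration.
gt_onsets = [1.0, 5.0, 9.0]
est_onsets = [1.3, 5.6, 12.0]
for tolerance in (0.5, 0.2, 0.1):
    tp, fp, fn = match_events(gt_onsets, est_onsets, tolerance)
    print(tolerance, (tp, fp, fn), prf(tp, fp, fn))
```

With these example times only the first onset is matched at +/- 500 ms, and none is matched at the two tighter tolerances.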

Other evaluated features

The execution time of each algorithm will also be reported.

Submission Format

Command line calling format

Submissions have to conform to the format specified below:

Music Detection: doMusicDetection path/to/file.wav path/to/output/file.mud

Speech Detection: doSpeechDetection path/to/file.wav path/to/output/file.spd

Music and Speech Detection: doMusicAndSpeechDetection path/to/file.wav path/to/output/file.muspd

Fg.-Music/Bg.-Music/No-music segmentation: doFgBgNoMusicSegmentation path/to/file.wav path/to/output/file.fbns


where:

  • path/to/file.wav: Path to the input audio file.
  • path/to/output/file.*: The output file.

Programs can use their working directory if they need to keep temporary cache files or internal debugging info. Stdout and stderr will be logged.

I/O format

For each detected segment, the file should include a row containing the onset (seconds), the duration (seconds) and the class, separated by tabs. Rows should be ordered by onset time:

onset1    	duration    	class
onset2    	duration    	class
...  ... 	...

(note that events in the case of music and speech detection can overlap)
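As an illustration of this output format, the following Python sketch writes and reads tab-separated rows ordered by onset. The three-decimal precision and the label strings "music" and "speech" are assumptions made for the example. The page lists class names per task but does not fix the exact spelling or number formatting expected in the file.

```python
def format_segments(segments):
    """Render (onset, duration, class) tuples as tab-separated rows.

    Rows are sorted by onset; three decimals is an assumed precision.
    """
    rows = sorted(segments, key=lambda s: s[0])
    return "".join(
        f"{onset:.3f}\t{duration:.3f}\t{cls}\n"
        for onset, duration, cls in rows
    )


def parse_segments(text):
    """Parse rows written by format_segments back into tuples."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        onset, duration, cls = line.split("\t")
        segments.append((float(onset), float(duration), cls.strip()))
    return segments


# Overlapping music and speech segments are allowed in the combined task.
detected = [(12.5, 3.0, "speech"), (0.0, 10.25, "music"), (8.0, 4.0, "speech")]
print(format_segments(detected), end="")
```

Sorting before writing keeps the file valid even if the detector emits segments class by class rather than in time order.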

Packaging submissions

All submissions should be statically linked to all libraries (the presence of dynamically linked libraries cannot be guaranteed) and include a README file containing the following information:

  • Command line calling format for all executables and an example formatted set of commands
  • Number of threads/cores used or whether this should be specified on the command line
  • Expected memory footprint
  • Expected runtime
  • Any required environments (and versions), e.g. python, java, bash, matlab.

Potential Participants

name/email

Jan Schlüter, jan.schlueter ... ofai.at
David Doukhan, ddoukhan … ina.fr
Blai Meléndez-Catalán, bmelendez … bmat.com


Time and hardware limits

Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions are specified. A hard limit of 72 hours will be imposed on runs. Submissions that exceed this runtime may not receive a result.

Submission closing date

The submission deadline for these tasks is the 11th of August.

Task specific mailing list

All discussions on this task will take place on the MIREX "EvalFest" list. If you have a question or comment, simply include the task name in the subject heading.