2005:Audio Onset Detect

From MIREX Wiki
Latest revision as of 11:47, 15 November 2005

Proposer

Paul Brossier (Queen Mary) paul.brossier@elec.qmul.ac.uk

Pierre Leveau (Laboratoire d'Acoustique Musicale, GET-ENST (Télécom Paris)) leveau at lam dot jussieu dot fr

Title

Onset Detection Contest


Description

The aim of this contest is to compare state-of-the-art onset detection algorithms on music recordings. The methods will be evaluated on a large, varied and reliably-annotated dataset, composed of sub-datasets grouping files of the same type.

1) Input data

Audio format:

The data are monophonic sound files, with the associated onset times and data about the annotation robustness.

  • CD-quality (PCM, 16-bit, 44100 Hz)
  • single channel (mono)
  • file length between 2 and 36 seconds (total time: 14 minutes)
  • File names:

Audio content:

The dataset is subdivided into classes, because onset detection is sometimes performed in applications dedicated to a single type of signal (e.g. segmentation of a single track in a mix, drum transcription, segmentation of databases of complex mixes...). The performance of each algorithm will be assessed on the whole dataset, but also on each class separately.

The dataset contains 85 files from 5 classes annotated as follows:

  • 30 solo drum excerpts cross-annotated by 3 people
  • 30 solo monophonic pitched instruments excerpts cross-annotated by 3 people
  • 10 solo polyphonic pitched instruments excerpts cross-annotated by 3 people
  • 15 complex mixes cross-annotated by 5 people

Moreover, the monophonic pitched instruments class is divided into 6 sub-classes: brass (2 excerpts), winds (4), sustained strings (6), plucked strings (9), bars and bells (4), and singing voice (5).

Nomenclature

<AudioFileName>.wav for the audio file


2) Output data

The onset detection algorithms will return onset times in a text file: <Results of evaluated Algo path>/<AudioFileName>.output.


Onset file Format

<onset time(in seconds)>\n

where \n denotes the end of line. The < and > characters are not included.
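The output format above can be produced and parsed with a few lines of code. This is a minimal sketch, not part of the contest infrastructure; the function names `write_onsets` and `read_onsets` are illustrative:

```python
def write_onsets(path, onsets):
    """Write onset times (floats, in seconds) to a file, one per line,
    each line terminated by \n, as required by the output format."""
    with open(path, "w") as f:
        for t in onsets:
            f.write(f"{t:.6f}\n")

def read_onsets(path):
    """Read an onset file back as a list of floats, skipping blank lines."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]
```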

Participants

  • Julien Ricard and Gilles Peterschmitt (no affiliation, algorithm previously developed at University Pompeu Fabra), julien.ricard@gmail.com, gpeter@iua.upf.es
  • Axel Roebel (IRCAM), roebel@ircam.fr
  • Antonio Pertusa, José M. Iñesta (University of Alicante) and Anssi Klapuri (Tampere University of Technology), pertusa@dlsi.ua.es, inesta@dlsi.ua.es, klap@cs.tut.fi
  • Alexandre Lacoste and Douglas Eck (University of Montreal), lacostea@sympatico.ca, eckdoug@iro.umontreal.ca
  • Nick Collins (University of Cambridge), nc272@cam.ac.uk
  • Paul Brossier (Queen Mary, University of London), paul.brossier@elec.qmul.ac.uk
  • Kris West (University of East Anglia), kw@cmp.uea.ac.uk

Other Potential Participants

  • Balaji Thoshkahna (Indian Institute of Science, Bangalore), balajitn@ee.iisc.ernet.in
  • MIT, MediaLab

Tristan Jehan <tristan{at}medialab{dot}mit{dot}edu>

  • LAM, France

Pierre Leveau <leveau at lam dot jussieu dot fr> Laurent Daudet <daudet at lam dot jussieu dot fr>

  • IRCAM, France

Xavier Rodet (rod{at}ircam{dot}fr), Geoffroy Peeters (peeters{at}ircam{dot}fr);

Evaluation Procedures

The detected onset times will be compared with the ground-truth ones. For a given ground-truth onset time, if there is a detection in a tolerance time-window around it, it is considered as a correct detection (CD). If not, there is a false negative (FN). The detections outside all the tolerance windows are counted as false positives (FP). Doubled onsets (two detections for one ground-truth onset) and merged onsets (one detection for two ground-truth onsets) will be taken into account in the evaluation. Doubled onsets are a subset of the FP onsets, and merged onsets a subset of the FN onsets.
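The matching step above can be sketched as follows. This is an illustrative implementation, not the official scoring code: it assumes a greedy rule that pairs each ground-truth onset with the closest unused detection inside a +/- 50 ms tolerance window, then derives the CD, FN and FP counts:

```python
def match_onsets(ground_truth, detections, tolerance=0.05):
    """Return (n_cd, n_fn, n_fp) for one annotation, with times in seconds.
    A ground-truth onset is a correct detection (CD) if an unused detection
    lies within +/- tolerance of it; unmatched ground-truth onsets are FN,
    unmatched detections are FP."""
    used = set()
    n_cd = 0
    for gt in ground_truth:
        # candidate detections inside the tolerance window around gt
        candidates = [i for i, d in enumerate(detections)
                      if i not in used and abs(d - gt) <= tolerance]
        if candidates:
            best = min(candidates, key=lambda i: abs(detections[i] - gt))
            used.add(best)
            n_cd += 1
    n_fn = len(ground_truth) - n_cd
    n_fp = len(detections) - n_cd
    return n_cd, n_fn, n_fp
```

For example, with ground truth [1.0, 2.0] and detections [1.01, 2.2], the first detection matches (10 ms error) while the second is 200 ms away, giving one CD, one FN and one FP.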


We define:

Precision

P = Ocd / (Ocd + Ofp)


Recall

R = Ocd / (Ocd + Ofn)


and the F-measure:

F = 2*P*R/(P+R)


with these notations:

Ocd: number of correctly detected onsets (CD)

Ofn: number of missed onsets (FN)

Om: number of merged onsets

Ofp: number of false positive onsets (FP)

Od: number of double onsets


Other indicative measurements:

FP rate:

FP = 100 * Ofp / (Ocd + Ofp)

Doubled Onset rate in FP

D = 100 * Od / Ofp

Merged Onset rate in FN

M = 100 * Om / Ofn
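The measures defined above translate directly into code. A minimal sketch, computing Precision, Recall and F-measure from the counts Ocd, Ofn and Ofp (the zero-count guards are an assumption for empty files, not stated in the definitions):

```python
def f_measure(ocd, ofn, ofp):
    """Precision, Recall and F-measure from the CD, FN and FP counts."""
    p = ocd / (ocd + ofp) if (ocd + ofp) else 0.0   # P = Ocd / (Ocd + Ofp)
    r = ocd / (ocd + ofn) if (ocd + ofn) else 0.0   # R = Ocd / (Ocd + Ofn)
    f = 2 * p * r / (p + r) if (p + r) else 0.0     # F = 2*P*R / (P+R)
    return p, r, f

def fp_rate(ocd, ofp):
    """Indicative FP rate: FP = 100 * Ofp / (Ocd + Ofp)."""
    return 100.0 * ofp / (ocd + ofp)
```

For instance, 80 correct detections with 20 misses and 20 false positives give P = R = F = 0.8 and an FP rate of 20%.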


Because files are cross-annotated, the mean Precision and Recall rates are defined by averaging Precision and Recall rates computed for each annotation.

To establish a ranking (and indicate a winner...), we will use the F-measure, widely used in string comparisons. This criterion is arbitrary, but gives an indication of performance. It must be remembered that onset detection is a preprocessing step, so the real cost of an error of each type (false positive or false negative) depends on the application following this task.


Evaluation measures:

  • percentage of correct detections / false positives (can also be expressed as precision/recall)
  • time precision (tolerance from +/- 50 ms to less). For certain files, we can't be much more accurate than 50 ms because of the weak annotation precision. This must be taken into account.
  • separate scoring for different instrument types (percussive, strings, winds, etc.)

More detailed data:

  • percentage of doubled detections
  • speed measurements of the algorithms
  • scalability to large files
  • robustness to noise, loudness

Relevant Test Collections

Audio data are commercial CD recordings, recordings made by MTG at UPF Barcelona and excerpts from the RWC database. Annotations were conducted by the Centre for Digital Music at QMU London (62% of annotations), Musical Acoustics Lab at Paris 6 University (25%), MTG at UPF Barcelona (11%) and Analysis Synthesis Group at IRCAM Paris (2%). MATLAB annotation software by Pierre Leveau (http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm) was used for this purpose. Annotators were provided with an approximate aim (catching all onsets corresponding to music notes, including pitched onsets and not only percussive ones), but no further supervision of annotation was performed.

The defined ground-truth can be critical for the evaluation. Precise instructions on which events to annotate must be given to the annotators. Some sounds are easy to annotate: isolated notes, percussive instruments, quantized music (techno). For such sounds, the annotations by several annotators are very close, because the visualizations (signal plot, spectrogram) are clear enough. Other sounds are quite impossible to annotate precisely: legato bowed strings phrases, even more so with added reverb. Slightly broken chords also introduce ambiguities on the number of onsets to mark. In these cases the annotations can be spread, and the annotation precision must be taken into account in the evaluation.

Article about annotation by Pierre Leveau et al.: http://www.lam.jussieu.fr/src/Membres/Leveau/ressources/Leveau_ISMIR04.pdf

Review 1

Besides being useful per se, onset detection is a pre-processing step for further music processing: rhythm analysis, beat tracking, instrument classification, and so on. It would be interesting for the proposal to briefly discuss whether the evaluation metrics are unbiased with respect to the different potential applications.

In order to decide which algorithm is the winner a single number should be finally extracted. A possibility to do so is tuning the algorithms to a single working point on the ROC curve, e.g. say allow a difference between FP and FN of less than 1%. The evaluation should account for a statistical significance measure. I suppose McNemar's test could do the job.

It does not mention whether there will be training data available to participants. To my understanding, evaluation on the following three subcategories is enough: monophonic instrument, polyphonic solo instrument and complex mixes.

I cannot tell whether the suggested participants are willing to participate. Other potential candidates could be: Simon Dixon, Harvey Thornburg, Masataka Goto.


Review 2

Onset detection is a first step towards a number of very important DSP-oriented tasks that are relevant to the MIR community. However I wonder if it is too low-level to be of interest to the wider ISMIR bunch. I think the authors need to justify in clear terms the gains to the MIR community of carrying out such an evaluation exercise.

The problem is well defined, however the author needs to take care when defining the task of onset detection for non-percussive events (e.g. bowed onset from a cello) or for non-musical events (e.g. breathing, key strokes that produce transient noise in the signal). Evaluations need to consider these cases.

The list of participants is good. I would add to the list Nick Collins and Stephen Hainsworth from Cambridge U., Chris Duxbury and Samer Abdallah from Queen Mary, and perhaps Chris Raphael from Indiana University.

The evaluation procedures are not clear to me. The current proposal is quite verbose; I suggest that the author reduce the length of the proposal and make it more assertive. There seem to be a few different possibilities for evaluation: measuring the precision/recall of the algorithms against a database of hand-labeled onsets (from different genres/instrumentations); measuring the temporal localization of detected onsets against a database of "precisely-labeled" onsets (perhaps from MIDI-generated sounds); measuring the computational complexity of the algorithms; measuring their scalability to large sound files; and measuring their robustness to signal distortion/noise. I think the first three evaluations are a must, and that the last two evaluations will depend on the organizers and the feedback from the contestants. For the first two evaluations, there needs to be a large set of ground truth data. The ground truth could be generated using the semi-automatic tool developed by Leveau et al. Each sound file needs to be cross-annotated by a set of different annotators (5?), such that the variability between the different annotations is used to define the "tolerance window" for each onset. Onsets with too-high variance in their annotation should be discarded for the evaluation (obviously also eliminating from the evaluation the false positives that they might produce). Onsets with very little variance can be used to evaluate the temporal precision of the algorithms. You should expect, for example, percussive onsets in low polyphonies to present low variance in the annotations, while non-percussive onsets in, say, pop music are more likely to present a high variance in their annotations. These observations on the annotated database could already be of great interest to the community.
Additionally, if the evaluated systems output some measure of the reliability of their detections, you should incorporate that into your evaluation procedures. I am not entirely sure how could you do that, so it is probably a matter for discussion within the community.

Regarding the test data, I cannot see why sounds should be monophonic and not polyphonic. Most music is polyphonic and for results to be of interest to the community the test data should contain real-life cases. I will also suggest keeping the use of MIDI sounds to the minimum possible. Separating results by type of onset (e.g. percussive, pop, etc.) seems a logical choice, so I agree with the author that the dataset should comprise music that covers the relevant categories. I personally prefer the classification of onsets according to the context in which they appear: onsets on pitched percussive music (e.g. piano and guitar music), onsets on pitched non-percussive music (e.g. string and brass music, voice or orchestral music), onsets on non-pitched percussive music (e.g. drums) and a combination of the above ("complex mixes", e.g. pop, rock and jazz music, presenting leading instruments such as voice and sax, combined with drums, pianos and bass in the background). I don't think a classification regarding monophonic and polyphonic instruments is that relevant.

Downie's Comments

1. Tend to agree that this is a rather low level and not very sexy task to evaluate in the MIR context. However, I have great respect for folks working in this area and will defer to the judgement of the community on the suitability of this task as part of our evaluation framework.

2. Like many of these proposals, the dependence on annotations appears to be one of the biggest hurdles. If we cannot get the suitable annotations done in time, is there a doable sub-set of this that we might run as we prepare for future MIREXes?