2005:Audio Onset Detect
Revision as of 00:55, 23 May 2005
Proposer
Paul Brossier (Queen Mary) paul.brossier@elec.qmul.ac.uk
Pierre Leveau (Laboratoire d'Acoustique Musicale, GET-ENST (Télécom Paris)) leveau at lam dot jussieu dot fr
Title
Onset Detection Contest
Description
The aim of this contest is to compare state-of-the-art onset detection algorithms on music recordings. The methods will be evaluated on a large, varied and reliably annotated dataset, composed of sub-datasets grouping files of the same type.
1) Input data
Audio format:
The data are monophonic sound files, with the associated onset times and data about the annotation robustness.
- CD-quality (PCM, 16-bit, 44100 Hz)
- single channel (mono) or stereo
- file length between 8 and 15 seconds
- File names: see Nomenclature below
Audio content:
The dataset is subdivided into classes, because onset detection is sometimes performed in applications dedicated to a single type of signal (e.g. segmentation of a single track in a mix, drum transcription, segmentation of databases of complex mixes...). The performance of each algorithm will be assessed on the whole dataset, but also on each class separately.
The dataset contains 100 files from 5 classes annotated as follows:
- 30 solo drum excerpts cross-annotated by 3 people
- 30 solo monophonic pitched instruments excerpts cross-annotated by 3 people
- 10 solo polyphonic pitched instruments excerpts cross-annotated by 3 people
- 15 complex mixes cross-annotated by 5 people
- 15 complex mixes synthesized from MIDI
Nomenclature
<AudioFileName>.wav for the audio file
2) Output data
The onset detection algorithms will return onset times in a text file: <Results of evaluated Algo path>/<AudioFileName>.output.
Onset file Format
<onset time(in seconds)>\n
where \n denotes the end of line. The < and > characters are not included.
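
For illustration, a submission might write and read this file format with a few lines of Python. This is only a sketch of the format described above; the function names and the six-decimal formatting are arbitrary choices, not part of the specification.

 def write_onset_file(onsets, output_path):
     # One detected onset time, in seconds, per line, terminated by "\n".
     with open(output_path, "w") as f:
         for t in sorted(onsets):
             f.write("%.6f\n" % t)
 
 def read_onset_file(path):
     # Parse a <AudioFileName>.output file back into a list of floats.
     with open(path) as f:
         return [float(line) for line in f if line.strip()]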
3) Syntax
Competitors will submit their algorithms together with some sets of parameters (e.g. thresholds, analysis frame size, ...) and the corresponding syntax of the algorithm. Each set of parameters will be tested in the evaluation. The best-performing parameter set will be used for the final ranking; the other results will be displayed on a CD/FP plane (see the Evaluation section).
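
As an illustration of how each submitted parameter set could be run over the test files, here is a minimal Python sketch. The command name "mydetector" and its flags are placeholders, not a prescribed syntax; each competitor provides the actual invocation syntax with their submission.

 import subprocess
 from pathlib import Path
 
 def run_parameter_set(audio_dir, results_dir, threshold, frame_size):
     # Run one parameterization of a (hypothetical) detector on every test file,
     # writing <AudioFileName>.output into the results directory.
     results_dir = Path(results_dir)
     results_dir.mkdir(parents=True, exist_ok=True)
     for wav in sorted(Path(audio_dir).glob("*.wav")):
         out_file = results_dir / (wav.stem + ".output")
         subprocess.run(["mydetector",
                         "--threshold", str(threshold),
                         "--frame-size", str(frame_size),
                         "--input", str(wav),
                         "--output", str(out_file)],
                        check=True)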
Potential Participants
- Tampere University of Technology, Audio Research Group
Anssi Klapuri <klap@cs.tut.fi>
- MIT, MediaLab
Tristan Jehan <tristan@medialab.mit.edu>
- LAM, France
Pierre Leveau <leveau at lam dot jussieu dot fr> Laurent Daudet <daudet at lam dot jussieu dot fr>
- IRCAM, France
Xavier Rodet <rod@ircam.fr>, Axel Roebel <roebel@ircam.fr>, Geoffroy Peeters <peeters@ircam.fr>
- University Pompeu Fabra, Music Technology Group
Julien Ricard <jricard@iua.upf.es> Fabien Gouyon <fgouyon@iua.upf.es>
- Queen Mary College, Centre for Digital Music
Juan Pablo Bello <juan.bello@elec.qmul.ac.uk> Paul Brossier <paul.brossier@elec.qmul.ac.uk>
- Indian Institute of Science, Bangalore
Balaji Thoshkahna <balajitn@ee.iisc.ernet.in>
- Centre for Music and Science, Cambridge
Nick Collins <nc272 at cam dot ac dot uk>
Evaluation Procedures
The detected onset times will be compared with the ground-truth ones. A detected onset time is counted as a correct detection (CD) if it falls within a tolerance time-window around a ground-truth onset; otherwise it is a false positive (FP). Doubled onsets (two detections for one ground-truth onset) and merged onsets (one detection for two ground-truth onsets) will also be taken into account in the evaluation.
We thus define the FP rate:
FP = 100 * (Ofp + Od) / Or
and the CD rate:
CD = 100 * (Or - Ofn - Om) / Or
with
Or: number of correctly detected onsets
Ofn: number of missed onsets
Om: number of merged onsets
Ot: number of ground-truth onsets
Ofp: number of false positive onsets
Od: number of double onsets
Because files are cross-annotated, the mean CD and FP rates are defined by averaging CD and FP rates computed for each annotation.
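
For illustration, the matching and the two rates could be computed along the following lines. This is only a sketch under simplifying assumptions: a greedy nearest-neighbour matching is used, merged onsets are ignored (Om = 0), and the denominator is taken as the number of onsets in the reference annotation. It is not the contest's reference implementation.

 def match_onsets(detected, ground_truth, tol=0.05):
     # Match detected onsets to ground-truth onsets within +/- tol seconds and
     # return (Ocd, Ofn, Ofp, Od): correct detections, missed onsets,
     # false positives and doubled onsets. Merged onsets are not handled here.
     ground_truth = sorted(ground_truth)
     matched = [False] * len(ground_truth)
     o_cd = o_fp = o_d = 0
     for t in sorted(detected):
         candidates = [i for i, g in enumerate(ground_truth) if abs(g - t) <= tol]
         if not candidates:
             o_fp += 1
             continue
         i = min(candidates, key=lambda k: abs(ground_truth[k] - t))
         if matched[i]:
             o_d += 1      # extra detection of an already-matched onset
         else:
             matched[i] = True
             o_cd += 1
     o_fn = matched.count(False)
     return o_cd, o_fn, o_fp, o_d
 
 def mean_rates(detected, annotations, tol=0.05):
     # Average the CD and FP rates over the cross-annotations of one file.
     cd_rates, fp_rates = [], []
     for gt in annotations:
         o_cd, o_fn, o_fp, o_d = match_onsets(detected, gt, tol)
         n = len(gt)
         cd_rates.append(100.0 * o_cd / n)
         fp_rates.append(100.0 * (o_fp + o_d) / n)
     return sum(cd_rates) / len(cd_rates), sum(fp_rates) / len(fp_rates)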
If an algorithm accepts parameters (e.g. the threshold, for those based on detection functions), it will be tuned to a limited number of working points on the ROC curve, e.g. one with a high correct-detection rate, another with a low false-positive rate, and a third in between (up to 15 parameterizations can be submitted). These tunings will be considered different versions of the same algorithm, and must be chosen before submission to the contest.
To establish a ranking (and indicate a winner...), we will compute the error rate (inspired by Alexander Lerch's work):
q = (Ot - (Ofn + Ofp + Od + Om)) / (Or + (Ofn + Ofp + Od + Om))
This criterion is arbitrary, but gives an indication of performance. It must be remembered that onset detection is a preprocessing step, so the real cost of an error of each type (false positive or false negative) depends on the application following this task.
Evaluation measures:
- percentage of correct detections / false positives (can also be expressed as precision/recall)
- time precision (tolerance from +/- 50 ms down to less). For certain files we cannot be much more accurate than 50 ms because of the limited annotation precision; this must be taken into account.
- separate scoring for different instrument types (percussive, strings, winds)
More detailed data:
- percentage of doubled detections
- speed measurements of the algorithms
- scalability to large files
- robustness to noise, loudness
Relevant Test Collections
Audio data are recordings made by the MTG at UPF Barcelona and excerpts from the RWC database. MIDI data are excerpts from the RWC database. Audio annotations were carried out by the Centre for Digital Music at Queen Mary, University of London (69% of annotations), the Musical Acoustics Lab (LAM) at Paris 6 University (18%), the MTG at UPF Barcelona (11%) and the Analysis/Synthesis Group at IRCAM, Paris (2%). The MATLAB annotation software by Pierre Leveau (http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm) was used for this purpose. Annotators were given an approximate aim (catching all onsets corresponding to musical notes, including pitched onsets and not only percussive ones), but no further supervision of the annotation was performed.
The defined ground truth can be critical for the evaluation. For the MIDI-controlled instruments, care must be taken to synchronize the MIDI clock and the audio recording clock. For real-world sounds, precise instructions on which events to annotate must be given to the annotators. Some sounds are easy to annotate: isolated notes, percussive instruments, quantized music (techno); for these, the annotations from several annotators are very close, because the visualizations (signal plot, spectrogram) are clear enough. Other sounds are nearly impossible to annotate precisely: legato bowed-string phrases, even more so with reverb. Slightly broken chords also introduce ambiguities in the number of onsets to mark. In these cases the annotations can be spread out, and the annotation precision must be taken into account in the evaluation.
Article about annotation by Pierre Leveau et al.: http://www.lam.jussieu.fr/src/Membres/Leveau/ressources/Leveau_ISMIR04.pdf
Review 1
Besides being useful per se, onset detection is a pre-processing step for further music processing: rhythm analysis, beat tracking, instrument classification, and so on. It would be interesting for the proposal to briefly discuss whether the evaluation metrics are unbiased with respect to these different potential applications.
In order to decide which algorithm is the winner, a single number should ultimately be extracted. One possibility is to tune the algorithms to a single working point on the ROC curve, e.g. allowing a difference between FP and FN of less than 1%. The evaluation should include a statistical significance measure; I suppose McNemar's test could do the job.
The proposal does not mention whether training data will be available to participants. To my understanding, evaluation on the following three subcategories is enough: monophonic instruments, polyphonic solo instruments and complex mixes.
I cannot tell whether the suggested participants are willing to participate. Other potential candidates could be: Simon Dixon, Harvey Thornburg, Masataka Goto.
Review 2
Onset detection is a first step towards a number of very important DSP-oriented tasks that are relevant to the MIR community. However, I wonder if it is too low-level to be of interest to the wider ISMIR community. I think the authors need to justify in clear terms the gains to the MIR community of carrying out such an evaluation exercise.
The problem is well defined; however, the authors need to take care when defining the task of onset detection for non-percussive events (e.g. a bowed onset from a cello) or for non-musical events (e.g. breathing, or keystrokes that produce transient noise in the signal). Evaluations need to consider these cases.
The list of participants is good. I would add to the list Nick Collins and Stephen Hainsworth from Cambridge U., Chris Duxbury and Samer Abdallah from Queen Mary, and perhaps Chris Raphael from Indiana University.
The evaluation procedures are not clear to me. The current proposal is quite verbose; I suggest that the author shorten the proposal and make it more assertive.

There seem to be a few different possibilities for evaluation: measuring the precision/recall of the algorithms against a database of hand-labeled onsets (from different genres/instrumentations); measuring the temporal localization of detected onsets against a database of "precisely-labeled" onsets (perhaps from MIDI-generated sounds); measuring the computational complexity of the algorithms; measuring their scalability to large sound files; and measuring their robustness to signal distortion/noise. I think the first three evaluations are a must, and the last two will depend on the organizers and the feedback from the contestants.

For the first two evaluations, there needs to be a large set of ground truth data. The ground truth could be generated using the semi-automatic tool developed by Leveau et al. Each sound file needs to be cross-annotated by a set of different annotators (5?), such that the variability between the different annotations is used to define the "tolerance window" for each onset. Onsets with too high a variance in their annotations should be discarded for the evaluation (obviously also eliminating from the evaluation the false positives that they might produce). Onsets with very little variance can be used to evaluate the temporal precision of the algorithms. You should expect, for example, percussive onsets in low polyphonies to present low variance in the annotations, while non-percussive onsets in, say, pop music are more likely to present high variance in their annotations. These observations on the annotated database could already be of great interest to the community.

Additionally, if the evaluated systems output some measure of the reliability of their detections, you should incorporate that into your evaluation procedures. I am not entirely sure how you could do that, so it is probably a matter for discussion within the community.
Regarding the test data, I cannot see why sounds should be monophonic and not polyphonic. Most music is polyphonic, and for the results to be of interest to the community the test data should contain real-life cases. I would also suggest keeping the use of MIDI sounds to the minimum possible.

Separating results by type of onset (e.g. percussive, pop, etc.) seems a logical choice, so I agree with the author that the dataset should comprise music that covers the relevant categories. I personally prefer a classification of onsets according to the context in which they appear: onsets in pitched percussive music (e.g. piano and guitar music), onsets in pitched non-percussive music (e.g. string and brass music, voice or orchestral music), onsets in non-pitched percussive music (e.g. drums), and a combination of the above ("complex mixes", e.g. pop, rock and jazz music, presenting leading instruments such as voice and sax combined with drums, piano and bass in the background). I don't think a classification regarding monophonic and polyphonic instruments is that relevant.
Downie's Comments
1. Tend to agree that this is a rather low-level and not very sexy task to evaluate in the MIR context. However, I have great respect for folks working in this area and will defer to the judgement of the community on the suitability of this task as part of our evaluation framework.
2. Like many of these proposals, the dependence on annotations appears to be one of the biggest hurdles. If we cannot get the suitable annotations done in time, is there a doable subset of this that we might run as we prepare for future MIREXes?