2005: Audio Drum Detection
IMPORTANT: see below for a copy of the email with the final contest setup decisions
Koen Tanghe (Ghent University) koen [dot] tanghe [at] ugent [dot] be
Drum detection from polyphonic audio.
The task consists of determining the positions (localization) and corresponding drum class names (labeling) of drum events in polyphonic music. This rhythmic information is highly relevant for today's popular music genres: it can help in determining tempo and (sub)genre, and can also be queried for directly (typical rhythmic sequences/patterns).
1) Input data
The only input for this task is a set of sound file excerpts adhering to the format and content requirements mentioned below.
- CD-quality (PCM, 16-bit, 44100 Hz)
- mono and stereo
- 30 seconds excerpts
- files are named as "001.wav" to "999.wav" (or with another extension depending on the chosen format)
- polyphonic music with drums (most)
- polyphonic music without drums (some)
- different genres / playing styles
- both live performances and sequenced music
- different types of drum sets (acoustic, electronic, ...)
- at least 50 files
- participants receive at least 10 files in advance
[Perfe 02/25/05: I would vote for mono and more than 50 files; the 10 files to be given to participants should be randomly drawn from the total pool of N available annotated files unless there is some bias in the collection that is related to genre or other important class; I mean, if there are 30% of electronic percussion files, then 3 out of 10 files should contain that. In that case, a stratified sampling should be used]
[Masataka 03/07/2005: I agree with the above Perfe's comments. In addition, our team prefers to use the whole songs (not excerpts). I would like to make sure that the input audio signals contain sounds of various musical instruments (some of them include vocals, too), and that the actual drum sounds (sound samples) included in the input mixture are not known in advance because we have to deal with those situations in practical applications.]
2) Output results
The output of this task is, for each sound file, an ASCII text file containing 2 columns, where each line represents a drum event. The first column is the position (in seconds) of the drum event, and the second column is the label for the drum event at that position. Multiple drum events may occur at the same time, so there may be multiple lines having the same value in the first column. The file names of the output files are the same as the audio files, but the extension is ".txt" (so: "001.txt" for "001.wav").
Classes and labels that are considered:
- BD (bass drum)
- SD (snare drum)
- HH (hihat)
- CY (cymbal)
- TM (tom)
[Perfe 02/25/05: What about adding an "other" class? How are we going to manage the combination of sounds?]
[Masataka 03/07/2005: How about adding the option of evaluating only BD, SD, and HH?]
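The two-column output format above is simple enough to sketch a reader for it. This is an illustrative sketch only (the function name and sample values are made up, not part of the spec); it shows how an output file such as "001.txt" could be parsed into (time, label) pairs:

```python
def read_drum_annotation(text):
    """Parse two-column drum annotation text into (time, label) pairs."""
    events = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        time_str, label = line.split()
        events.append((float(time_str), label))
    # Multiple events may share the same onset time (e.g. BD + HH together)
    return events

sample = "0.512 BD\n0.512 HH\n1.024 SD\n"
print(read_drum_annotation(sample))
```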
- Balaji Thoshkahna (Indian Institute of Science, Bangalore), balajitn[at]ee.iisc.ernet.in [BD,SD,HH]
- Olivier Gillet and Gaël Richard (ENST), email@example.com, firstname.lastname@example.org
- George Tzanetakis (University of Victoria), email@example.com
- Christian Dittmar (Fraunhofer), firstname.lastname@example.org
- Jouni Paulus (Tampere University of Technology), email@example.com
- Kazuyoshi Yoshii (Kyoto University), Masataka Goto (AIST), Hiroshi G. Okuno (Kyoto University), firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
- Koen Tanghe (IPEM, Ghent University), Sven Degroeve (KERMIT, Ghent University), koen [dot] tanghe [at] ugent [dot] be, sven [dot] degroeve [at] ugent [dot] be, Bernard De Baets (KERMIT, Ghent University), bernard [dot] debaets [at] ugent [dot] be
Other Potential Participants
- Vegard Sandvold (Notam), Fabien Gouyon (MTG, University of Pompeu Fabra), Perfecto Herrera (UPF)
vegardsa[at]student[dot]matnat[dot]uio[dot]no, fabien[dot]gouyon[at]iua[dot]upf[dot]es, perfe[at]iua[dot]upf[dot]es, likely
- Christian Uhle (Fraunhofer)
- Derry FitzGerald (Cork Institute of Technology)
Comparison rules: Questions to be answered:
- when do we consider a detected event as "correct"?
- when do we consider a detected event as "false"?
- when do we consider a ground truth event as "missed"?
- what's the maximum difference in time between real drum event position and detected drum event position that can be allowed?
- is detecting an event at a valid ground truth position but classifying it incorrectly as bad as not detecting the event at all?
Evaluation measures: which performance measure are we going to use? precision, recall, accuracy, F-measure, ...?
Drum detection may have several goals, and thus the evaluation should reflect the algorithms' performance relative to the intended goal or application. In our case, I believe the interest is "obtaining metadata that describe the drum track of a file". In this context, the ideal would be a kind of perceptual distance in the metadata domain, but do such distances exist? Is it possible to define one without conducting lengthy perceptual experiments? One possibility would be to use a distance similar to the one we have used for our drum loop query system (to be published soon in the special issue of JIIS). The basic idea is to compute an "edit distance" between the obtained metadata strings and the ground truth metadata strings. This edit distance accounts for deletions, insertions and confusions, but also takes desynchronization between events into account and allows associating cost coefficients with confusions (for example, it is often less dramatic to miss a hihat ("charley") hit than a bass drum hit).
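The weighted edit distance idea can be sketched with standard dynamic programming over two label sequences, using a per-label deletion/insertion cost table and a confusion-cost table. Note this sketch ignores the desynchronization (timing) component mentioned above and uses made-up cost values; the published JIIS system may differ:

```python
def weighted_edit_distance(seq_a, seq_b, del_cost, sub_cost):
    """Edit distance between two drum label sequences with weighted costs."""
    n, m = len(seq_a), len(seq_b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost[seq_a[i - 1]]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + del_cost[seq_b[j - 1]]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = seq_a[i - 1], seq_b[j - 1]
            d[i][j] = min(
                d[i - 1][j] + del_cost[a],  # delete from seq_a
                d[i][j - 1] + del_cost[b],  # insert from seq_b
                d[i - 1][j - 1] + (0.0 if a == b else sub_cost[(a, b)]),
            )
    return d[n][m]

# Illustrative costs: missing a bass drum is penalized more than a hihat
del_cost = {"BD": 1.0, "SD": 1.0, "HH": 0.5}
sub_cost = {(a, b): 0.8 for a in del_cost for b in del_cost if a != b}
print(weighted_edit_distance(["BD", "HH", "SD"], ["BD", "SD"], del_cost, sub_cost))
```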
Relevant Test Collections
Ground truth annotations:
For each sound file to be analyzed, there is a corresponding annotation file using the same format as described in "2) Output results". The ground truth files are obtained by manual annotation by people who have experience with drum sounds (drummers?). RWC and Magnatune are excellent potential sources. For annotation, it would be important to include an annotation cross-check (several drummers, three being an ideal minimum, annotate the same files). This would be quite similar to the methodology that we have followed for the Onset detection evaluation (see P. Leveau's paper at the last ISMIR). This would permit an excellent ground truth annotation, and would also permit evaluating which kinds of confusions are never made and which are often made, what the acceptable maximum difference in time between real and detected drum events is, etc. However, as always, this requires more effort and time. Another option (for sequenced music) is to use *audio recordings* of MIDI sequences, and use the drum tracks of the MIDI files to obtain the ground truth annotations.
[Perfe 02/25/05: In case of using MIDI files, I'd suggest adding some "human touch" MIDI post-processing, plus some basic audio production tricks such as compression and reverb, in order to make the audio as close as possible to the complexity of real recordings; I would not use more than 30% MIDI files, if needed]
[Masataka 03/07/2005: For annotation, I've been working on labeling all the onset times of BD, SD, and HH on more than 50 songs in the RWC Music Database (RWC-MDB-P-2001). I have a plan to put them on http://staff.aist.go.jp/m.goto/RWC-MDB/ so that they can be available for RWC-MDB users.]
[Christian Dittmar 04/06/2005: I just enlisted for participation and I wanted to let you know of our small annotated database. It comprises 44 audio snippets of approximately 30 seconds duration. They are 44.1kHz/16Bit/Mono, unfortunately not copyright free. The annotation was done by 3 different listeners, all experienced with drum sounds and musical rules. 17 different instrument classes are featured, including Kick, Snare, Tom, Hihat and Cymbal (though not present in every sample).]
Problem is both clearly defined and interesting in terms of current research.
Audio format and content are fine; however, it would be nice to include more than 50 files, although this would probably make the transcription task too difficult/time-consuming. Either mono or stereo recordings should be chosen; I suggest polling participants to see if anyone intends to use stereo information or whether all participants will down-mix to mono. There is no mention of transcribed datasets, so this will have to be done from scratch, and therefore the proposed use of the RWC or Magnatune databases is a good idea. I am unsure whether the use of synthesized MIDI files is valid unless they are produced using samples rather than synthesized drum sounds, and even then several different samples of each sound would be needed to ensure enough variance for a proper evaluation. I agree that ground truth annotations should be produced by 2-3 non-participating transcribers.
The output result format is fine; however, there may be more classes of drum/percussive sounds that should be considered, such as maracas or a tambourine. Obviously this will depend on the content of the audio files used, and these could form an abstract grouping if there are insufficient training examples for separate groups.
The evaluation procedures contain more questions than answers. Obviously this task is quite dependent on the onset detection/segmentation. Paul Brossier proposed that for the onset detection evaluation, events detected within 50ms of the transcribed position should be considered correct. I assume this holds for the drum detection proposal. I think it would be interesting and not too taxing to have two tracks: one supplying the ground-truth segmentation, requiring only the classification of detected events, and another performing the whole task.
Will submissions be run once or cross-validated? As there is going to be a very small dataset, a high number of folds should be used, although this should be limited so that every fold contains at least one example of each class.
F-measure (mean and variance for cross-validated results) would seem to be the most applicable evaluation metric if the whole task is performed. Will precision and recall be given equal weighting in the F-measure? See Speech & Language Processing, Jurafsky and Martin, 2000, p. 578, for the generalized F-measure: F = (b^2 + 1)PR / (b^2 P + R). When b = 1, P and R have equal weight; b > 1 gives more weight to R, and b < 1 to P. A simple accuracy result would be fine if segmentation is supplied. Statistical significance of differences between algorithms should be estimated, and it would be interesting to see the statistical significance of differences between using the ground-truth segmentation and the detected segmentation, thereby allowing us to assess whether the segmentation or the event classification was at fault.
Finally, given the list of potential participants and their publications, I think we can be confident of sufficient participation to run the evaluation.
Recommendation: Refine proposal and accept
[Masataka 03/07/2005: I would also vote for the use of F-measure that is the harmonic mean of the recall rate and the precision rate. I think the above-mentioned 50ms threshold for onset-deviation errors is too large for drum sounds: how about using 25ms, for example? We found it sufficient and appropriate when evaluating our method in our ISMIR 2004 paper: http://staff.aist.go.jp/m.goto/PAPER/ISMIR2004yoshii.pdf]
The problem is well described and its applications are of great concern to the MIR community. However, the evaluation procedures and test data contain more questions than answers. The proposal should be much more affirmative. Precise evaluation metrics need to be defined (so that every participant can implement them in a reproducible way), and the choice of the test data has to be discussed (is it relevant to test algorithms on MIDI data only if different synthesizers are used, or is it necessary to use audio data?). This proposal is not mature enough now, and the participants should provide some effort to improve it.
Another issue is that the problem is not MIR in itself, but rather mid-level sound description. If the main applications are tempo induction and subgenre classification, why not evaluate the performance for these applications directly? This would be more relevant for MIR, and annotation would be far less time-consuming. I think this issue has to be seriously considered by the participants in case they do not already own a sufficient amount of annotated data.
[Perfe 02/25/02: I do not agree that it is not MIR. It is MIR and it is high-level description. The main application is knowing if the song has drums, if there are lots of drums or only some spare hits, if there are lots of cymbals or not (hence, some genres could be discarded). The direct application, on the other hand, is still a bit far, as there are some perceptual issues involved, and perceptual issues require some time to be sorted out]
1. Am intrigued by the idea that MIDI or some other symbolic representation could be used to bring together the generation and ground truth tasks. Where does quantization fit into this (i.e., it is hard to "swing" midi files)?
2. If MIDI files are used for generation/ground truth, would it be necessary to introduce background music to make the task more difficult? I suppose the MIDI file could generate the background music also... I wonder if there are some other tricks we might be missing.
Open issues that need to be finalized
1.1 Input data
1.1.1 number of audio channels (mono/stereo)?
1.1.2 audio fragment length?
1.1.3 number of files?
1.2 Output results
1.2.1 which drum classes do we consider?
1.2.2 do we use an "other" class?
2.1 status for Christian Uhle ?
2.2 everyone: mail Emmanuel Vincent before June 12
3 Evaluation procedures
3.1 we still need a concrete formal procedure that is sufficiently worked-out so that the organizers can use it
3.2 onset-deviation errors: what's the limit?
3.3 what data do the participants receive in advance?
3.4 we should also have an efficiency/speed measure, because that is an important evaluation criterion too
4 Relevant test collections
4.1 what data will be available (with certainty)?
4.2 do we add audio from MIDI files or not, and if so, how many?
4.3 do we use cross-annotations, and if so: how do we handle that?
Final decisions as sent out by email to all participants on 20050705
Alright then, these are the final decisions:
1. Drum types and label names
- 3 drum types only:
  BD: bass drum
  SD: snare drum
  HH: hihat (any hihat: open, half-open, closed, ...)
- given the fact that the MG data set did not put ride cymbals into the HH group and to make sure the data is consistent, drum types are strictly these types only (so: no ride cymbals in the HH, no toms in the BD, no claps nor side sticks/rim shots in the SD, etc...)
- this involves the following remapping from 18 labels in the KT set to these 3 base labels:
  BD → BD, SD → SD, OH → HH, CH → HH; all other labels (RC, CC, LT, MT, HT, CP, RS, SC, SH, TB, WB, LC, HC, CB) are dropped
- a similar remapping for the CD set only keeps the true BD, SD and HH events
- all annotations are remapped to these 3 labels in advance (no looking back to the broader labels afterwards)
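The agreed remapping can be sketched as a simple dictionary lookup that drops everything outside the three base labels (names and sample values are illustrative only):

```python
# Only BD, SD and the two hihat labels (open/closed) survive the remapping;
# every other KT label is dropped, per the final decisions above.
REMAP = {"BD": "BD", "SD": "SD", "OH": "HH", "CH": "HH"}

def remap_events(events):
    """Keep only events whose label maps to one of the 3 base labels."""
    return [(t, REMAP[lab]) for t, lab in events if lab in REMAP]

events = [(0.1, "BD"), (0.1, "CH"), (0.6, "RC"), (1.2, "SD")]
print(remap_events(events))
```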
2. Sound files
- all audio is 44100 Hz, 16-bit mono, WAV PCM
- all available sound files will be used in their entirety (so 30 s where only 30 s are available, complete sound files where complete sound files are available)
- most sound files are "real" polyphonic music with drums
- some sound files may be audio recordings of MIDI files
- a few sound files may not contain any drums
3. Distributed data
- a representative random subset of the data will be made available to all participants in advance of the evaluation (20% of all available files, the organizers know how many they received on their ftp site)
- this data can be used by the participants as they please
- this data will not be used again during the evaluation
- the MIREX organizers will send a message with all the necessary info as soon as this data is available
- participants only send in the application part of their algorithm, not the training part (if there is one)
- algorithms must adhere to the specifications on the MIREX web page
- the MIREX organizers make sure the latest M2K release contains the final itineraries for the drum detection task as soon as possible and notify the participants when it is available
- F-measure (harmonic mean of the recall rate and the precision rate, beta parameter 1, so equal importance to prec. and recall) is calculated for each of three drum types (BD, SD, and HH), resulting in three F-measure scores and their average score
- speed measure: the time it takes to do the complete run from the moment your algorithm starts until the moment it stops will be reported
- parameter: the limit of onset-deviation errors in calculating the above F-measure is 30 ms (so a range of [-30 ms, +30 ms] around the true times)
- condition: the actual drum sounds (sound samples) used in the input audio signal of each song are not known in advance
- condition: participants who provided data and who need in-advance training or tuning should only use the data made available to all participants by the organizers (If not, they should explicitly state that they used their own data that was donated to the MIREX organizers, so that this is known in public and they can be put in a separate category. They could also submit two versions: one trained with the public data only, and one trained as they had done before using all of their own data. The point is that this must be clear to everyone so that the evaluation results can be interpreted correctly.)
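The agreed scoring (per-class F-measure with a 30 ms onset tolerance) could look roughly like the sketch below. This is an illustration only: the exact matching procedure used by the organizers is not specified here, so a simple greedy one-to-one matching is assumed, and all names are made up:

```python
def score_class(truth, detected, tol=0.030):
    """Return (precision, recall, F) for one drum type, matching each
    detection to at most one ground-truth onset within +/- tol seconds."""
    truth = sorted(truth)
    detected = sorted(detected)
    matched = 0
    used = [False] * len(truth)
    for d in detected:
        for i, t in enumerate(truth):
            if not used[i] and abs(d - t) <= tol:
                used[i] = True
                matched += 1
                break
    precision = matched / len(detected) if detected else 0.0
    recall = matched / len(truth) if truth else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# One BD onset missed, one spurious detection
p, r, f = score_class([0.500, 1.000, 1.500], [0.510, 1.020, 2.000])
print(p, r, f)
```

Per the final decisions, this would be run once per drum type (BD, SD, HH), and the three F-measures averaged.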
This is the final take-it-or-leave-it decision people have been asking for. Let's continue all further discussions next year (keep a log of the issues you have/had with this year's version!).
(might sound strange coming from my mouth, but still:)
Thanks to all those who provided data and thus made this contest possible in the first place.
Thanks also to everyone for the valuable discussions!
All the best and good luck!
PS Just to be clear: as Masataka already said, I have been contacted by the organizers to take up the "man-in-the-middle" position (together with Masataka), so I'm not trying to steer anything in my own personal direction here...