2005:Audio Drum Det
Koen Tanghe (Ghent University) firstname.lastname@example.org
Drum detection from polyphonic audio.
The task consists of determining the positions (localization) and corresponding drum class names (labeling) of drum events in polyphonic music. This rhythmic information is of great interest for today's popular music genres: it can help in determining tempo and (sub)genre, and it can also be queried directly (typical rhythmic sequences/patterns).
1) Input data The only input for this task is a set of sound file excerpts adhering to the format and content requirements mentioned below.
- CD-quality (PCM, 16-bit, 44100 Hz)
- mono and stereo
- 30-second excerpts
- files are named as "001.wav" to "999.wav" (or with another extension depending on the chosen format)
- polyphonic music with drums (most)
- polyphonic music without drums (some)
- different genres / playing styles
- both live performances and sequenced music
- different types of drum sets (acoustic, electronic, ...)
- at least 50 files
- participants receive at least 10 files in advance
2) Output results The output of this task is, for each sound file, an ASCII text file containing 2 columns, where each line represents a drum event. The first column is the position (in seconds) of the drum event, and the second column is the label for the drum event at that position. Multiple drum events may occur at the same time, so there may be multiple lines having the same value in the first column. The file names of the output files are the same as the audio files, but the extension is ".txt" (so: "001.txt" for "001.wav").
Classes and labels that are considered:
- BD (bass drum)
- SD (snare drum)
- HH (hihat)
- CY (cymbal)
- TM (tom)
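The two-column output format and the label set above can be sketched as a small reader/writer. This is only an illustration: the proposal does not fix the column separator or the number of decimals, so whitespace separation and millisecond precision are assumptions here, and the names `parse_events` and `format_events` are hypothetical.

```python
from typing import List, Tuple

# Class labels listed in the proposal.
LABELS = {"BD", "SD", "HH", "CY", "TM"}

Event = Tuple[float, str]  # (position in seconds, class label)


def parse_events(text: str) -> List[Event]:
    """Parse a two-column event file: position in seconds, then label.

    Assumption: columns are separated by whitespace (not specified in the proposal).
    """
    events = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(parts)}")
        position, label = float(parts[0]), parts[1]
        if label not in LABELS:
            raise ValueError(f"line {line_no}: unknown label {label!r}")
        events.append((position, label))
    return sorted(events)


def format_events(events: List[Event]) -> str:
    """Write events one per line, sorted by time.

    Assumption: tab separator and three decimals (millisecond precision).
    """
    return "".join(f"{position:.3f}\t{label}\n" for position, label in sorted(events))


# Two simultaneous events share the same value in the first column.
example = "0.500 BD\n0.500 HH\n1.000 SD\n"
events = parse_events(example)
```

Under the naming rule above, the events for "001.wav" would be written to "001.txt".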
Potential Participants
- Fabien Gouyon (MTG, University of Pompeu Fabra)
- Koen Tanghe (IPEM, Ghent University)
- Christian Uhle (Fraunhofer)
- Anssi Klapuri (Tampere University of Technology)
- Kazuyoshi Yoshii (Kyoto University)
- Derry FitzGerald (Dublin Institute of Technology)
- Gaël Richard (Telecom Paris)
Comparison rules: Questions to be answered:
- when do we consider a detected event as "correct"?
- when do we consider a detected event as "false"?
- when do we consider a ground truth event as "missed"?
- what's the maximum difference in time between real drum event position and detected drum event position that can be allowed?
- is detecting an event at a valid ground truth position but classifying it incorrectly as bad as not detecting the event at all?
Evaluation measures: which performance measures are we going to use? Precision, recall, accuracy, F-measure, ...?
Drum detection may have several goals, and the evaluation should therefore assess the algorithms relative to the initial goal or application. In our case, I believe the interest is "obtaining metadata that describe the drum track of a file". In this context, the ideal would be a kind of perceptual distance in the metadata domain, but do such distances exist? Is it possible to define one without conducting lengthy perceptual experiments? One possibility would be to use a distance similar to the one we used for our drum loop query system (soon to be published in the special issue of JIIS). The basic idea is to compute an edit distance between the obtained metadata strings and the ground truth metadata strings. The edit distance counts deletions, insertions and confusions, but it also takes into account desynchronisation between events and allows coefficients to be associated with confusions (for example, missing a hi-hat hit is often less serious than missing a bass drum hit).
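To make the idea of an edit distance with per-class coefficients concrete, here is a minimal sketch of a weighted Levenshtein distance over label sequences. It is not the distance from the JIIS paper: it ignores timing (desynchronisation) entirely, and the weights in `WEIGHT` as well as the rule for confusion costs are invented purely for illustration.

```python
from typing import Dict, List

# Hypothetical coefficients: missing or inventing a BD/SD hit costs more than a HH/CY hit.
WEIGHT: Dict[str, float] = {"BD": 1.0, "SD": 1.0, "TM": 0.75, "CY": 0.5, "HH": 0.5}


def confusion_cost(truth_label: str, detected_label: str) -> float:
    """Cost of labeling a truth event with the wrong class (assumed rule)."""
    if truth_label == detected_label:
        return 0.0
    return max(WEIGHT[truth_label], WEIGHT[detected_label])


def weighted_edit_distance(truth: List[str], detected: List[str]) -> float:
    """Weighted Levenshtein distance between two time-ordered label sequences."""
    n, m = len(truth), len(detected)
    # d[i][j] = cost of turning truth[:i] into detected[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + WEIGHT[truth[i - 1]]  # deletion = missed event
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + WEIGHT[detected[j - 1]]  # insertion = spurious event
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + WEIGHT[truth[i - 1]],
                d[i][j - 1] + WEIGHT[detected[j - 1]],
                d[i - 1][j - 1] + confusion_cost(truth[i - 1], detected[j - 1]),
            )
    return d[n][m]


# A missed hi-hat hit costs less than a missed bass drum hit.
missed_hihat = weighted_edit_distance(["BD", "HH", "SD", "HH"], ["BD", "SD", "HH"])
missed_bass = weighted_edit_distance(["BD", "HH", "SD"], ["HH", "SD"])
```

A full version would also need a cost term for the time offset between aligned events, which is left out here.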
Relevant Test Collections
Ground truth annotations:
For each sound file to be analyzed, there is a corresponding annotation file in the same format as the output files described under "Output results". The ground truth files are obtained through manual annotation by people who have experience with drum sounds (drummers?). RWC and Magnatune are potentially excellent sources. It would be important to cross-check the annotations by having several drummers (three would be an ideal minimum) annotate the same files. This is quite similar to the methodology that we followed for the onset detection evaluation (see P. Leveau's paper at the last ISMIR). Cross-checking would yield an excellent ground truth annotation and would also show which confusions never occur, which ones occur often, what the acceptable maximum difference in time between real and detected drum events is, etc. However, as always, this requires more effort and time. Another option (for sequenced music) is to use *audio recordings* of MIDI sequences and to derive the ground truth annotations from the drum tracks of the MIDI files.
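For the MIDI option, the drum track would have to be mapped onto the five classes. The sketch below assumes the files follow the General MIDI percussion key map and that note-on times have already been converted to seconds by some MIDI reader (not shown). Which keys belong to which class is a judgment call (for instance, counting the ride bell as CY), and the names `GM_DRUM_CLASSES` and `notes_to_events` are hypothetical.

```python
from typing import Dict, List, Tuple

# General MIDI percussion key numbers grouped into the five proposed classes.
# Keys with no obvious class (side stick, hand clap, tambourine, ...) are left out.
GM_DRUM_CLASSES: Dict[int, str] = {
    35: "BD", 36: "BD",                      # acoustic bass drum, bass drum 1
    38: "SD", 40: "SD",                      # acoustic snare, electric snare
    42: "HH", 44: "HH", 46: "HH",            # closed, pedal and open hi-hat
    41: "TM", 43: "TM", 45: "TM",            # floor toms, low tom
    47: "TM", 48: "TM", 50: "TM",            # mid toms, high tom
    49: "CY", 51: "CY", 52: "CY", 53: "CY",  # crash 1, ride 1, chinese, ride bell
    55: "CY", 57: "CY", 59: "CY",            # splash, crash 2, ride 2
}


def notes_to_events(note_ons: List[Tuple[float, int]]) -> List[Tuple[float, str]]:
    """Convert (time in seconds, key number) note-ons from a drum track
    into (time, label) ground truth events, dropping unmapped keys."""
    events = [
        (time_s, GM_DRUM_CLASSES[key])
        for time_s, key in note_ons
        if key in GM_DRUM_CLASSES
    ]
    return sorted(events)


# Bass drum and closed hi-hat together, then a snare, then a tambourine (dropped).
events = notes_to_events([(0.0, 36), (0.0, 42), (0.5, 38), (0.75, 54)])
```

Any latency between the MIDI note-on and the audible onset in the rendered audio would still have to be accounted for.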
Problem is both clearly defined and interesting in terms of current research.
Audio format and content are fine. It would be nice to include more than 50 files, although this would probably make the transcription task too difficult/time-consuming. Either mono or stereo recordings should be chosen; I suggest polling participants to see whether anyone intends to use stereo information or whether all participants will down-mix to mono. There is no mention of transcribed datasets, so the transcription will have to be done from scratch, and the proposed use of the RWC or Magnatune databases is therefore a good idea. I am unsure whether the use of synthesized MIDI files is valid unless they are produced using samples rather than synthesized drum sounds, and even then several different samples of each sound would be needed to ensure enough variance for a proper evaluation. I agree that ground truth annotations should be produced by 2-3 non-participating transcribers.
The output result format is fine. However, there may be more classes of drum/percussive sounds that should be considered, such as maracas or tambourine. Obviously this will depend on the content of the audio files used, and these sounds could form an abstract grouping if there are insufficient training examples for separate groups.
The evaluation procedures section contains more questions than answers. Obviously this task is quite dependent on onset detection/segmentation. Paul Brossier proposed that, for the onset detection evaluation, events detected within 50 ms of the transcribed position should be considered correct. I assume this holds for the drum detection proposal as well. I think it would be interesting and not too taxing to have two tracks: one supplying the ground-truth segmentation and requiring only the classification of detected events, and another performing the whole task.
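One possible reading of the 50 ms rule for the full task is sketched below: a detected event counts as correct if an unmatched ground truth event of the same class lies within the tolerance. The greedy nearest-neighbour matching and the choice to count a wrongly labeled detection as both a false detection and a missed event are assumptions, since the proposal leaves these rules open; `match_events` is a hypothetical name.

```python
from typing import List, Tuple

Event = Tuple[float, str]  # (position in seconds, class label)


def match_events(
    truth: List[Event], detected: List[Event], tolerance: float = 0.050
) -> Tuple[int, int, int]:
    """Return (correct, false, missed) counts.

    Each ground truth event can be matched by at most one detected event
    of the same class within `tolerance` seconds (greedy, nearest first).
    """
    truth_sorted = sorted(truth)
    used = [False] * len(truth_sorted)
    correct = 0
    false = 0
    for det_time, det_label in sorted(detected):
        best_index = None
        best_diff = None
        for i, (ref_time, ref_label) in enumerate(truth_sorted):
            if used[i] or ref_label != det_label:
                continue
            diff = abs(det_time - ref_time)
            if diff <= tolerance and (best_diff is None or diff < best_diff):
                best_index, best_diff = i, diff
        if best_index is None:
            false += 1  # no same-class truth event close enough
        else:
            used[best_index] = True
            correct += 1
    missed = used.count(False)
    return correct, false, missed


truth = [(0.5, "BD"), (0.5, "HH"), (1.0, "SD")]
detected = [(0.52, "BD"), (0.5, "SD"), (1.04, "SD"), (2.0, "HH")]
counts = match_events(truth, detected)  # (correct, false, missed)
```

In the example, the snare detected at 0.5 s sits on a valid event position but carries the wrong label, so under this reading it is penalised twice: once as a false detection and once through the missed hi-hat.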
Will submissions be run once or cross-validated? As the dataset is going to be very small, a high number of folds should be used, although the number should be limited so that every fold contains at least one example of each class.
F-measure (mean and variance for cross-validated results) would seem to be the most applicable evaluation metric if the whole task is performed. Will precision and recall be given equal weighting in the F-measure? See Speech & Language Processing, Jurafsky and Martin, 2000, p. 578, for the generalization of the F-measure: $F = \frac{(b^2 + 1) P R}{b^2 P + R}$. When b = 1, P and R have equal weight; with this formula, b > 1 gives more weight to R and b < 1 to P. A simple accuracy result would be fine if segmentation is supplied. Statistical significance of differences between algorithms should be estimated, and it would be interesting to see the statistical significance of differences between using the ground-truth segmentation and the detected segmentation, thereby allowing us to assess whether the segmentation or the event classification was at fault.
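As a worked illustration of the generalized F-measure quoted above, the following sketch computes P, R and F from counts of correct, false and missed events. The function name `f_measure` and the convention of returning 0 when nothing is detected or annotated are assumptions.

```python
def f_measure(correct: int, false: int, missed: int, b: float = 1.0) -> float:
    """Generalized F-measure F = (b^2 + 1) P R / (b^2 P + R).

    P = correct / (correct + false), R = correct / (correct + missed).
    Returns 0.0 when the score is undefined (assumed convention).
    """
    detected_total = correct + false
    truth_total = correct + missed
    precision = correct / detected_total if detected_total else 0.0
    recall = correct / truth_total if truth_total else 0.0
    denominator = b * b * precision + recall
    if denominator == 0.0:
        return 0.0
    return (b * b + 1.0) * precision * recall / denominator


# Example: 2 correct, 2 false, 1 missed gives P = 0.5 and R = 2/3.
balanced = f_measure(2, 2, 1)            # b = 1
high_b = f_measure(2, 2, 1, b=2.0)
low_b = f_measure(2, 2, 1, b=0.5)
```

In this example the balanced score is 4/7, about 0.571; b = 2 moves the score toward R (0.625), while b = 0.5 moves it toward P (about 0.526).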
Finally, given the list of potential participants and their publications, I think we can be confident of sufficient participation to run the evaluation.
Recommendation: Refine proposal and accept
The problem is well described and its applications are of great concern to the MIR community. However, the evaluation procedures and test data sections contain more questions than answers. The proposal should be much more affirmative. Precise evaluation metrics need to be defined (so that every participant can implement them in a reproducible way), and the choice of test data has to be discussed (is it relevant to test algorithms on MIDI data only if different synthesizers are used, or is it necessary to use audio data?). The proposal is not mature enough yet, and the participants should put some effort into improving it.
Another issue is that the problem is not MIR in itself, but rather mid-level sound description. If the main applications are tempo induction and subgenre classification, why not evaluate the performance for these applications directly? This would be more relevant for MIR, and annotation would be far less time-consuming. I think the participants have to consider this issue seriously if they do not already own a sufficient amount of annotated data.