Difference between revisions of "2005:Audio Drum Det"
Revision as of 09:37, 7 March 2005
Proposer
Koen Tanghe (Ghent University) koen [dot] tanghe [at] ugent [dot] be
Title
Drum detection from polyphonic audio.
Description
The task consists of determining the positions (localization) and corresponding drum class names (labeling) of drum events in polyphonic music. This rhythmic information is of particular interest for today's popular music genres: it can help in determining tempo and (sub)genre, and it can also be queried directly (typical rhythmic sequences/patterns).
1) Input data
The only input for this task is a set of sound file excerpts adhering to the format and content requirements mentioned below.
Audio format:
- CD-quality (PCM, 16-bit, 44100 Hz)
- mono and stereo
- 30-second excerpts
- files are named as "001.wav" to "999.wav" (or with another extension depending on the chosen format)
Audio content:
- polyphonic music with drums (most)
- polyphonic music without drums (some)
- different genres / playing styles
- both live performances and sequenced music
- different types of drum sets (acoustic, electronic, ...)
- at least 50 files
- participants receive at least 10 files in advance
[Perfe 02/25/05: I would vote for mono and more than 50 files; the 10 files to be given to participants should be randomly drawn from the total pool of N available annotated files, unless there is some bias in the collection that is related to genre or another important class. I mean, if 30% of the files contain electronic percussion, then 3 of the 10 files should contain it too. In that case, stratified sampling should be used]
[Masataka 03/07/2005: I agree with Perfe's comments above. In addition, our team prefers to use whole songs (not excerpts). I would like to make sure that the input audio signals contain sounds of various musical instruments (some of them including vocals, too), and that the actual drum sounds (sound samples) included in the input mixture are not known in advance, because we have to deal with those situations in practical applications.]
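The stratified draw suggested in the comments above could be sketched as follows. This is only an illustration under assumptions: the function name `stratified_sample`, the stratum labels ("electronic", "acoustic") and the largest-remainder rounding are hypothetical choices, not part of the proposal.

```python
import random
from collections import defaultdict

def stratified_sample(files, n, seed=0):
    """Draw n file names so that each stratum keeps its share of the pool.

    files: dict mapping file name -> stratum label (hypothetical labels,
           e.g. genre or type of drum set).
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for name, stratum in sorted(files.items()):
        by_stratum[stratum].append(name)
    total = len(files)

    # Largest-remainder allocation, done in integer arithmetic:
    # each stratum first gets floor(n * size / total) files ...
    counts = {s: (n * len(names)) // total for s, names in by_stratum.items()}
    remainders = {s: (n * len(names)) % total for s, names in by_stratum.items()}
    # ... and the files still unassigned go to the largest remainders.
    leftover = n - sum(counts.values())
    for s in sorted(remainders, key=lambda s: (-remainders[s], s))[:leftover]:
        counts[s] += 1

    chosen = []
    for s, names in by_stratum.items():
        chosen.extend(rng.sample(names, counts[s]))
    return sorted(chosen)

# Example pool: 30% electronic percussion, 70% acoustic (made-up data).
pool = {f"{i:03d}.wav": ("electronic" if i <= 30 else "acoustic")
        for i in range(1, 101)}
advance_set = stratified_sample(pool, 10)
```

With this made-up pool, the 10 advance files contain 3 electronic and 7 acoustic excerpts, which matches the 30% example in the comment.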
2) Output results
The output of this task is, for each sound file, an ASCII text file containing 2 columns, where each line represents a drum event. The first column is the position (in seconds) of the drum event, and the second column is the label for the drum event at that position. Multiple drum events may occur at the same time, so there may be multiple lines having the same value in the first column. The file names of the output files are the same as the audio files, but the extension is ".txt" (so: "001.txt" for "001.wav").
Classes and labels that are considered:
- BD (bass drum)
- SD (snare drum)
- HH (hihat)
- CY (cymbal)
- TM (tom)
[Perfe 02/25/05: What about adding an "other" class? How are we going to manage the combination of sounds?]
[Masataka 03/07/2005: How about adding the option of evaluating only BD, SD, and HH?]
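A minimal sketch of reading and writing the two-column output format described above. The proposal does not fix the column separator or the number of decimals, so whitespace-separated columns and millisecond precision are assumptions here, and the function names are hypothetical.

```python
VALID_LABELS = {"BD", "SD", "HH", "CY", "TM"}

def parse_events(text):
    """Parse 'position label' lines into a sorted list of (seconds, label)."""
    events = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines (assumption)
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(fields)}")
        position, label = float(fields[0]), fields[1]
        if label not in VALID_LABELS:
            raise ValueError(f"line {line_no}: unknown label {label!r}")
        events.append((position, label))
    return sorted(events)

def format_events(events):
    """Write events back out, one per line, tab-separated (assumption)."""
    return "".join(f"{position:.3f}\t{label}\n" for position, label in sorted(events))

# Two simultaneous events share the same value in the first column.
example = "0.500 BD\n0.500 HH\n1.000 SD\n"
events = parse_events(example)
```

If an "other" class or a BD/SD/HH-only option were adopted, only `VALID_LABELS` would need to change.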
Potential Participants
- Vegard Sandvold (Notam), Fabien Gouyon (MTG, University of Pompeu Fabra), Perfecto Herrera (UPF)
vegardsa[at]student[dot]matnat[dot]uio[dot]no, fabien[dot]gouyon[at]iua[dot]upf[dot]es, perfe[at]iua[dot]upf[dot]es, likely
- Koen Tanghe (IPEM, Ghent University)
Koen[dot]Tanghe[at]UGent[dot]be, highly likely
- Christian Uhle (Fraunhofer)
uhle[at]idmt[dot]fraunhofer[dot]de, ???
- Anssi Klapuri (Tampere University of Technology)
klap[at]cs[dot]tut[dot]fi, not participating (Jouni Paulus represents our group)
- Kazuyoshi Yoshii (Kyoto University), Masataka Goto (AIST), Hiroshi G. Okuno (Kyoto University)
yoshii[at]kuis[dot]kyoto-u[dot]ac[dot]jp, m.goto[at]aist[dot]go[dot]jp, okuno[at]i[dot]kyoto-u[dot]ac[dot]jp, highly likely
- Derry FitzGerald (Cork Institute of Technology)
derry[dot]fitzgerald[at]cit[dot]ie, likely
- Gaël Richard (Telecom Paris)
gael[dot]richard[at]enst[dot]fr, very likely
- Jouni Paulus (Tampere University of Technology)
jouni[dot]paulus[at]tut[dot]fi, moderately likely
- George Tzanetakis (University of Victoria, Canada)
gtzan[at]cs[dot]uvic[dot]ca, moderately likely
Evaluation Procedures
Comparison rules: Questions to be answered:
- when do we consider a detected event as "correct"?
- when do we consider a detected event as "false"?
- when do we consider a ground truth event as "missed"?
- what's the maximum difference in time between real drum event position and detected drum event position that can be allowed?
- is detecting an event at a valid ground truth position but classifying it incorrectly as bad as not detecting the event at all?
Evaluation measures: which performance measure are we going to use? precision, recall, accuracy, F-measure, ...?
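To make the open questions above concrete, here is one possible set of answers, sketched in code. Everything in it is an assumption rather than an agreed rule: a detected event counts as "correct" if it lies within a tolerance of an unmatched ground truth event with the same label, each ground truth event can be matched only once, and a misclassified event counts both as a "false" detection and as a "missed" event. The 50 ms default is just a placeholder value.

```python
from collections import defaultdict

def count_matches(truth_times, detected_times, tolerance):
    """Greedy one-to-one matching of two time lists within +/- tolerance seconds."""
    truth_times, detected_times = sorted(truth_times), sorted(detected_times)
    i = j = hits = 0
    while i < len(truth_times) and j < len(detected_times):
        diff = detected_times[j] - truth_times[i]
        if abs(diff) <= tolerance:
            hits += 1      # close enough: pair them up
            i += 1
            j += 1
        elif diff < 0:
            j += 1         # detection too early for this truth event: false
        else:
            i += 1         # truth event has no detection nearby: missed
    return hits

def evaluate(truth, detected, tolerance=0.050):
    """truth, detected: lists of (seconds, label). Returns pooled counts and scores."""
    truth_by_label, detected_by_label = defaultdict(list), defaultdict(list)
    for position, label in truth:
        truth_by_label[label].append(position)
    for position, label in detected:
        detected_by_label[label].append(position)

    correct = false = missed = 0
    for label in set(truth_by_label) | set(detected_by_label):
        hits = count_matches(truth_by_label[label], detected_by_label[label], tolerance)
        correct += hits
        false += len(detected_by_label[label]) - hits
        missed += len(truth_by_label[label]) - hits

    precision = correct / (correct + false) if correct + false else 0.0
    recall = correct / (correct + missed) if correct + missed else 0.0
    f_measure = 2 * correct / (2 * correct + false + missed) if correct else 0.0
    return {"correct": correct, "false": false, "missed": missed,
            "precision": precision, "recall": recall, "f_measure": f_measure}

truth = [(0.50, "BD"), (1.00, "SD"), (1.50, "BD")]
detected = [(0.52, "BD"), (1.00, "HH"), (1.58, "BD"), (2.00, "SD")]
result = evaluate(truth, detected)
```

In this made-up example only the first bass drum hit is matched: the second one is 80 ms off, and the snare at 1.00 s was labeled HH, so it is penalized twice under the assumption stated above.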
Drum detection may serve several goals, so the evaluation should assess the algorithms relative to the intended goal or application. In our case, I believe the interest is "obtaining metadata that describe the drum track of a file". In this context, the ideal would be a kind of perceptual distance in the metadata domain, but do such distances exist? Is it possible to define one without conducting lengthy perceptual experiments? One possibility would be to use a distance similar to the one we have used for our drum loop query system (soon to be published in the special issue of JIIS). The basic idea is to compute an "edit distance" between the obtained metadata string and the ground truth metadata string. The edit distance counts deletions, insertions and confusions, but also takes into account desynchronisation between events and allows coefficients to be associated with confusions (for example, it is often less serious to miss a hi-hat ("charley") hit than a bass drum hit).
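As a rough illustration of the idea, the sketch below computes a generic weighted edit distance between two label sequences. It is not the distance from the drum loop query system mentioned above: it is a standard dynamic-programming edit distance with per-label costs, it ignores desynchronisation between events entirely, and all cost values are made up.

```python
def weighted_edit_distance(truth, detected, miss_cost, insert_cost, confusion_cost):
    """Weighted edit distance between two sequences of drum labels.

    miss_cost[label]        cost of a ground truth event that was not detected
    insert_cost[label]      cost of a detected event with no ground truth event
    confusion_cost[(a, b)]  cost of detecting b where the truth is a
                            (defaults to 1.0 when the pair is not listed)
    """
    n, m = len(truth), len(detected)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + miss_cost[truth[i - 1]]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + insert_cost[detected[j - 1]]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = truth[i - 1], detected[j - 1]
            confusion = 0.0 if a == b else confusion_cost.get((a, b), 1.0)
            d[i][j] = min(d[i - 1][j] + miss_cost[a],      # deletion
                          d[i][j - 1] + insert_cost[b],    # insertion
                          d[i - 1][j - 1] + confusion)     # match or confusion
    return d[n][m]

# Hypothetical coefficients: missing a hi-hat costs half as much as missing
# a bass drum or a snare drum.
miss = {"BD": 1.0, "SD": 1.0, "HH": 0.5}
insert = {"BD": 1.0, "SD": 1.0, "HH": 1.0}
missed_hihat = weighted_edit_distance(["BD", "HH", "SD", "HH"],
                                      ["BD", "SD", "HH"], miss, insert, {})
missed_bass = weighted_edit_distance(["BD", "SD"], ["SD"], miss, insert, {})
```

Here the missed hi-hat yields a distance of 0.5 and the missed bass drum a distance of 1.0, which is the kind of asymmetry the coefficients are meant to express.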
Relevant Test Collections
Ground truth annotations:
For each sound file to be analyzed, there is a corresponding annotation file using the same format as described in "2) Output results". The ground truth files are obtained by manual annotation by people who have experience with drum sounds (drummers?). RWC and Magnatune are potentially excellent sources. For annotation, it would be important to include an annotation cross-check (several drummers, ideally at least three, annotate the same files). This would be quite similar to the methodology that we followed for the onset detection evaluation (see P. Leveau's paper at the last ISMIR). It would yield an excellent ground truth annotation and would also make it possible to evaluate which kinds of confusion never occur and which occur often, what the acceptable maximum difference in time between real and detected drum events is, etc. However, as always, this requires more effort and time. Another option (for sequenced music) is to use *audio recordings* of MIDI sequences, and use the drum tracks of the MIDI files to obtain the ground truth annotations.
[Perfe 02/25/05: If MIDI files are used, I'd suggest adding some "human touch" MIDI post-processing, plus some basic audio production tricks such as compression and reverb, in order to make the audio as close as possible to the complexity of real recordings; I would not use more than 30% MIDI files, if needed]
[Masataka 03/07/2005: For annotation, I've been working on labeling all the onset times of BD, SD, and HH on more than 50 songs in the RWC Music Database (RWC-MDB-P-2001). I plan to put them on http://staff.aist.go.jp/m.goto/RWC-MDB/ so that they are available to RWC-MDB users.]
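For the MIDI option, the drum track could be turned into ground truth annotations roughly as sketched below. The sketch assumes the note-on events have already been extracted as (seconds, note number) pairs by some MIDI reader and that the files follow the General MIDI percussion key map. How the GM keys are grouped into the five classes (for example, counting the ride bell as CY and skipping side stick, hand clap, tambourine and cowbell) is a hypothetical choice.

```python
# General MIDI percussion keys grouped into the five proposed classes.
GM_NOTE_TO_LABEL = {}
for label, notes in {
    "BD": (35, 36),                       # acoustic bass drum, bass drum 1
    "SD": (38, 40),                       # acoustic snare, electric snare
    "HH": (42, 44, 46),                   # closed, pedal, open hi-hat
    "CY": (49, 51, 52, 53, 55, 57, 59),   # crash, ride, chinese, ride bell, splash
    "TM": (41, 43, 45, 47, 48, 50),       # floor, low, mid and high toms
}.items():
    for note in notes:
        GM_NOTE_TO_LABEL[note] = label

def midi_notes_to_events(note_ons):
    """Map (seconds, GM note number) pairs to sorted (seconds, label) events.

    Notes outside the five classes are skipped here (an assumption; an
    "other" class would be an alternative).
    """
    return sorted((position, GM_NOTE_TO_LABEL[note])
                  for position, note in note_ons
                  if note in GM_NOTE_TO_LABEL)

# Made-up drum track: kick + closed hi-hat, snare, tambourine (skipped), crash.
annotations = midi_notes_to_events([(0.0, 36), (0.0, 42), (0.5, 38),
                                    (0.75, 54), (1.0, 49)])
```

Note that the positions taken from the MIDI file only match the audio if the rendering introduces no latency, and any "human touch" post-processing would have to be applied before the annotations are derived.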
Review 1
Problem is both clearly defined and interesting in terms of current research.
Audio format and content are fine. It would be nice to include more than 50 files, although this would probably make the transcription task too difficult/time-consuming. Either mono or stereo recordings should be chosen; I suggest polling participants to see if anyone intends to use stereo information or whether all participants will down-mix to mono. There is no mention of transcribed datasets, so this will have to be done from scratch, and therefore the proposed use of the RWC or Magnatune databases is a good idea. I am unsure whether the use of synthesized MIDI files is valid unless they are produced using samples rather than synthesized drum sounds, and even then you would need to use several different samples of each sound to ensure enough variance for a proper evaluation. I agree that ground truth annotations should be produced by 2-3 non-participating transcribers.
The output result format is fine; however, there may be more classes of drum/percussive sounds that should be considered, such as maracas or a tambourine. Obviously this will depend on the content of the audio files used, and these sounds could form an abstract grouping if there are insufficient training examples for separate groups.
The evaluation procedures section contains more questions than answers. Obviously this task is quite dependent on the onset detection/segmentation. Paul Brossier proposed that, for the onset detection evaluation, events detected within 50ms of the transcribed position should be considered correct. I assume this holds for the drum detection proposal. I think it would be interesting and not too taxing to have two tracks: one supplying the ground-truth segmentation, requiring only the classification of detected events, and another performing the whole task.
Will submissions be run once or cross-validated? As the dataset is going to be very small, a high number of folds should be used, although this should be limited so that every fold contains at least one example of each class.
F-measure (mean and variance for cross-validated results) would seem to be the most applicable evaluation metric if the whole task is performed. Will precision and recall be given equal weighting in the F-measure? See Speech & Language Processing, Jurafsky and Martin, 2000, p.578, for the generalization of the F-measure: F = (b^2 + 1)PR / (b^2 P + R). When b=1, P and R have equal weight; with this formula, b>1 gives more weight to R and b<1 gives more weight to P. A simple accuracy result would be fine if segmentation is supplied. Statistical significance of differences between algorithms should be estimated, and it would be interesting to see the statistical significance of differences between using ground-truth segmentation and the detected segmentation, thereby allowing us to assess whether the segmentation or the event classification was at fault.
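The weighted formula F = (b^2 + 1)PR / (b^2 P + R) is easy to check numerically. The sketch below is a direct transcription of it; the function name and the example precision and recall values are made up. As b grows the value tends towards R, and as b approaches 0 it tends towards P.

```python
def f_measure(precision, recall, b=1.0):
    """Generalized F-measure: (b^2 + 1) P R / (b^2 P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was matched
    return (b * b + 1.0) * precision * recall / (b * b * precision + recall)

# A system with high precision but low recall (made-up numbers).
p, r = 0.9, 0.3
balanced = f_measure(p, r)          # b = 1: harmonic mean of P and R
recall_heavy = f_measure(p, r, 2)   # b > 1: pulled towards R
precision_heavy = f_measure(p, r, 0.5)  # b < 1: pulled towards P
```

For these values the three scores come out in the order recall_heavy < balanced < precision_heavy, since the recall of this hypothetical system is the weaker of the two rates.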
Finally, given the list of potential participants and their publications, I think we can be confident of sufficient participation to run the evaluation.
Recommendation: Refine proposal and accept
[Masataka 03/07/2005: I would also vote for the use of the F-measure, that is, the harmonic mean of the recall rate and the precision rate. I think the above-mentioned 50ms threshold for onset-deviation errors is too large for drum sounds: how about using 25ms, for example? We found it sufficient and appropriate when evaluating our method in our ISMIR 2004 paper: http://staff.aist.go.jp/m.goto/PAPER/ISMIR2004yoshii.pdf]
Review 2
The problem is well described and its applications are of great concern to the MIR community. However, the evaluation procedures and test data sections contain more questions than answers. The proposal should be much more affirmative. Precise evaluation metrics need to be defined (so that every participant can implement them in a reproducible way), and the choice of the test data has to be discussed (is it relevant to test algorithms on MIDI data only if different synthesizers are used, or is it necessary to use audio data?). This proposal is not mature enough yet, and the participants should put in some effort to improve it.
Another issue is that the problem is not MIR in itself, but rather mid-level sound description. If the main applications are tempo induction and subgenre classification, why not evaluate the performance for these applications directly? This would be more relevant for MIR, and annotation would be far less time-consuming. I think this issue has to be seriously considered by the participants in case they do not already own a sufficient amount of annotated data.
[Perfe 02/25/02: I do not agree that it is not MIR. It is MIR and it is high-level description. The main application is knowing if the song has drums, if there are lots of drums or only some sparse hits, if there are lots of cymbals or not (hence, some genres could be discarded). The direct application, on the other hand, is still a bit far off, as there are some perceptual issues involved, and perceptual issues require some time to be sorted out]
Downie's Comments
1. I am intrigued by the idea that MIDI or some other symbolic representation could be used to bring together the generation and ground truth tasks. Where does quantization fit into this (i.e., it is hard to "swing" MIDI files)?
2. If MIDI files are used for generation/ground truth, would it be necessary to introduce background music to make the task more difficult? I suppose the MIDI file could generate the background music also... I wonder if there are some other tricks we might be missing.