2007:Audio Drum Detection

From MIREX Wiki


The material below is largely taken from the 2005 page.


Jouni Paulus (Tampere University of Technology): 2007 resurrection attempt [original proposer in 2005 was Koen Tanghe (Ghent University)]


  • Christian Dittmar (Fraunhofer) dmr<at>idmt<dot>fraunhofer<dot>de
  • Jouni Paulus (Tampere University of Technology) jouni<dot>paulus<at>tut<dot>fi
  • (potentially) Amaury Hazan (Pompeu Fabra University) ahazan<at>iua<dot>upf<dot>edu
  • (potentially) Alexandre Lacoste (Montréal)


The task consists of determining the positions (localization) and corresponding drum class names (labeling) of drum events in polyphonic music. This rhythmic information is highly relevant to today's popular music genres: it can help in determining tempo and (sub)genre, and it can also be queried directly (for typical rhythmic sequences or patterns).

1) Input data

The only input for this task is a set of sound file excerpts adhering to the format and content requirements mentioned below.

Audio format:

  • CD-quality (PCM, 16-bit, 44100 Hz)
  • mono
  • 30-second excerpts (longer excerpts of whole pieces?)
  • files are named as "001.wav" to "999.wav" (or with another extension depending on the chosen format)
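
The format requirements above can be checked mechanically before a run. The following is a minimal sketch using Python's standard `wave` module; the function name `check_excerpt_format` is hypothetical and not part of any MIREX tooling, and it assumes the files are plain PCM WAV.

```python
import wave


def check_excerpt_format(path):
    """Return a list of format problems for one WAV file (empty if it conforms).

    Hypothetical helper: checks mono, 16-bit, 44100 Hz as listed above.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:  # sample width is reported in bytes
            problems.append(f"expected 16-bit, got {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() != 44100:
            problems.append(f"expected 44100 Hz, got {wav.getframerate()} Hz")
    return problems
```

Calling `check_excerpt_format("001.wav")` on a conforming file would return an empty list. The excerpt length is deliberately not checked here, since the list above leaves it open.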

Audio content:

  • polyphonic music with drums (most)
  • polyphonic music without drums (only few)
  • different genres / playing styles
  • both live performances and sequenced music
  • different types of drum sets (acoustic, electronic, ...)
  • at least 50 files
  • participants receive a representative subset in advance


Distributed data

  • a representative random subset of the data will be made available to all participants in advance of the evaluation (20% of all available files; the organizers know how many files they received on their FTP site)
  • this data can be used by the participants as they please
  • this data will not be used again during the evaluation

2) Output results

The output of this task is, for each sound file, an ASCII text file containing 2 columns, where each line represents a drum event. The first column is the position (in seconds) of the drum event, and the second column is the label for the drum event at that position. Multiple drum events may occur at the same time, so there may be multiple lines having the same value in the first column. The file names of the output files are the same as the audio files, but the extension is ".txt" (so: "001.txt" for "001.wav").

Classes and labels that are considered:

  • BD (bass drum)
  • SD (snare drum)
  • HH (hi-hat: open, closed, pedal, ...)
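
As an illustration of the two-column output format and the three labels, here is a minimal sketch of a reader and writer. The column separator and the number of decimals are not specified on this page: the sketch assumes whitespace-separated columns when reading, and writes a tab and three decimals (millisecond resolution). The names `parse_drum_events` and `format_drum_events` are hypothetical.

```python
VALID_LABELS = {"BD", "SD", "HH"}


def parse_drum_events(text):
    """Parse two-column text into a list of (time_in_seconds, label) tuples."""
    events = []
    for line in text.splitlines():
        fields = line.split()  # assumption: columns separated by whitespace
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2 or fields[1] not in VALID_LABELS:
            raise ValueError(f"malformed drum event line: {line!r}")
        events.append((float(fields[0]), fields[1]))
    return events


def format_drum_events(events):
    """Render (time, label) tuples as one event per line, sorted by time."""
    return "".join(f"{time:.3f}\t{label}\n" for time, label in sorted(events))
```

Simultaneous events simply appear as separate lines with the same time, for example a bass drum and a hi-hat both at 0.500 s.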

Evaluation Procedures

  • F-measure (harmonic mean of the recall rate and the precision rate, with beta parameter 1, so precision and recall are equally important) is calculated for each of the three drum types (BD, SD, and HH), resulting in three F-measure scores and their average score
  • speed measure: the time it takes to do the complete run from the moment your algorithm starts until the moment it stops will be reported (relevance?)
  • parameter: the limit of onset-deviation errors in calculating the above F-measure is 30 ms (so a range of [-30 ms, +30 ms] around the true times)
  • condition: the actual drum sounds (sound samples) used in the input audio signal of each song are not known in advance
  • condition: participants who provided data and who need in-advance training or tuning should only use the data made available to all participants by the organizers. (If not, they should explicitly state that they used their own data that was donated to the MIREX organizers, so that this is publicly known and they can be put in a separate category. They could also submit two versions: one trained with the public data only, and one trained as before using all of their own data. The point is that this must be clear to everyone so that the evaluation results can be interpreted correctly.)
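
The per-class F-measure with a 30 ms tolerance could be computed roughly as follows. This is a sketch under stated assumptions, not the official evaluation code: this page does not say how detections are paired with reference onsets, so the sketch assumes a greedy one-to-one matching in time order, and it returns 0 when the F-measure is undefined (for example, no reference and no detected events). All function names are hypothetical.

```python
def count_hits(reference, detected, tolerance=0.030):
    """Count detections within +/- tolerance seconds of a reference onset.

    Assumption: greedy matching in time order; each reference onset and
    each detection is used at most once.
    """
    ref, det = sorted(reference), sorted(detected)
    hits = i = j = 0
    while i < len(ref) and j < len(det):
        diff = det[j] - ref[i]
        if abs(diff) <= tolerance:
            hits, i, j = hits + 1, i + 1, j + 1
        elif diff < 0:
            j += 1  # detection too early to match: false positive
        else:
            i += 1  # reference onset was missed: false negative
    return hits


def f_measure(reference, detected, tolerance=0.030):
    """F-measure with beta = 1 (harmonic mean of precision and recall)."""
    hits = count_hits(reference, detected, tolerance)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0  # assumption: undefined cases score 0
    return 2 * precision * recall / (precision + recall)


def evaluate(ref_events, det_events, labels=("BD", "SD", "HH")):
    """Score (time, label) event lists per drum class, plus the class average."""
    scores = {}
    for label in labels:
        ref = [t for t, name in ref_events if name == label]
        det = [t for t, name in det_events if name == label]
        scores[label] = f_measure(ref, det)
    scores["average"] = sum(scores[label] for label in labels) / len(labels)
    return scores
```

The one-to-one constraint matters: without it, a single reference onset could be credited to several nearby detections and inflate precision.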


  • Should the systems be trained only with the provided subset or should it be allowed to utilise all data available? (I'd suggest allowing free use of any personal data.)
  • If free use is not allowed, should the methods using musicological modelling be trained also only on the provided training data? (Perhaps not. It is likely that there isn't enough data to train generic enough language models.)


From Andreas: I think we can largely rerun this contest as it was in 2005, provided we have at least 3 participants. I would encourage a 'bring your own model' type approach, and simply evaluate off of what we already have. I think a 'train on whatever data you have' approach is good, and shouldn't necessarily be limited to the public distribution, as it may not be expansive enough. Looking into adding some of the ENST data might be a good way to add some more data to the evaluation. I assume Olivier and Gael would possibly want to get involved once again, as Olivier believed there were some problems in his submission in 2005 that significantly impacted his performance. So Jouni, try and email past participants and let's get a discussion going here, regarding the task, and the inclusion of the ENST database, etc.

Jouni: This bring-your-own-model approach would be quite nice, I think. Of course it can be said that it puts participants in unequal positions since not everyone has lots of their own data, but I hope this is not a problem here. I've emailed past participants, but it still seems that it is not certain that the task will be run.