2005:Audio Genre

From MIREX Wiki
Revision as of 15:12, 31 January 2005 by Admin (talk | contribs)


Kris West (Univ. of East Anglia) kw@cmp.uea.ac.uk


Genre Classification from polyphonic audio.


The automatic classification of polyphonic musical audio (in PCM format) into a single high-level genre per example. If there is sufficient demand, a multiple genre track could be defined, requiring submissions to identify each genre (without prior knowledge of the number of labels), with the precision and recall scores calculated for each result.
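For the optional multiple-genre track, per-example precision and recall could be computed roughly as follows. This is only a sketch: the function name `precision_recall` and the convention of scoring an empty set as 0.0 are assumptions, not part of the proposal.

```python
def precision_recall(predicted, reference):
    """Precision and recall of one example's predicted genre set."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    # Assumption: an empty prediction or reference set scores 0.0.
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical example: two genres predicted, one of them correct,
# against three reference genres.
p, r = precision_recall({"Rock", "Pop"}, {"Rock", "Reggae", "Folk"})
```

Because the number of labels is not known in advance, a submission that outputs many genres gains recall at the cost of precision, which is why both scores are needed.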

1) Input data

The input for this task is a set of sound file excerpts adhering to the format, metadata and content requirements mentioned below.

Audio format:

  • CD-quality (PCM, 16-bit, 44100 Hz)
  • single channel (mono)
  • Either whole files or 1 minute excerpts
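A format check along these lines could be run on each file before evaluation. The sketch assumes the PCM audio is stored in WAV containers, which the proposal does not state, and the function name `is_valid_excerpt` is hypothetical.

```python
import wave


def is_valid_excerpt(path):
    """Return True if the WAV file is 16-bit, 44100 Hz, mono PCM."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getsampwidth() == 2            # 2 bytes = 16-bit samples
            and wav.getframerate() == 44100    # CD sample rate
            and wav.getnchannels() == 1        # mono
            and wav.getcomptype() == "NONE"    # uncompressed PCM
        )
```

The check does not look at duration, since both whole files and 1 minute excerpts are allowed.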

Audio content:

  • polyphonic music
  • data set should include at least 8 different genres (suggestions include: Pop, Jazz/Blues, Rock, Heavy Metal, Reggae, Ballroom Dance, Electronic/Modern Dance, Classical, Folk). "World" music should be excluded, as this is a common "catch-all" for ethnic/folk music that is not easily classified into another group and can contain music as diverse as Indian tabla and Celtic rock.
  • the classification could also be evaluated in two levels. For example, a rough level I: Rock/Pop vs. Classical vs. Jazz/Blues and a detailed level II: Rock, Pop (within Pop/Rock), Chamber music, orchestral music (within Classical), Jazz, Blues (within Jazz/Blues).
  • both live performances and sequenced music are eligible
  • Each class should be represented by a minimum of 100 examples, but 150 would be preferred. If possible the same number of examples should represent each class.
  • If possible a subset of data (20%) should be given to participants, in the contest format. It is not essential that these examples belong to the final database (distribution of which may be constrained by copyright issues), as they should primarily be used for testing correct execution of algorithm submissions.
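The two-level evaluation mentioned in the list above could be scored as sketched here. The mapping follows the example grouping given in the text; the names `LEVEL_I` and `two_level_accuracy` are hypothetical, and the labels are illustrative only.

```python
# Level II label -> level I label, following the example grouping in the text.
LEVEL_I = {
    "Rock": "Rock/Pop",
    "Pop": "Rock/Pop",
    "Chamber music": "Classical",
    "Orchestral music": "Classical",
    "Jazz": "Jazz/Blues",
    "Blues": "Jazz/Blues",
}


def two_level_accuracy(predicted, reference):
    """Return (level I accuracy, level II accuracy) for parallel label lists."""
    pairs = list(zip(predicted, reference))
    fine = sum(p == r for p, r in pairs) / len(pairs)
    rough = sum(LEVEL_I[p] == LEVEL_I[r] for p, r in pairs) / len(pairs)
    return rough, fine


# Hypothetical predictions: confusing Rock with Pop is an error only at level II.
rough, fine = two_level_accuracy(
    ["Rock", "Pop", "Jazz", "Blues"],
    ["Pop", "Pop", "Jazz", "Chamber music"],
)
```

In this example the level I accuracy is 0.75 and the level II accuracy is 0.5, because the Rock/Pop confusion is forgiven at the rough level.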


Metadata:

  • By definition each example must have a genre label corresponding to one of the output classes.
  • Where possible, existing genre labels should be confirmed by two or more non-entrants. Due to IP constraints, it is unlikely that we will be allowed to distribute any database to participants for metadata validation.
  • The training set should be defined by a text file with one entry per line, in the following format:

<example path and filename>\t<genre label>\n
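A reader for this training-set file might look like the following sketch. The function name `read_training_list` and the UTF-8 encoding are assumptions; the proposal fixes only the tab-separated, one-entry-per-line layout.

```python
def read_training_list(path):
    """Parse tab-separated 'path, genre label' lines into (path, label) pairs."""
    examples = []
    # Assumption: the list file is UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue  # tolerate blank lines, e.g. at the end of the file
            # Split on the last tab, so the label is always the final field.
            audio_path, label = line.rsplit("\t", 1)
            examples.append((audio_path, label))
    return examples
```

A line without a tab raises `ValueError`, which flags malformed entries early rather than silently skipping them.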

2) Output results

Results should be output into a text file with one entry per line in the following format:

<example path and filename>\t<genre classification>\n
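Writing results in this format could be as simple as the sketch below. The function name `write_results` and the UTF-8 encoding are assumptions.

```python
def write_results(path, classifications):
    """Write (audio path, genre) pairs, one tab-separated entry per line."""
    # newline="\n" keeps the line ending as a bare \n on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for audio_path, genre in classifications:
            handle.write(f"{audio_path}\t{genre}\n")
```

Fixing the line ending explicitly avoids results files that differ between Windows and Unix submissions.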

Potential Participants

  • Dan Ellis & Brian Whitman (Columbia University, MIT), dpwe@ee.columbia.edu, High
  • Elias Pampalk (ÖFAI), elias@oefai.at, High
  • George Tzanetakis (Univ. of Victoria), gtzan@cs.uvic.ca, High
  • Kris West (Univ. of East Anglia), kw@cmp.uea.ac.uk, High
  • Thomas Lidy & Andreas Rauber (Vienna University of Technology), lidy@ifs.tuwien.ac.at, rauber@ifs.tuwien.ac.at, High
  • Fabien Gouyon (Universitat Pompeu Fabra), fabien.gouyon@iua.upf.es, Medium
  • François Pachet (Sony CSL-Paris), pachet@csl.sony.fr, Medium

Evaluation Procedures

3-fold (or 5-fold, time permitting) cross-validation of all submissions, using an equal proportion of each class in each fold.
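One way to build folds with an equal proportion of each class is to shuffle every class separately and deal its examples round-robin. This is a sketch under that assumption, not the D2K framework's actual procedure; the name `stratified_folds` and the `seed` parameter are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(examples, k=3, seed=0):
    """Split (path, label) pairs into k folds with near-equal class proportions."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    rng = random.Random(seed)  # fixed seed so every submission sees the same folds
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)
        # Deal each class round-robin so every fold gets its share.
        for i, example in enumerate(members):
            folds[i % k].append(example)
    return folds
```

Each fold then serves once as the test set while the remaining folds form the training set. When class sizes are not divisible by `k`, fold sizes differ by at most one example per class.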

Evaluation measures:

  • Simple accuracy and standard deviation of results (in the event of uneven class sizes, both should be normalised according to class size).
  • Test the significance of differences between the error rates of each system at each iteration using McNemar's test, reporting the mean and standard deviation of the P-values.
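The two measures above could be computed as sketched here. Two choices are assumptions rather than part of the proposal: "normalised according to class size" is read as the mean of per-class accuracies, and McNemar's test is implemented in its exact binomial form rather than the chi-square approximation. The function names are hypothetical.

```python
from collections import defaultdict
from math import comb


def normalised_accuracy(predicted, reference):
    """Mean of per-class accuracies, so each class counts equally."""
    correct, total = defaultdict(int), defaultdict(int)
    for p, r in zip(predicted, reference):
        total[r] += 1
        correct[r] += (p == r)
    return sum(correct[c] / total[c] for c in total) / len(total)


def mcnemar_exact(pred_a, pred_b, reference):
    """Two-sided exact McNemar p-value for two systems on the same examples."""
    triples = list(zip(pred_a, pred_b, reference))
    # b: only system A is right; c: only system B is right.
    b = sum(a == r and x != r for a, x, r in triples)
    c = sum(a != r and x == r for a, x, r in triples)
    n = b + c
    if n == 0:
        return 1.0  # the systems never disagree in correctness
    # Under the null hypothesis the discordant pairs follow Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the examples on which exactly one system is correct influence the p-value, which is what makes the test suitable for comparing systems evaluated on identical folds.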

Evaluation framework:

Competition framework to be defined in Data-2-Knowledge, D2K (http://alg.ncsa.uiuc.edu/do/tools/d2k), that will allow submission of contributions in native D2K (using Music-2-Knowledge, http://www.isrl.uiuc.edu/~music-ir/evaluation/m2k/, first release due 20th Jan 2005), Matlab, Python and C++, using external code integration services provided in M2K. Submissions will be required to read in training set definitions from a text file in the format specified in section 1 and to output results in the format described in section 2 above. The framework will define the test and training sets for each iteration of cross-validation, evaluate and rank results, and perform McNemar's testing of differences between the error rates of each system. An example framework could be made available in early February for submission development.

Relevant Test Collections

  • Re-use Magnatune database (???)
  • Individual contributions of copyright-free recordings (including white-label vinyl and music DBs under Creative Commons)
  • Individual contributions of usable but copyright-controlled recordings (including in-house recordings from music departments)
  • Solicit contributions from http://creativecommons.org/audio/, http://www.mp3.com/ (offers several free audio streams) and similar sites

Ground truth annotations:

All annotations should be validated, rather than accepting supplied genre labels, by at least two non-participating volunteers (if possible). If copyright restrictions allow, this could be extended to each of the participating groups, with the final classification decided by a majority vote. Any particularly contentious classifications could be removed.
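The majority vote, with removal of contentious examples, could be sketched as follows. The function name `resolve_label` and the `min_share` threshold are hypothetical; the proposal does not define what counts as "particularly contentious".

```python
from collections import Counter


def resolve_label(votes, min_share=0.5):
    """Return the majority genre, or None if no label exceeds min_share.

    Assumes `votes` is a non-empty list of genre labels, one per annotator.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_share else None


# Hypothetical votes: a 2-to-1 majority is kept, an even split is dropped.
kept = resolve_label(["Rock", "Rock", "Pop"])
dropped = resolve_label(["Rock", "Pop"])
```

Examples for which the function returns `None` would be the candidates for removal from the collection.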

Review 2

The single genre problem is well defined and seems to be a relevant problem for the MIR community nowadays. Obviously, it would be more relevant to classify each track into multiple genres or to use a hierarchy of genres, but the proposal does not deal with these issues in a satisfying way. If a track belongs to several genres, are these genres equally weighted or not? Are they determined by asking several people to classify each track into one genre, or by asking each one to classify each track into several genres? If there are nodes for Electronic and Jazz/Blues, where does the leaf Electro-jazz lie? I suggest that the contest concentrate on the well-defined single genre problem. An interesting development would be to ask algorithms to assign a probability to each predefined genre for each track, instead of outputting a single genre with 100% probability. Regarding the input format, I think that whole files are better (the total duration and the volume variation are already good genre descriptors) and that polyphony should not be required (classical music contains many works for solo instruments).

I have no precise opinion regarding the defined genres, since this is more of a cultural matter. I'm not sure that Rock is less diverse than World (what do Elvis and Radiohead have in common?). Also, I am surprised that there is no Rap/RnB. The choice of the genre classes is a crucial issue if the contest is to be held several times. Indeed, existing databases can be reused only when the defined categories are identical each year. Thus I would like this choice to be discussed further by the participants.

The list of participants is relevant. McKinney and Breebaart could be added.

It is a good idea to accept many programming languages for submission. However, it seems quite difficult to implement the learning phase, because each algorithm may use different structures to store learnt data. For instance, when the algorithm computes descriptors and feeds them through a classifier, is it possible to select the best descriptors? If not, it is not realistic to suppose that the participant has to do it beforehand on his own limited set of data. I therefore see two possibilities: either participants are given 50% of the database and do all the learning work themselves (in which case no k-fold cross-validation is performed), or submissions concern only sets of descriptors and not full classification algorithms. The second choice has the advantage of allowing different sets of descriptors to be compared using the same classifiers.

The test data are relevant but still a bit vague. Obviously, existing databases should be used again and completed with new annotated data. The participants should list their own databases in detail and pool them for evaluation, so that the time needed to annotate new data can be estimated.