2005:Symbolic Genre Class



Cory McKay (McGill University) cory.mckay@mail.mcgill.ca


3rd AMENDED Genre Classification of MIDI Files Proposal


Submitted software will automatically classify MIDI recordings into genre categories.

1) Genre Categories

Two sets of genre categories will be used, one with few categories and one with many. Systems will be trained and tested separately on these two taxonomies. This will measure how well the systems perform coarse classification as well as fine classification.

Although the categories will be organized hierarchically, submitted software will only need to produce classifications of leaf categories. Entrants who have not implemented hierarchical classification can therefore treat the problem as a flat classification among the leaf categories and ignore the hierarchy. A hierarchical structure is suggested because it reflects the natural way in which humans appear to organize genre, and because it allows entrants to take advantage of hierarchical classification techniques if desired.

Based on responses to the original proposal, each recording will belong to one and only one category. Although the original proposal of allowing multiple memberships is more realistic, it is inconsistent with how most systems have been implemented. Furthermore, allowing multiple classifications would have greatly complicated the task of the evaluation committee, so the choice of requiring single category classifications is probably better for all involved.

I have two taxonomies that I have used in past research, one consisting of 9 unique leaf categories, and the other of 38 unique leaf categories. I also have 25 hand-labelled MIDI recordings for each category (a total of 950 annotated recordings) that could be used for training and/or validation. Although alternative suggestions are certainly welcome, I propose these taxonomies and recordings simply because I have them and am more than willing to share them.

The taxonomies I propose are as follows. Leaf categories, which are the only labels that systems need to output (so that flat classification remains possible), are marked with a plus sign to their right:

9 Leaf Category Taxonomy:

  • Jazz
    • Bebop +
    • Jazz Soul +
    • Swing +
  • Popular
    • Rap +
    • Punk +
    • Country +
  • Western Classical
    • Baroque +
    • Modern Classical +
    • Romantic +

38 Leaf Category Taxonomy:

  • Country
    • Bluegrass +
    • Contemporary +
    • Trad. Country +
  • Jazz
    • Bop
      • Bebop +
      • Cool +
    • Fusion
      • Bossa Nova +
      • Jazz Soul +
      • Smooth Jazz +
    • Ragtime +
    • Swing +
  • Modern Pop
    • Adult Contemporary +
    • Dance
      • Dance Pop +
      • Pop Rap +
      • Techno +
    • Smooth Jazz +
  • Rap
    • Hardcore Rap +
    • Pop Rap +
  • Rhythm and Blues
    • Blues
      • Blues Rock +
      • Chicago Blues +
      • Country Blues +
      • Soul Blues +
    • Funk +
    • Jazz Soul +
    • Rock and Roll +
    • Soul +
  • Rock
    • Classic Rock
      • Blues Rock +
      • Hard Rock +
      • Psychedelic +
    • Modern Rock
      • Alternative Rock +
      • Hard Rock +
      • Metal +
      • Punk +
  • Western Classical
    • Baroque +
    • Classical +
    • Early Music
      • Medieval +
      • Renaissance +
    • Modern Classical +
    • Romantic +
  • Western Folk
    • Bluegrass +
    • Celtic +
    • Country Blues +
    • Flamenco +
  • Worldbeat
    • Latin
      • Bossa Nova +
      • Salsa +
      • Tango +
    • Reggae +

Please note that, in the 38-category taxonomy, some leaf categories belong to more than one parent. This is done in order to model the interrelationships between categories more realistically. Those performing flat classification can ignore this fact and simply use the 38 unique leaf categories.
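As a rough illustration of how the shared leaves could be handled, the sketch below encodes the 38-leaf taxonomy listed above as a nested dictionary and collects the unique leaf labels. The dictionary layout and the names `TAXONOMY_38` and `leaf_parents` are assumptions made for this example, not part of the proposal.

```python
# Nested dict: a value of None marks a leaf category (a "+" in the listing above).
TAXONOMY_38 = {
    "Country": {"Bluegrass": None, "Contemporary": None, "Trad. Country": None},
    "Jazz": {
        "Bop": {"Bebop": None, "Cool": None},
        "Fusion": {"Bossa Nova": None, "Jazz Soul": None, "Smooth Jazz": None},
        "Ragtime": None,
        "Swing": None,
    },
    "Modern Pop": {
        "Adult Contemporary": None,
        "Dance": {"Dance Pop": None, "Pop Rap": None, "Techno": None},
        "Smooth Jazz": None,
    },
    "Rap": {"Hardcore Rap": None, "Pop Rap": None},
    "Rhythm and Blues": {
        "Blues": {"Blues Rock": None, "Chicago Blues": None,
                  "Country Blues": None, "Soul Blues": None},
        "Funk": None,
        "Jazz Soul": None,
        "Rock and Roll": None,
        "Soul": None,
    },
    "Rock": {
        "Classic Rock": {"Blues Rock": None, "Hard Rock": None, "Psychedelic": None},
        "Modern Rock": {"Alternative Rock": None, "Hard Rock": None,
                        "Metal": None, "Punk": None},
    },
    "Western Classical": {
        "Baroque": None,
        "Classical": None,
        "Early Music": {"Medieval": None, "Renaissance": None},
        "Modern Classical": None,
        "Romantic": None,
    },
    "Western Folk": {"Bluegrass": None, "Celtic": None,
                     "Country Blues": None, "Flamenco": None},
    "Worldbeat": {
        "Latin": {"Bossa Nova": None, "Salsa": None, "Tango": None},
        "Reggae": None,
    },
}


def leaf_parents(tree, parent=None, found=None):
    """Map each leaf name to the list of parents it appears under."""
    if found is None:
        found = {}
    for name, children in tree.items():
        if children is None:
            found.setdefault(name, []).append(parent)
        else:
            leaf_parents(children, name, found)
    return found


LEAVES = leaf_parents(TAXONOMY_38)
flat_labels = sorted(LEAVES)  # label set for flat classification
shared = sorted(name for name, parents in LEAVES.items() if len(parents) > 1)
print(len(flat_labels), "unique leaves;", len(shared), "have more than one parent")
```

A flat classifier would use only `flat_labels`; a hierarchical one could use the parent lists in `LEAVES` to decide how to treat leaves such as Pop Rap that sit under two branches.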

Of course, as stated earlier, alternative suggestions are certainly welcome, and it may be desirable to use subsets of these taxonomies, as some participants have expressed an interest in using a smaller number of categories.

Instead of the original suggestions of XML files, it has been decided by general consensus and by the organizers to encode model classifications and results in .txt files, with one line for each file, as follows:

<example path and filename>\t<genre label>\n

where \t denotes a tab and \n the end of line. The < and > characters are not included.

2) Training and Testing Recordings

The evaluation committee has asked that each participant submit any labelled MIDI recordings they have (using the label format described just above) to the symbolicGenreClassification directory of the following FTP site: ftp://ftp.ncsa.uiuc.edu/incoming/NCSA/mirex/. If you want to arrange an alternative mode of transport for the data, please contact Stephen Downie at jdownie@uiuc.edu. Either MIDI excerpts or full-length MIDI recordings may be submitted.

The evaluation committee will then choose appropriate training and testing recordings.

The evaluation committee may wish to test the entrants using cross-validation. This has the advantage that no additional validation set needs to be collected and annotated by the committee. However, the committee would then need to train each system itself, which may pose problems for some systems, and some participants have expressed a preference for training their own systems.
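For concreteness, one way the cross-validation option could be set up is sketched below. The number of folds (five), the stratification by genre, and the function names are all assumptions for illustration; the proposal does not fix any of them.

```python
import random
from collections import defaultdict


def stratified_folds(labelled_files, k=5, seed=0):
    """Split (path, genre) pairs into k folds with genres spread evenly."""
    by_genre = defaultdict(list)
    for path, genre in labelled_files:
        by_genre[genre].append(path)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for genre in sorted(by_genre):
        paths = by_genre[genre]
        rng.shuffle(paths)
        # Deal this genre's files round-robin so each fold gets a near-equal share.
        for i, path in enumerate(paths):
            folds[i % k].append((path, genre))
    return folds


def train_test_splits(folds):
    """Yield (train, test) pairs, holding out one fold at a time."""
    for i, test in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


# Toy data: 3 genres x 25 files, mirroring the 25 recordings per category.
data = [(f"{g}_{n:02d}.mid", g) for g in ("Bebop", "Punk", "Baroque") for n in range(25)]
folds = stratified_folds(data)
print([len(fold) for fold in folds])
```

With 25 files per genre and five folds, each held-out fold contains 5 files of every genre, so every system is tested on each recording exactly once.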

It may therefore be preferable for the evaluation committee to distribute training recordings to all participants so that each can train their own system. The evaluation committee could then use its own small validation set to test performance. If training is performed ahead of time by individual participants, the validation recordings should be kept confidential until after evaluation to ensure that no classifiers were trained on them. This would, however, require gathering and annotating perhaps 10 validation files per category (380 files for the 38-category taxonomy). The committee may therefore wish to reduce the number of categories in the 38-category taxonomy.

3) Input Data

The evaluation committee has specified the following command-line based calling format:

CommandName inputFileNameAndPath outputFileNameAndPath

and the reverse:

CommandName outputFileNameAndPath inputFileNameAndPath

CommandName inputFileNameAndPath (here the output file name is created by adding the extension ".output", e.g. an input of "001.mid" produces an output of "001.output")

CommandName inputFileNameAndPath1 inputFileNameAndPath2 outputFileNameAndPath (also command out in1 in2 & command in1 out in2)

CommandName inputFileNameAndPath1 inputFileNameAndPath2 outputFileNameAndPath (output filename created by adding an extension to inputFileNameAndPath1, e.g. ".features")
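A minimal sketch of an entry point that accepts two of these calling formats (input followed by an optional output path) might look as follows. The function names are hypothetical, and the sketch assumes that "adding an extension" means replacing the input's extension, as the "001.mid" to "001.output" example suggests.

```python
import sys
from pathlib import Path


def resolve_output_path(input_path, output_path=None):
    """Return the explicit output path, or derive one from the input name."""
    if output_path is not None:
        return Path(output_path)
    # "001.mid" -> "001.output", matching the example in the calling format
    return Path(input_path).with_suffix(".output")


def main(argv):
    # Handles only: CommandName input [output]. Other orderings are not covered.
    if len(argv) not in (2, 3):
        print(f"usage: {argv[0]} inputFileNameAndPath [outputFileNameAndPath]",
              file=sys.stderr)
        return 2
    input_path = argv[1]
    explicit = argv[2] if len(argv) == 3 else None
    output_path = resolve_output_path(input_path, explicit)
    print(f"would read {input_path} and write {output_path}")
    return 0
```

A real submission would call `sys.exit(main(sys.argv))` from its script entry point and replace the `print` with the actual classification step.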

4) Output Data

Instead of the original suggestions of XML files, it has been decided by general consensus and by the organizers to encode model classifications and results in .txt files, with one line for each file, as follows:

<example path and filename>\t<genre label>\n

where \t denotes a tab and \n the end of line. The < and > characters are not included.
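The format is simple enough that reading and writing it takes only a few lines. The sketch below is one possible implementation; the function names and the example paths are made up for illustration.

```python
import io


def write_results(stream, results):
    """Write (path, genre) pairs, one tab-separated line per file."""
    for path, genre in results:
        if any(ch in text for text in (path, genre) for ch in "\t\n"):
            raise ValueError("paths and labels must not contain tabs or newlines")
        stream.write(f"{path}\t{genre}\n")


def read_results(stream):
    """Parse the tab-separated format back into a list of (path, genre) pairs."""
    results = []
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        path, genre = line.split("\t")
        results.append((path, genre))
    return results


buffer = io.StringIO()
write_results(buffer, [("/data/midi/001.mid", "Bebop"),
                       ("/data/midi/002.mid", "Baroque")])
buffer.seek(0)
print(read_results(buffer))
```

The same pair of functions would work on real files opened with `open(path, "w")` and `open(path)`.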


Participants

  • Cory McKay and Ichiro Fujinaga (McGill University), cory.mckay@mail.mcgill.ca
  • Ming Li (University of East Anglia), mli@cmp.uea.ac.uk
  • Pedro J. Ponce de Leon and Jose M. Inesta (Universidad de Alicante), pierre@dlsi.ua.es
  • Alfredo Serafini, Roberto Basili and Armando Stellato (University of Rome Tor Vergata), ser.alf@inwind.it, basili@info.uniroma2.it, stellato@info.uniroma2.it
  • George Tzanetakis (University of Victoria), gtzan@cs.uvic.ca

Potential Participants

  • Expressed interest: Rudi Cilibrasi (CWI) cilibrar@cwi.nl
  • Contacted but no response yet: Man-Kwan Shan & Fang-Fei Kuo (National Cheng Chi University), mkshan@cs.nccu.edu.tw
  • Gao Sheng and Kai Chen (Institute for Infocomm Research(A*STAR)), gaosheng@i2r.a-star.edu.sg, kchen@i2r.a-star.edu.sg

Evaluation Procedures

Entries will be evaluated based on their success rates with respect to both fine and coarse classifications. Entrants will have the option of enabling their software to output classifications of "unknown," which will be penalized less severely during evaluation than misclassifications, as classifications flagged as uncertain are much better than false classifications in a practical context.
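The proposal does not say how much less severely "unknown" is penalized. Purely as an illustration, the sketch below gives full credit for a correct label, partial credit for "unknown", and none for a wrong or missing label. The `unknown_credit` value of 0.5 is an arbitrary assumption, not a figure from the proposal.

```python
UNKNOWN = "unknown"


def score(predictions, truth, unknown_credit=0.5):
    """Average credit over all files in truth (both map path -> genre label)."""
    if not truth:
        raise ValueError("truth must not be empty")
    total = 0.0
    for path, actual in truth.items():
        predicted = predictions.get(path)  # a missing prediction earns no credit
        if predicted == actual:
            total += 1.0
        elif predicted == UNKNOWN:
            total += unknown_credit
    return total / len(truth)


truth = {"a.mid": "Bebop", "b.mid": "Punk", "c.mid": "Baroque", "d.mid": "Rap"}
predictions = {"a.mid": "Bebop", "b.mid": "unknown", "c.mid": "Romantic", "d.mid": "Rap"}
print(score(predictions, truth))  # 2 correct + 0.5 for one unknown, over 4 files
```

Setting `unknown_credit=0` reduces this to the plain success rate, which makes it easy to report both figures side by side.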

Submissions in C/C++, Java, MATLAB and Python (and other languages?) will be accepted.

Relevant Test Collections

  • The 950 MIDI files I have available
  • Collections of other participants (e.g. Pedro in particular has expressed a willingness to share his MIDI files)
  • On-line repositories of MIDI files (sample links available at http://www.music.mcgill.ca/~cmckay/midi.html, although these were collected about a year ago)
  • Research databases

Review 1

The problem is very interesting for MIR, but too vaguely described. The role of the committee is not to propose anything, but to review the proposed evaluation sessions. Thus the author should propose a detailed list of genres and corresponding data.

I'm not against organizing the genres hierarchically and associating several genres to each file, but this raises many issues that are not discussed at all here. If a track belongs to several genres, are these genres equally weighted or not? Are they determined by asking several people to classify each track into one genre, or by asking each person to classify each track into several genres? If there are coarse categories for classical and folk music, where does the fine category of classical music adapted from folk songs belong? I suggest that the contest concentrate on the single-genre problem.

The choice of the genre classes is a crucial issue for the contest to be held several times. Indeed existing databases can be reused only when the defined categories are identical each year. Obviously the list of categories should reflect the list of MIDI music available on the internet. It would help if some data were already labeled according to this list.

The list of relevant data should be developed. How many files are needed for learning and testing? Have the participants already collected some labeled data that they could give to the organizers? How much?

Regarding the release of the data, I think that it would be better not to release anything. The training and test data should always be accessible through the D2K interface, and thus no copyright problem would appear. Is it possible to ensure that the test data are used only for testing and not for learning? Is it possible to implement learning easily in M2K? (Each algorithm may use different structures to store learnt data.)

Finally, the evaluation procedure seems nice, but I don't have any clue whether the proposed participants are really interested.

Review 2

This is an interesting topic, one that I haven't seen much work on. I do not believe that it's difficult to get a large collection of MIDI files. Many are in the public domain, were never intended to be copyrighted, or have copyleft / Creative Commons licences. However, it's still difficult to assemble a reasonable collection of MIDI files of appropriate length which accurately represent a sufficient number of genres. This must be addressed.

A key point is that it requires the Contest Committee to hand-label a large number of MIDI files. We also need to determine what our genres are. Is the Committee capable and willing to do this? I personally would find it very difficult to determine the genre of a MIDI recording which I don't recognize. MIDI all sounds like Muzak to me, unless I know the original audio recording. Has anyone tried MIDI-based genre classification before?

I have no problems with the suggested evaluation and testing procedures.

I think we need some more feedback on whether people are really interested in this. Most researchers who use MIDI, to my knowledge, aren't concerned with genre issues. George typically works with audio, so the proposer is the only one I know of who is interested. I could be wrong, so let's ask around. We also need to explore the hand-labelling task, and to see if we can assemble a decent collection (which we should do regardless of this proposal).

If there is significant interest, and the labeling can be done, then we should accept it.

Downie's Comments

1. Happy to see another symbolic proposal!

2. See my comments w/r/t the Audio Genre proposal. We need to make these two tasks as similar as possible!

Rudi's Comments

Looks good to me. I agree with the first reviewer's comments wholeheartedly. I think it may be too complicated to do hierarchical genre classification. What if we just restrict ourselves to just 2-5 genres and pretend they are disjoint? And then only pick songs that clearly fit one or the other. I'm not against a hierarchical system necessarily, but it does seem like it may involve so much more work and arbitrariness in labelling, scoring, etc. If you just want to get something a bit more interesting than simple Jazz / Rock / Classical then how about happy / sad music? We could train on two different dimensions in two parts (or perhaps on the same set of songs?) to add a little variety without much additional complexity on the part of the participants or organizers. Or how about "hit song" (greater than 1 million copies sold or something) versus "not hit" like that Hit Song predictor that got some press lately.

I agree that we will need to get some more parameters about the number of MIDI files involved in the experiment. Let's put a finer point on the data model. Each training sample will have

  • a MIDI file, provided as an absolute pathname string
  • a string song title
  • a string artist/group name
  • a numerical genre classification code as an integer
  • any other codes (e.g. happy/sad or hit/not-hit) also as integers

On each run of the system, the training set will be partitioned into five parts and set up for five-fold cross-validation testing. It will be given each song for training along with one of the N different integer label codes for whatever test is in progress. The program will read from standard input the following information, one record per line:

  • first line, the number of MIDI songs for training as ASCII decimal, then a space, then the number of testing songs, then a newline
  • next the training songs, one per line, with an ASCII decimal label code then a space then the absolute filename of the MIDI file, then a newline
  • next the testing songs, one per line, as an absolute filename

The program is to output one integer prediction per line, for each test song, in order.

The program may assume the current directory is readable, searchable, and writable.
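A submission following this protocol could be structured roughly as below. The parser follows the three-part input described above; the majority-vote "classifier" is only a placeholder so that the sketch runs end to end, and the example paths are invented.

```python
import io


def read_task(stream):
    """Parse the header, the labelled training lines, and the test lines."""
    n_train, n_test = (int(x) for x in stream.readline().split())
    train = []
    for _ in range(n_train):
        # Split once only, so filenames containing spaces survive intact.
        code, path = stream.readline().rstrip("\n").split(" ", 1)
        train.append((int(code), path))
    test = [stream.readline().rstrip("\n") for _ in range(n_test)]
    return train, test


def majority_label(train):
    """Placeholder classifier: always predict the most common training label."""
    codes = [code for code, _ in train]
    return max(sorted(set(codes)), key=codes.count)


example = io.StringIO(
    "3 2\n"
    "0 /data/jazz/take five.mid\n"
    "1 /data/rock/song.mid\n"
    "0 /data/jazz/blue.mid\n"
    "/data/test/x.mid\n"
    "/data/test/y.mid\n"
)
train, test = read_task(example)
for _ in test:
    print(majority_label(train))  # one integer prediction per test song, in order
```

In an actual run, `read_task(sys.stdin)` would replace the `io.StringIO` example, and a trained model would replace `majority_label`.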

In terms of the input training data, I think we need a lot of it. In a typical multiclass experiment using SVM, the all-pairs technique on k classes yields about k*k/2 different SVMs (exactly k(k-1)/2), each trained on only a 2/k portion of the data. Thus, it is important that we reduce the number of categories or increase the number of training datapoints in order to get good measurements without too much confusion.
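To put numbers on this, the short sketch below computes the all-pairs counts for 9 and 38 classes. It assumes a balanced set of 25 recordings per class, as in the proposed collection; the function name is made up for this example.

```python
def all_pairs_stats(k, per_class):
    """Return (pairwise classifiers, examples per classifier, total examples)."""
    n_classifiers = k * (k - 1) // 2  # exact count; roughly k*k/2 for large k
    per_classifier = 2 * per_class    # only the two classes in the pair are used
    return n_classifiers, per_classifier, k * per_class


for k in (9, 38):
    print(k, all_pairs_stats(k, per_class=25))
```

Under these assumptions, 38 classes give 703 pairwise classifiers, each trained on 50 of the 950 recordings, while 9 classes give 36 classifiers, each trained on 50 of 225.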

Pedro's comments

On taxonomies:

I like Rudi's proposal of a multidimensional taxonomy. The genre dimension could be, for example, Cory's nine-leaf taxonomy, or even something with fewer categories (Jazz, Rock, Classical, Worldbeat and Folk, maybe?). The 'mood' dimension could also be something like 'lyrical', 'frantic', 'syncopated' or 'pointillistic', as proposed in Dannenberg et al. [1]. The 'popularity' dimension would be Hit/Nohit. The problem with this multidimensional taxonomy is that the labeling process is somewhat subjective and it has not been done yet.

I also would prefer an input format like the one described by Rudi. I like to keep things as simple as possible. I think we don't need information like song title or artist name to do music style classification. Moreover, this information could be used to classify by simply searching in a dictionary of authors and tunes. A way of preventing the use of such metadata in the classification process would be necessary.

On Training/testing:

It would be nice to be able to train our own systems, not depending on M2K, provided it is possible for the evaluation committee to provide us with training datasets without falling into copyright infringements.

On datasets:

We can send our small labelled dataset of Jazz and Classical music (110 files). I hope it is not too late to offer them to the evaluation committee (as of 2005/02/28).

[1] R. Dannenberg et al., "A machine learning approach to musical style recognition", Proc. ICMC'97, pp. 344-347