Difference between revisions of "2005:Audio Genre"

Latest revision as of 17:18, 9 May 2010

Proposer

Kris West (Univ. of East Anglia) kw@cmp.uea.ac.uk

Title

Genre Classification from polyphonic audio.

Description

The automatic classification of polyphonic musical audio (in PCM format) into a single high-level genre per example. If there is sufficient demand, a multiple genre track could be defined, requiring submissions to identify each genre (without prior knowledge of the number of labels), with the precision and recall scores calculated for each result.

1) Input data

The input for this task is a set of sound file excerpts adhering to the format, metadata and content requirements mentioned below.

Audio format:

CD-quality (PCM, 16-bit, 44100 Hz)
single channel (mono)
Whole files; algorithms may use segments at the authors' discretion

Audio content:

polyphonic music
data set should include at least 10 different genres (Suggestions include: Pop, Jazz/Blues, Rock, Heavy Metal, Reggae, Dancehall Ragga, Ballroom Dance, Electronic/Modern Dance (Jungle, Drum and Bass, Techno, House), Classical, Folk). "World" music is to be excluded, as this is a common "catch-all" for ethnic/folk music that is not easily classified into another group and can contain such diverse music as Indian tabla and Celtic rock.
The final set of genres is to be decided based on the data available
Genres will be organised hierarchically, on at least two levels. For example, a rough level I: Rock/Pop vs. Classical vs. Jazz/Blues and a detailed level II: Rock, Pop (within Rock/Pop), Chamber music, orchestral music (within Classical), Jazz, Blues (within Jazz/Blues).
both live performances and sequenced music are eligible
Each class should be represented by a minimum of approximately 100 examples. It is NOT essential that the same number of examples represent each class.
A tuning database will NOT be provided. However, the RWC Magnatune database used for the 2004 Audio description contest is still available (Training part 1 [1], Training part 2 [2], Development part 1 [3], Development part 2 [4])

Metadata:

By definition each example must have a genre label corresponding to one of the lowest level output classes. (Upper-level labels will be interpolated by evaluation software).
Where possible, existing genre labels should be confirmed by two or more sources. Due to IP constraints, it is unlikely that we will be allowed to distribute any database for metadata validation by participants. Viable sources for this metadata include CDDB, http://www.allmediaguide.com (http://www.allmusic.com), MP3.com or agreement by two or more human subjects.
The training set should be defined by a text file with one entry per line, in the following format (<> should be omitted; used here for clarity):
<example path and filename>\t<bottom-level genre classification>\t<top-level genre classification>\n
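A minimal sketch of reading a training set definition in the format above, assuming UTF-8 text and that blank lines may be skipped (the proposal does not say either way). The names `TrainingEntry` and `read_training_list`, and the sample paths, are hypothetical and used only for illustration.

```python
import io
from typing import List, NamedTuple


class TrainingEntry(NamedTuple):
    path: str          # example path and filename
    bottom_genre: str  # bottom-level genre classification
    top_genre: str     # top-level genre classification


def read_training_list(handle) -> List[TrainingEntry]:
    """Parse lines of the form path<TAB>bottom-level genre<TAB>top-level genre."""
    entries = []
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines are tolerated
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 tab-separated fields, got {len(fields)}"
            )
        entries.append(TrainingEntry(*fields))
    return entries


# Hypothetical two-line training list held in memory instead of a file.
sample = "audio/track001.wav\tBlues\tJazz/Blues\naudio/track002.wav\tRock\tRock/Pop\n"
entries = read_training_list(io.StringIO(sample))
```

With a real file, the same function could be called as `read_training_list(open(path, encoding="utf-8"))`.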

2) Output results

Results should be output into a text file with one entry per line in either of the following formats (<> should be omitted, used here for clarity):
- <example path and filename>\t<lowest-level genre classification>\n
  (Higher level classifications will be interpolated by evaluation framework)
  or
- <example path and filename>\t<bottom-level genre classification>\t<top-level genre classification>\n
  (This example uses a 2 level hierarchy; the number of labels is limited to the height of the taxonomy)
The following optional tab delimited descriptor format can be used by authors that wish to allow hybridisation of their submissions with other algorithms (including WEKA for classifier benchmarking)
- Descriptors for each example should be contained in their own file, named according to the following format: originalFileName.wav.features
- The file should be an ASCII text file in the following format:
  <columnLabel1>\t<columnLabel2>\t<columnLabel3>...etc
  0.0\t0.0\t0.0...etc
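The optional descriptor format can be illustrated with a short writer and reader. This is a sketch under assumptions: one header row of column labels followed by one or more rows of tab-separated numeric values, which is how the two example lines above are read here. The function names and the column labels are hypothetical.

```python
import io
from typing import List, Sequence, Tuple


def features_filename(audio_filename: str) -> str:
    """Name the descriptor file by appending '.features' to the audio file name."""
    return audio_filename + ".features"


def write_features(handle, labels: Sequence[str], rows: Sequence[Sequence[float]]) -> None:
    """Write a header row of column labels, then one tab-delimited row per vector."""
    handle.write("\t".join(labels) + "\n")
    for row in rows:
        if len(row) != len(labels):
            raise ValueError("row length does not match number of column labels")
        handle.write("\t".join(repr(float(value)) for value in row) + "\n")


def read_features(handle) -> Tuple[List[str], List[List[float]]]:
    """Read back the column labels and the numeric rows."""
    lines = [line.rstrip("\r\n") for line in handle if line.strip()]
    if not lines:
        raise ValueError("empty descriptor file")
    labels = lines[0].split("\t")
    rows = [[float(value) for value in line.split("\t")] for line in lines[1:]]
    return labels, rows


# Round trip through an in-memory buffer with hypothetical column labels.
buffer = io.StringIO()
write_features(buffer, ["mfcc1", "mfcc2", "centroid"], [[0.0, 1.5, 2.25]])
buffer.seek(0)
labels, rows = read_features(buffer)
```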

3) Maximum running time

The maximum running time for a single iteration of a submitted algorithm will be 24 hours (allowing a maximum of 72 hours for 3-fold cross-validation)

Participants

Kris West and Ming Li (University of East Anglia), kw@cmp.uea.ac.uk, mli@cmp.uea.ac.uk
Peter Ahrendt and Anders Meng (ISP, IMM, Technical University of Denmark), pa@imm.dtu.dk, am@imm.dtu.dk
Elias Pampalk (ÖFAI), elias@oefai.at
James Bergstra, Norman Casagrande and Douglas Eck (University of Montreal), james.bergstra@umontreal.ca, casagran@iro.umontreal.ca, eckdoug@iro.umontreal.ca
Gao Sheng and Kai Chen (Institute for Infocomm Research(A*STAR)), gaosheng@i2r.a-star.edu.sg, kchen@i2r.a-star.edu.sg
Michael Mandel and Dan Ellis (Columbia University), mim@ee.columbia.edu, dpwe@ee.columbia.edu
Thomas Lidy and Andreas Rauber (Vienna University of Technology), lidy@ifs.tuwien.ac.at, rauber@ifs.tuwien.ac.at
George Tzanetakis (University of Victoria), gtzan@cs.uvic.ca
Enric Guaus (Universitat Pompeu Fabra), eguaus@iua.upf.es
Juan Jose Burred (Technical University of Berlin), burred@nue.tu-berlin.de
Vitor Soares (University of Porto), vitor.soares@semanticaudiolabs.org
Nicolas Scaringella (EPFL), nicolas.scaringella@epfl.ch

Other Potential Participants

Fabien Gouyon (Universitat Pompeu Fabra), fabien.gouyon@iua.upf.es, Medium
François Pachet (Sony CSL-Paris), pachet@csl.sony.fr, Medium
Beth Logan, HP, beth.logan@hp.com, Medium
Nicolas Scaringella (EPFL), nicolas.scaringella@epfl.ch, CONFIRMED
McKinney and Breebaart (Philips research labs), martin.mckinney@philips.com, jeroen.breebaart@philips.com
Enrique Alexandre & Manuel Rosa (University of Alcala, Spain), enrique.alexandre@uah.es, manuel.rosa@uah.es

Evaluation Procedures

3 (or 5, time permitting) fold cross validation of all submissions using an equal proportion of each class for each fold.
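One way to obtain folds with an equal proportion of each class is to deal the examples of every class round-robin across the folds. The sketch below assumes this simple scheme; the actual framework may partition differently, and the function names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Iterator, List, Sequence, Tuple


def stratified_folds(labels: Sequence[str], k: int = 3, seed: int = 0) -> List[List[int]]:
    """Assign example indices to k folds so each class is spread evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    folds: List[List[int]] = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[position % k].append(index)  # deal round-robin
    return folds


def train_test_splits(folds: List[List[int]]) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists, holding out one fold per iteration."""
    for held_out in range(len(folds)):
        test = sorted(folds[held_out])
        train = sorted(i for f, fold in enumerate(folds) if f != held_out for i in fold)
        yield train, test


# Hypothetical label list: six Rock and three Jazz examples.
labels = ["Rock"] * 6 + ["Jazz"] * 3
folds = stratified_folds(labels, k=3)
splits = list(train_test_splits(folds))
```

Classes whose size is not a multiple of the fold count end up differing by at most one example between folds.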

Evaluation measures:

1 point will be scored for each correct label. I.e. for a two level hierarchy, correctly assigning the labels Jazz&Blues and Blues to an example scores 2 points.
If only the lowest-level classification (in the hierarchical taxonomy) is returned, the higher level classification will be interpolated. I.e. (in the previous example) correctly assigning the label Blues will score 2 points.
Simple accuracy and standard deviation of results (in the event of uneven class sizes both will be normalised according to class size).
Test significance of differences in error rates of each system at each iteration using McNemar's test, mean average and standard deviation of P-values.
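The point scoring and the class-size normalisation described above can be sketched as follows. The taxonomy is a hypothetical two-level example built from the genres named earlier, and "normalised according to class size" is interpreted here as the mean of the per-class accuracies, which is an assumption rather than something the proposal defines.

```python
from collections import defaultdict
from typing import List, Optional, Sequence

# Hypothetical taxonomy: lowest-level genre -> top-level genre.
PARENT = {
    "Blues": "Jazz/Blues",
    "Jazz": "Jazz/Blues",
    "Rock": "Rock/Pop",
    "Pop": "Rock/Pop",
}


def label_path(leaf: str) -> List[Optional[str]]:
    """Interpolate the upper-level label from a lowest-level label."""
    return [PARENT.get(leaf), leaf]


def hierarchical_points(predicted_leaf: str, true_leaf: str) -> int:
    """Score one point per level at which the predicted label is correct."""
    pairs = zip(label_path(predicted_leaf), label_path(true_leaf))
    return sum(1 for predicted, true in pairs if predicted == true)


def normalised_accuracy(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Mean of per-class accuracies, so large classes do not dominate (assumed reading)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, t in zip(predicted, truth):
        total[t] += 1
        correct[t] += int(p == t)
    return sum(correct[c] / total[c] for c in total) / len(total)


truth = ["Blues", "Blues", "Blues", "Rock"]
predicted = ["Blues", "Blues", "Jazz", "Pop"]
points = sum(hierarchical_points(p, t) for p, t in zip(predicted, truth))
accuracy = normalised_accuracy(predicted, truth)
```

In this example a Blues track labelled Jazz still earns one point, because the interpolated upper-level label Jazz/Blues is correct.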

Evaluation framework:

Competition framework to be defined in Data-2-Knowledge, D2K (http://alg.ncsa.uiuc.edu/do/tools/d2k), that will allow submission of contributions both in native D2K (using Music-2-Knowledge, http://www.isrl.uiuc.edu/~music-ir/evaluation/m2k, first release due end of Feb 2005), Matlab, Python and C++ using external code integration services provided in M2K. Submissions will be required to read in training set definitions from a text file in the format specified in 2.1 and output results in the format described in 2.2 above. Framework will define test and training set for each iteration of cross-validation, evaluate and rank results and perform McNemar's testing of differences between error-rates of each system. An example framework could be made available in March for submission development.
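For the McNemar's testing mentioned above, a compact sketch of the exact (binomial) form of the test is given below. The proposal does not say which variant of the test the framework uses, so the exact form is an assumption, and the function name and sample data are hypothetical.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(a_correct: Sequence[bool], b_correct: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for two systems scored on the same examples.

    Only the discordant examples matter: those that exactly one system got right.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both systems must be scored on the same examples")
    only_a = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    only_b = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagree, so there is no evidence of a difference
    # Binomial(n, 0.5) tail probability up to the smaller discordant count, doubled.
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical per-example correctness of two systems on one test fold.
system_a = [True, True, True, True, False, True, True, True]
system_b = [False, False, True, False, False, False, True, False]
p_value = mcnemar_exact(system_a, system_b)
```

Running this once per cross-validation iteration gives the set of P-values whose mean and standard deviation the evaluation measures call for.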

Format for algorithm calls

There are four formats for calls to code external to D2K that will be supported:

CommandName inputFileNameAndPath outputFileNameAndPath
CommandName inputFileNameAndPath (output file name created by adding an extension, e.g. ".features")

The second two formats allow an additional file to be passed as a parameter:

CommandName inputFileNameAndPath1 inputFileNameAndPath2 outputFileNameAndPath
CommandName inputFileNameAndPath1 inputFileNameAndPath2 outputFileNameAndPath (output file name created by adding an extension to inputFileNameAndPath1, e.g. ".features")

E.g.
ExtractFeatures C:\inTrainFiles.txt C:\outTrainFeatures.feat
ExtractFeatures C:\inTestFiles.txt C:\outTestFeatures.feat
TrainModel C:\outTrainFeatures.feat
ApplyModel C:\outTrainFeatures.feat.model C:\outTestFeatures.feat C:\results.txt
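A submission written in Python could resolve its input and output paths for the first two calling formats roughly as follows. This is a sketch only: the function name `resolve_io` is hypothetical, and it assumes the arguments are passed without the command name.

```python
from typing import List, Tuple


def resolve_io(args: List[str], extension: str = ".features") -> Tuple[str, str]:
    """Return (input path, output path) for the one- and two-argument call formats.

    With two arguments the output path is given explicitly; with one argument
    it is created by adding an extension to the input path.
    """
    if len(args) == 2:
        return args[0], args[1]
    if len(args) == 1:
        return args[0], args[0] + extension
    raise ValueError("expected: inputFileNameAndPath [outputFileNameAndPath]")


# Mirrors the ExtractFeatures and TrainModel calls in the example above.
explicit = resolve_io(["C:\\inTrainFiles.txt", "C:\\outTrainFeatures.feat"])
derived = resolve_io(["C:\\outTrainFeatures.feat"], extension=".model")
```

In a real script the argument list would typically come from `sys.argv[1:]`. The derived name in the second call matches the `.feat.model` file that ApplyModel receives in the example.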

Relevant Test Collections

Re-use the Magnatune database
Individual contributions of copyright-free recordings (including white-label vinyl and music DBs with Creative Commons licences)
Individual contributions of usable but copyright-controlled recordings (including in-house recordings from music departments)
Solicit contributions from http://creativecommons.org/audio/, http://www.epitonic.com, http://www.mp3.com/ (offers several free audio streams) and similar sites
Validate metadata through free services such as http://www.MP3.com, http://www.allmusic.com and CDDB

Ground truth annotations:

All annotations should be validated, rather than accepting supplied genre labels, by at least two sources including non-participating volunteers (if possible).

Review 1

The two proposals on artist identification and genre classification from musical audio are essentially the same in that they involve classifying long segments of audio (1 minute or longer) into a set of categories defined by training examples. Both tests follow on from successful evaluations held at ISMIR2004; there was good interest and interesting results, and I think we can expect good participation in 2005.

The tasks are well-defined, easily understood, and appear to have some practical importance. The evaluation and testing procedures are very good. This is an active research area, so it should be possible to obtain multiple submissions, particularly given last year's results.

My only comments relate to the choice of ground truth data. In terms of a dataset to use, I do not think we should worry unduly about copyright restrictions on distribution. If it were possible to set up a centralized "feature calculation server" (e.g. using D2K), we could put a single copy of the copyright materials on that server, then allow participants to download only the derived features, which I'm sure would avoid any complaints from the copyright holders. (I believe NCSA has a copy of the "uspop2002" dataset from MIT/Columbia.)

My worry is that the bias of using only unencumbered music will give results not representative of performance on 'real' data, although I suppose we could distribute a small validation set of this kind purely to verify that submitted algorithms are running the same at both sites.

In fact, the major problems from running these evaluations in 2004 came from the ambitious goal of having people submit code rather than results. In speech recognition, evaluations are run by distributing the test data, leaving each site to run their recognizers themselves, then having them upload the recognition outputs for scoring (only). They sometimes even deal with copyright issues by making each participant promise to destroy the evaluation source materials after the evaluation is complete. Although this relies on the integrity of all participants not to manually fix up their results, this is not a big risk in practice, particularly if no ground truth for the evaluation set is distributed i.e. you'd have to be actively deceitful, rather than just sloppy, to cheat.

Having separate training and testing sets, with and without ground truth respectively, precludes the option of multiple 'jackknife' testing, where a single pool of data is divided into multiple train/test divisions. However, having each site run their own classifiers is a huge win in terms of the logistics of running the test. I would, however, discourage any scheme which involved releasing the ground-truth results for the test set, since it is too easy to unwittingly train your classifier on your test set, if the test set labels are just lying around.

I'm not sure how important the M2K/D2K angle is. It's a nice solution to the copyright issue, and I suppose the hope is that it will solve the problem of getting code running at remote sites, but I am worried that the added burden of figuring out D2K and porting existing systems to it will act as an additional barrier to participation. By contrast, requiring that people submit only the textual output labels in the specified format should be pretty easy for any team to produce without significant additional coding.

In terms of the genre contest, the big issue is the unreliability and unclear definitions of the ground truth labels. It seems weird to have one evaluation on the ability to distinguish an arbitrary set of artists - a very general-sounding problem - and another contest which is specifically dominated by the ability to distinguish classical from jazz from rock - a very specific, and perhaps not very important, problem.

Again in this case I don't particularly like the idea of trying to get multiple labellings: for artists, I thought it was unnecessary because agreement will be very high. Here, I think it's of dubious value because agreement will be so low; in both cases, errors in ground truth impact all participants equally, and so are not really a concern - we are mostly interested in relative values, so a ceiling on absolute performance due to a few 'incorrect' reference labels is of little consequence.

Clearly, we can run a genre contest: I would again advocate for real music, and not worry too much about copyright issues, and not even worry too much about where the genre ground truth comes from, since it is always pretty suspect; allmusic.com is as good a source as any. But I personally find this contest of less intellectual interest than artist ID, even though it has historically received more attention, because of the poor definition of the true, underlying classes.

I guess the strongest thing in favor of the genre contest is that if you have a system to evaluate either of artist ID or genre ID, you can use it unmodified for both (simply by changing the ground truth labels), so we might as well run both if only to see how well the results of these two tests correlate over different algorithms. It's a great shame we didn't do this at ISMIR2004, which I think was due only to a needless misunderstanding among participants (related to the MFCC features made available).

Review 2

The single genre problem is well defined and seems to be a relevant problem for the MIR community nowadays. Obviously, it would be more relevant to classify each track into multiple genres or to use a hierarchy of genres, but the proposal does not deal with these issues in a satisfying way. If a track belongs to several genres, are these genres equally weighted or not? Are they determined by asking several people to classify each track into one genre, or by asking each one to classify each track into several genres? If there are nodes for Electronic and Jazz/Blues, where lies the leaf Electro-jazz? I suggest that the contest concentrates on the well-defined simple genre problem. An interesting development of it would be to ask algorithms to associate a probability with each predefined genre on each track, instead of outputting a single genre with 100% probability. Regarding the input format, I think that whole files are better (the total duration and the volume variation are already good genre descriptors) and that polyphony is not required (classical music contains many works for solo instruments).

I have no precise opinion regarding the defined genres, since this is more a matter of cultural importance. I'm not sure that Rock is less diverse than World (what's the common point between Elvis and Radiohead?). Also I am surprised that there is no Rap/RnB. The choice of the genre classes is a crucial issue if the contest is to be held several times. Indeed, existing databases can be reused only when the defined categories are identical each year. Thus I would like this choice to be discussed more by the participants.

The list of participants is relevant. McKinney and Breebaart could be added.

It is a good idea to accept many programming languages for submission. However, it seems quite difficult to implement the learning phase, because each algorithm may use different structures to store learnt data. For instance, when the algorithm computes descriptors and feeds them through a classifier, is it possible to select the best descriptors? If not, it is not realistic to suppose that the participant has to do it beforehand on his own limited set of data. Then I see two possibilities: either participants are given 50% of the database and do all the learning work themselves (then no k-fold cross validation is performed), or submissions concern only sets of descriptors and not full classification algorithms. The second choice has the advantage of allowing different sets of descriptors to be compared with the same classifiers.

The test data are relevant but still a bit vague. Obviously existing databases should be used again and completed with new annotated data. The participants should list their own databases in detail and put them in common for evaluation in order to evaluate the time needed to annotate new data.

Downie's Comment

1. Think genre tasks are kinda fun, actually. Devil is in the details. Would give my eye teeth to avoid manually labelling genre classes. You set up eight classes with 100-150 examples. That comes to 800-1200 labels that need applying. Can we as a group come up with a possible standardized source for genre labels and then, even though they are not perfect, live with our choice? Perhaps in these early days, we would be best served by looking at only the broadest of categories and not fussing about the fine-grained subdivisions?

2. Would be interesting to have a TRUE genre task! As we learned in the UPF doctoral seminar prior to ISMIR 2004, genre is properly defined as the "use" of the music: dance, liturgical, funereal, etc. What we are calling genre here is really style. Just a thought.

Kris' thoughts

Contents:

Multiple genres and Artist ID
Framework issues and algorithm submission
Producing ground truth and answers to Downie's comments
Who has data?

1. Multiple genres and Artist ID

Dan Ellis wrote:

> About multiple genre classification: I have pretty serious doubts
> about genre classification in the first place, because of the
> seemingly arbitrary nature of the classes and how they are assigned.

IMHO the genre classification task is to reproduce an arbitrary set of culturally assigned classes. Tim Pohle (at the ISMIR 2004 grad school) gave an interesting talk on using genre classifiers to reproduce arbitrary, user assigned classes, to manage a user's personal music collection. We also discussed how to suggest new music choices from a larger catalog by thresholding the probabilities of membership of new music to favored classes.

Dan Ellis wrote:

> This is why I prefer artist identification as a task. That said,
> assigning multiple genres seems not much worse, but not much better
> either. Allowing for fuzzy, multiple characteristics seems to address
> some of the problems with genres -- which is good -- but now defining
> the ground truth is even more arguable and problematic, since we now
> have that much more ground truth data to define -- degree of
> membership, and over a larger set of classes.

I also prefer the artist ID task, but for different reasons; I think we use too few classes to properly evaluate the Genre classification task, as some models/techniques fall over if given too many classes to evaluate. Obviously this has come about because of storage, IP and ground-truth constraints. However, if a hierarchy is used (as suggested for the symbolic track), rather than a bag of labels, the ground-truth problem is no bigger, as higher level labels can be interpolated, and it will be easier both to expand the database to include more pieces and to implement a finer granularity of labels (more sub-genres) in later evaluations. Small taxonomies are the biggest hurdle in the accurate evaluation of Genre classification systems; we can probably define around 10 lowest level classes for this year, but should aim to add the same number again next year and the year after until we can be confident that we have a database that poses a classification problem that is as difficult as a real world application (such as organizing/classifying the whole Epitonic catalog).

Dan Ellis wrote:

> One of the reasons I am interested in a parallel evaluation of genre
> classification and artist ID is that it may provide some objective
> evidence for my gut bias against genres: if the results of different
> algorithms under genre classification are more equivocal than artist
> ID (i.e. they reveal less about the difference between the
> algorithms), then that's some kind of evidence that the task itself is
> ill-posed. My suspicion is that multiple, real-valued genre
> memberships will be even less revealing.

I also believe that many classification techniques and feature extractions are vulnerable to smaller numbers of examples per class, and I think this is far more likely to show up in the comparison of the performance of an algorithm between the two tracks (my own submission will be modified for the artist ID track). Artist identification is about modeling a natural grouping within the data, whereas genres are not necessarily natural groupings, and I believe the accurate modeling of hierarchical, multi-modal genres is likely to be more complex than modeling an artist's work (although this is alleviated by the additional data available). An artist may work in a number of styles but there is *usually* some dimension along which all the examples are grouped.

Dan Ellis wrote:

> The most important thing, I think, is to define the evaluation to support
> (and encourage) the largest number of participants, meaning that we could
> include this as an option, but also evaluate a 1-best genre ID to remain
> accessible to algorithms that intrinsically can only report one class.

With this in mind I think we should opt for a hierarchical taxonomy, which can support direct comparison of hierarchical classifiers, single label classifiers (by interpolating higher level classifications in evaluation framework) and multiple label classifiers (in a somewhat limited fashion, perhaps with a penalty for additional incorrect labels, which is probably not fair, or by limiting number of labels to match height of taxonomy). I suggest that each correct label scores one point, e.g. rock/pop, rock, indie rock would score 3 if all labels are correct.

2. Framework issues and algorithm submission

I don't think it is particularly ambitious to have people submit their code for evaluation at a single location. I have already implemented a basic D2K framework that can run anything that will run from the command line, including Matlab. The only constraint is that a submission will have to conform to a simple text file input and output format. Marsyas-0.1 and Matlab examples have been produced and I am happy for people to take the IO portions directly from this code if they wish. Having code submitted to a central evaluation site will allow us to perform cross-validation experiments and assess exactly how much variance there is in each system's performance. This would not be so essential if we had a very large data set (min 10,000 examples); however, we are going to get nowhere near that many (maybe in later years...). It was also suggested in the reviews that this would hamstring feature selection techniques (see review 2), but I don't believe this; surely the feature selection code (including any classifier used) would be correctly implemented in the feature extraction phase.

I could also define an optional simple text file format for descriptors. This would allow the hybridization of any submitted systems using this format and the use of a bench mark classifier to evaluate the power of the descriptors and classifiers independent of each other. I would be happy to provide several bench mark classifiers for this purpose (possibly by creating an interface to Weka). I would also be interested in seeing the performance of a mixture of experts system, built from all the submitted systems, which should, in theory, be able to equal or better the performance of all of the submitted systems.

M2K is coming up to its Alpha release and will include a cut-down version of the competition framework so that people can see how it works (External integration itinerary). As D2K can run across X Windows, we could even provide a virtual lab evaluation setup, so that each participant could run their own submission (without violating any IP laws) if they really wanted, and ensure that it ran ok. Anyone can get a license for D2K and the framework will probably be included in a later version of M2K, so anyone can make sure that their submission works ahead of time.

3. Producing ground truth and answers to Downie's comments

First, I don't think we need to send out a tuning database; it creates problems and solves none. If data is held at the evaluation site we don't have any IP issues and, as I stated earlier, anyone could launch their submission themselves in D2K across an X Windows session (note all console output from code external to D2K is collected and forwarded to the D2K console to aid debugging). If we use IP-free databases, we are unlikely to be able to validate ground-truth with online services such as http://www.allmediaguide.com/, and it has also been suggested that IP-free databases are not necessarily representative of the whole music community. Several people have said that it doesn't matter if some of the labels are incorrect; however, I'm not afraid to volunteer to validate the labels of a subset of the data (say 200 files, humans can get through them quicker than you'd think), and if there were sufficient volunteers it would go a long way to establishing an IP-free research dbase with good ground-truth (if I don't get any volunteers I won't consider this an option, so email me!).

Personally I think we should use a large volume of copyrighted material, with labels confirmed by at least two sources (existing dbase label, CDDB and allmediaguide, or a human labeler). The format should be WAV (MP3s would have to be decoded to this anyway) and will be mono unless anyone specifically requests stereo (both can be made available or can be handled by framework).

Should we rename this the Style classification task?

4. Who has data?

Anyone with music (with or without ground-truth) that we can use should make themselves known ASAP. I can provide a fair selection of white label (IP-free) dance music in at least 3 subgenres, with labels defined by 3 expert listeners.


@@ Line 19: / Line 19: @@
 * CD-quality (PCM, 16-bit, 44100 Hz)
 * single channel (mono)
-* Either whole files or 1 minute excerpts
+* Whole files, algorithms may use segments at authors discretion
 Audio content:
 * polyphonic music
-* data set should include at least 8 different genres (Suggestions include: Pop, Jazz/Blues, Rock, Heavy Metal, Reggae, Ballroom Dance, Electronic/Modern Dance, Classical, Folk - to Exclude "World" music as this is a common "catch-all" for ethnic/folk music that is not easily classified into another group and can contain such diverse music as Indian tabla and Celtish rock)
+* data set should include at least 10 different genres (Suggestions include: Pop, Jazz/Blues, Rock, Heavy Metal, Reggae, Dancehall Ragga, Ballroom Dance, Electronic/Modern Dance (Jungle, Drum and Bass, Techno, House), Classical, Folk - to Exclude "World" music as this is a common "catch-all" for ethnic/folk music that is not easily classified into another group and can contain such diverse music as Indian tabla and Celtish rock).
-* the classification could also be evaluated in two levels. For example, a rough level I: Rock/Pop vs. Classical vs. Jazz/Blues and a detailed level II: Rock, Pop (within Pop/Rock), Chamber music, orchestral music (within Classical), Jazz, Blues (within Jazz/Blues).
+* Final set of Genres to be decided on data available
+* Genres will be organised hierachically, on at least two levels. For example, a rough level I: Rock/Pop vs. Classical vs. Jazz/Blues and a detailed level II: Rock, Pop (within Pop/Rock), Chamber music, orchestral music (within Classical), Jazz, Blues (within Jazz/Blues).
 * both live performances and sequenced music are eligible
-* Each class should be represented by a minimum of 100 examples, but 150 would be preferred. If possible the same number of examples should represent each class.
+* Each class should be represented by a minimum approximately 100 examples. It is NOT essential that the same number of examples represent each class.
-* If possible a subset of data (20%) should be given to participants, in the contest format. It is not essential that these examples belong to the final database (distribution of which may be constrained by copyright issues), as they should primarily be used for testing  correct execution of algorithm submissions.
+* A tuning database will NOT be provided. However the RWC Magnatune database used for the 2004 Audio desciption contest is still available (Training part 1 [http://www.iua.upf.es/mtg/ismir2004/contest/Training_Tracks1.tar.gz], Training part 2 [http://www.iua.upf.es/mtg/ismir2004/contest/Training_Tracks2.tar.gz], Development part 1 [http://www.iua.upf.es/mtg/ismir2004/contest/Development_Tracks1.tar.gz], Development part 2 [http://www.iua.upf.es/mtg/ismir2004/contest/Development_Tracks2.tar.gz])
 Metadata:
-* By definition each example must have a genre label corresponding to one of the output classes.
+* By definition each example must have a genre label corresponding to one of the lowest level output classes. (Upper-level labels will be interpolated by evaluation software).
-* Where possible existing genre labels should be confirmed by two or more non-entrants, due to IP contsraints it is unlikely that we will be allowed to distribute any database for meta data validation by participants.
+* Where possible existing genre labels should be confirmed by two or more sources, due to IP contsraints it is unlikely that we will be allowed to distribute any database for meta data validation by participants. Viable sources for this metadata include CDDB, http://www.allmediaguide.com (http://www.allmusic.com), MP3.com or agreement by two or more human subjects.
-* The training set should be defined by a text file with one entry per line, in the following format:
+* The training set should be defined by a text file with one entry per line, in the following format(<> should be omitted, used here for clarity):<br><example path and filename>\t<bottom-level genre classification>\t<top-level genre classification>\n<br>
-<example path and filename>\t<genre label>\n
 ) Output results
-Results should be output into a text file with one entry per line in the following format:
-<example path and filename>\t<genre classification>\n
+* Results should be output into a text file with one entry per line in either of the following formats (<> should be omitted, used here for clarity):
+** <example path and filename>\t<lowest-level genre classification>\n<br>(Higher level classifications will be interpolated by evaluation framework)<br>'''or'''<br>
+** <example path and filename>\t<bottom-level genre classification>\t<top-level genre classification>\n<br>(This example uses a 2 level hierachy, number of labels is limited to height of taxonomy)
+* The following optional tab delimited descriptor format can be used by authors that wish to allow hybridisation of their submissions with other algorithms (including WEKA for classifier benchmarking)
+** Descriptors for each example should be contained in their own file, named according to the following format: originalFileName.wav.features
+** The file should be an ASCII text file in the following format:<br><columnLabel1>\t<columnLabel2>\t<columnLabel3>...etc<br>0.0\t0.0\t0.0...etc<br>
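As an illustration of the two file formats above, the following Python sketch formats result lines and reads a tab-delimited descriptor file. The function names `format_results` and `read_features` are hypothetical, and the choice to return descriptors as one list of floats per column label is an assumption made for the example, not part of the proposal.

```python
from typing import Dict, List, Sequence


def format_results(rows: Sequence[Sequence[str]]) -> str:
    """Format rows of (path, label[, higher-level label]) as result lines."""
    return "".join("\t".join(row) + "\n" for row in rows)


def read_features(text: str) -> Dict[str, List[float]]:
    """Read a descriptor file: a header line of column labels, then numbers."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return {}
    labels = lines[0].split("\t")
    columns: Dict[str, List[float]] = {label: [] for label in labels}
    for ln in lines[1:]:
        values = ln.split("\t")
        if len(values) != len(labels):
            raise ValueError("row width does not match the header")
        for label, value in zip(labels, values):
            columns[label].append(float(value))
    return columns
```

A file named, say, `track01.wav.features` would be read into text and passed to `read_features`; the column labels themselves are left to the submitting author.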
+) Maximum running time
+* The maximum running time for a single iteration of a submitted algorithm will be 24 hours (allowing a maximum of 72 hours for 3-fold cross-validation)
+==Participants==
+* Kris West and Ming Li (University of East Anglia), kw@cmp.uea.ac.uk, mli@cmp.uea.ac.uk
+* Peter Ahrendt and Anders Meng (ISP, IMM, Technical University of Denmark), pa@imm.dtu.dk, am@imm.dtu.dk
+* Elias Pampalk (ÖFAI), elias@oefai.at
+* James Bergstra, Norman Casagrande and Douglas Eck (University of Montreal), james.bergstra@umontreal.ca, casagran@iro.umontreal.ca, eckdoug@iro.umontreal.ca
+* Gao Sheng and Kai Chen (Institute for Infocomm Research(A*STAR)), gaosheng@i2r.a-star.edu.sg, kchen@i2r.a-star.edu.sg
+* Michael Mandel and Dan Ellis (Columbia University), mim@ee.columbia.edu, dpwe@ee.columbia.edu
+* Thomas Lidy and Andreas Rauber (Vienna University of Technology), lidy@ifs.tuwien.ac.at, rauber@ifs.tuwien.ac.at
+* George Tzanetakis (University of Victoria), gtzan@cs.uvic.ca
+* Enric Guaus (Universitat Pompeu Fabra), eguaus@iua.upf.es
+* Juan Jose Burred (Technical University of Berlin), burred@nue.tu-berlin.de
+* Vitor Soares (University of Porto), vitor.soares@semanticaudiolabs.org
+* Nicolas Scaringella (EPFL), nicolas.scaringella@epfl.ch
-==Potential Participants==
+==Other Potential Participants==
-* Dan Ellis & Brian Whitman (Columbia University, MIT), dpwe@ee.columbia.edu, High
-* Elias Pampalk (ÖFAI), elias@oefai.at, High
-* George Tzanetakis (Univ. of Victoria), gtzan@cs.uvic.ca, High
-* Kris West (Univ. of East Anglia), kw@cmp.uea.ac.uk, High
-* Thomas Lidy & Andreas Rauber (Vienna University of Technology), lidy@ifs.tuwien.ac.at, rauber@ifs.tuwien.ac.at, High
 * Fabien Gouyon (Universitat Pompeu Fabra), fabien.gouyon@iua.upf.es, Medium
 * François Pachet (Sony CSL-Paris), pachet@csl.sony.fr, Medium
+* Beth Logan, HP, beth.logan@hp.com, Medium
+* Nicolas Scaringella (EPFL), nicolas.scaringella@epfl.ch, CONFIRMED
+* McKinney and Breebaart (Philips Research Labs), martin.mckinney@philips.com, jeroen.breebaart@philips.com
+* Enrique Alexandre & Manuel Rosa (University of Alcala, Spain), enrique.alexandre@uah.es, manuel.rosa@uah.es
 ==Evaluation Procedures==
@@ Line 56: / Line 77: @@
 Evaluation measures:
-* Simple accuracy and standard deviation of results (in the event of uneven class sizes both should be normalised according to class size).
+* 1 point will be scored for each correct label. E.g. for a two-level hierarchy, correctly assigning the labels Jazz&Blues and Blues to an example scores 2 points.
+* If only the lowest-level classification (in the hierarchical taxonomy) is returned, the higher-level classification will be interpolated. E.g. (in the previous example) correctly assigning the label Blues will score 2 points.
+* Simple accuracy and standard deviation of results (in the event of uneven class sizes both will be normalised according to class size).
 * Test the significance of differences in the error rates of each system at each iteration using McNemar's test, reporting the mean and standard deviation of the P-values.
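The point scoring and class-size normalisation in the measures above can be sketched in Python as follows. The `PARENT` taxonomy is a hypothetical two-level example (the final genre set is still undecided), and reading "normalised according to class size" as the mean of per-class accuracies is an assumption, not something the proposal fixes.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

# Hypothetical two-level taxonomy (lowest-level genre -> parent genre).
PARENT: Dict[str, str] = {
    "Blues": "Jazz&Blues",
    "Jazz": "Jazz&Blues",
    "Rock": "Rock/Pop",
    "Pop": "Rock/Pop",
}


def expand(leaf: str) -> List[str]:
    """Return the label plus its interpolated parent, if it has one."""
    parent = PARENT.get(leaf)
    return [leaf] if parent is None else [leaf, parent]


def score(predicted: Sequence[str], true_leaf: str) -> int:
    """One point per correct label for a single example."""
    labels = list(predicted)
    if len(labels) == 1:
        labels = expand(labels[0])  # interpolate the higher level
    return len(set(labels) & set(expand(true_leaf)))


def class_normalised_accuracy(predicted: Sequence[str],
                              truth: Sequence[str]) -> float:
    """Mean of per-class accuracies on lowest-level labels (assumed reading)."""
    if not truth:
        raise ValueError("no examples to evaluate")
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for p, t in zip(predicted, truth):
        total[t] += 1
        correct[t] += int(p == t)
    return sum(correct[c] / total[c] for c in total) / len(total)
```

With this reading, a wrong lowest-level label under the correct parent (for example Jazz for a Blues track) still earns the one point for the interpolated higher-level label.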
 Evaluation framework:
-Competition framework to be defined in Data-2-Knowledge, D2K (http://alg.ncsa.uiuc.edu/do/tools/d2k), that will allow submission of contributions both in native D2K (using Music-2-Knowledge, http://www.isrl.uiuc.edu/~music-ir/evaluation/m2k/, first release due 20th Jan 2005), Matlab, Python and C++ using external code integration services provided in M2K. Submissions will be required to read in training set definitions from a text file in the format specified in 2.1 and output results in the format described in 2.2 above. Framework will define test and training set for each iteration of cross-validation, evaluate and rank results and perform McNemar's testing of differences between error-rates of each system. An example framework could be made available early February for submission development.
+Competition framework to be defined in Data-2-Knowledge, D2K (http://alg.ncsa.uiuc.edu/do/tools/d2k), that will allow submission of contributions both in native D2K (using Music-2-Knowledge, http://www.isrl.uiuc.edu/~music-ir/evaluation/m2k, first release due end of Feb 2005), Matlab, Python and C++ using external code integration services provided in M2K. Submissions will be required to read in training set definitions from a text file in the format specified in 2.1 and output results in the format described in 2.2 above. Framework will define test and training set for each iteration of cross-validation, evaluate and rank results and perform McNemar's testing of differences between error-rates of each system. An example framework could be made available in March for submission development.
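McNemar's test, as used by the framework to compare error rates, looks only at the examples on which two systems disagree. The sketch below uses the exact (binomial) form of the test; the proposal does not say whether the exact or the chi-squared form is intended, so that choice and the name `mcnemar_exact` are assumptions.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(correct_a: Sequence[bool],
                  correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for two systems on the same examples.

    correct_a[i] / correct_b[i] say whether system A / B labelled
    example i correctly.
    """
    # Discordant pairs: one system right, the other wrong.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagree
    k = min(only_a, only_b)
    # Probability of a split at least this uneven under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Running this once per cross-validation iteration for each pair of systems would give the P-values whose mean and standard deviation are to be reported.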
+==Format for algorithm calls==
+There are four formats for calls to code external to D2K that will be supported:
+* CommandName inputFileNameAndPath outputFileNameAndPath
+* CommandName inputFileNameAndPath (output file name created by adding an extension, e.g. ".features")
+The second two formats allow an additional file to be passed as a parameter:
+* CommandName inputFileNameAndPath1 inputFileNameAndPath2 outputFileNameAndPath
+* CommandName inputFileNameAndPath1 inputFileNameAndPath2 outputFileNameAndPath (output file name created by adding an extension to inputFileNameAndPath1, e.g. ".features")
+'''E.g.'''<br>
+ExtractFeatures C:\inTrainFiles.txt C:\outTrainFeatures.feat<br>
+ExtractFeatures C:\inTestFiles.txt C:\outTestFeatures.feat<br>
+TrainModel C:\outTrainFeatures.feat<br>
+ApplyModel C:\outTrainFeatures.feat.model C:\outTestFeatures.feat C:\results.txt<br>
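To make the call formats concrete, here is a small Python sketch of how a caller might assemble the argument list and work out where the output will appear. The helper `build_call` is hypothetical and is not part of D2K or M2K; the ".model" extension in the usage below is inferred from the example calls above rather than specified anywhere.

```python
from typing import List, Optional, Sequence, Tuple


def build_call(command: str,
               inputs: Sequence[str],
               output: Optional[str] = None,
               extension: str = ".features") -> Tuple[List[str], str]:
    """Return (argument list, expected output path) for one external call.

    With an explicit output, the path is passed as the last argument.
    Without one, the output name is the first input plus an extension.
    """
    if len(inputs) not in (1, 2):
        raise ValueError("expected one or two input files")
    if output is None:
        return [command, *inputs], inputs[0] + extension
    return [command, *inputs, output], output
```

For instance, `build_call("TrainModel", [r"C:\outTrainFeatures.feat"], extension=".model")` yields the two-element argument list and the derived path `C:\outTrainFeatures.feat.model`; the argument list could then be handed to something like `subprocess.run`.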
 ==Relevant Test Collections==
-Re-use Magnatune database (???)
+Re-use Magnatune database
 Individual contributions of copyright-free recordings (including white-label vinyl and music DBs with creative commons)
 Individual contributions of  usable but copyright-controlled recordings (including in-house recordings from music departments)
-Solicit contributions from http://creativecommons.org/audio/, http://www.mp3.com/ (offers several free audio streams) and similar sites
+Solicit contributions from http://creativecommons.org/audio/, http://www.epitonic.com, http://www.mp3.com/ (offers several free audio streams) and similar sites
+Validate metadata through free services such as http://www.MP3.com, http://ww.allmusic.com and CDDB
 Ground truth annotations:
-All annotations should be validated, rather than accepting supplied genre labels, by at least two non-participating volunteers (if possible). If copyright restrictions allow, this could be extended to each of the participating groups, final classification being decided by a majority vote. Any particularly contentious classifications could be removed.
+All annotations should be validated, rather than accepting supplied genre labels, by at least two sources including non-participating volunteers (if possible).
 ==Review 1==
+The two proposals on artist identification and genre classification from musical audio are essentially the same in that they involve classifying long segments of audio (1 minute or longer) into a set of categories defined by training examples.  Both tests follow on from successful evaluations held at ISMIR2004; there was good interest and interesting results, and I think we can expect good participation in 2005.
+The tasks are well-defined, easily understood, and appear to have some practical importance.  The evaluation and testing procedures are very good.  This is an active research area, so it should be possible to obtain multiple submissions, particularly given last year's results.
+My only comments relate to the choice of ground truth data. In terms of a dataset to use, I do not think we should worry unduly about copyright restrictions on distribution.  If it were possible to set up a centralized "feature calculation server" (e.g. using D2K), we could put a single copy of the copyright materials on that server, then allow participants to download only the derived features, which I'm sure would avoid any complaints from the copyright holders.  (I believe NCSA has a copy of the "uspop2002" dataset from MIT/Columbia.)
+My worry is that the bias of using only unencumbered music will give results not representative of performance on 'real' data, although I suppose we could distribute a small validation set of this kind purely to verify that submitted algorithms are running the same at both sites.
+In fact, the major problems from running these evaluations in 2004 came from the ambitious goal of having people submit code rather than results.  In speech recognition, evaluations are run by distributing the test data, leaving each site to run their recognizers themselves, then having them upload the recognition outputs for scoring (only). They sometimes even deal with copyright issues by making each participant promise to destroy the evaluation source materials after the evaluation is complete.  Although this relies on the integrity of all participants not to manually fix up their results, this is not a big risk in practice, particularly if no ground truth for the evaluation set is distributed i.e. you'd have to be actively deceitful, rather than just sloppy, to cheat.
+Having separate training and testing sets, with and without ground truth respectively, precludes the option of multiple 'jackknife' testing, where a single pool of data is divided into multiple train/test divisions.  However, having each site run their own classifiers is a huge win in terms of the logistics of running the test.  I would, however, discourage any scheme which involved releasing the ground-truth results for the test set, since it is too easy to unwittingly train your classifier on your test set, if the test set labels are just lying around.
+I'm not sure how important the M2K/D2K angle is.  It's a nice solution to the copyright issue, and I suppose the hope is that it will solve the problem of getting code running at remote sites, but I am worried that the added burden of figuring out D2K and porting existing systems to it will act as an additional barrier to participation.  By contrast, requiring that people submit only the textual output labels in the specified format should be pretty easy for any team to produce without significant additional coding.
+In terms of the genre contest, the big issue is the unreliability and unclear definitions of the ground truth labels.  It seems weird to have one evaluation on the ability to distinguish an arbitrary set of artists - a very general-sounding problem - and another contest which is specifically dominated by the ability to distinguish classical from jazz from rock - a very specific, and perhaps not very important, problem.
+Again in this case I don't particularly like the idea of trying to get multiple labellings: for artists, I thought it was unnecessary because agreement will be very high.  Here, I think it's of dubious value because agreement will be so low; in both cases, errors in ground truth impact all participants equally, and so are not really a concern - we are mostly interested in relative values, so a ceiling on absolute performance due to a few 'incorrect' reference labels is of little consequence.
+Clearly, we can run a genre contest: I would again advocate for real music, and not worry too much about copyright issues, and not even worry too much about where the genre ground truth comes from, since it is always pretty suspect; allmusic.com is as good a source as any. But I personally find this contest of less intellectual interest than artist ID, even though it has historically received more attention, because of the poor definition of the true, underlying classes.
+I guess the strongest thing in favor of the genre contest is that if you have a system to evaluate either of artist ID or genre ID, you can use it unmodified for both (simply by changing the ground truth labels), so we might as well run both if only to see how well the results of these two tests correlate over different algorithms.  It's a great shame we didn't do this at ISMIR2004, which I think was due only to a needless misunderstanding among participants (related to the MFCC features made available).
 ==Review 2==
@@ Line 92: / Line 153: @@
 The test data are relevant but still a bit vague. Obviously existing databases should be used again and completed with new annotated data. The participants should list their own databases in detail and put them in common for evaluation in order to evaluate the time needed to annotate new data.
+==Downie's Comment==
+. Think genre tasks are kinda fun, actually. Devil is in the details. Would give my eye teeth to avoid manually labelling genre classes. You set up eight classes with 100-150 examples. That comes to 800-1200 labels that need applying. Can we as a group come up with a possible standardized source for genre labels and then, even though they are not perfect, live with our choice? Perhaps in these early days, we would be best served by looking at only the broadest of categories and not fussing about the fine-grained subdivisions?
+. Would be interesting to have a TRUE genre task! As we learned in the UPF doctoral seminar prior to ISMIR 2004, genre is properly defined as the "use" of the music: dance, liturgical, funereal, etc. What we are calling genre here is really style. Just a thought.
+==Kris' thoughts==
+Contents:
+# Multiple genres and Artist ID
+# Framework issues and algorithm submission
+# Producing ground truth and answers to Downie's comments
+# Who has data?
+----
+1. Multiple genres and Artist ID
+Dan Ellis wrote:
+> About multiple genre classification:  I have pretty serious doubts<br>
+> about genre classification in the first place, because of the<br>
+> seemingly arbitrary nature of the classes and how they are assigned.<br>
+IMHO the genre classification task is to reproduce an arbitrary set of culturally assigned classes. Tim Pohle (at the ISMIR 2004 grad school) gave an interesting talk on using genre classifiers to reproduce arbitrary, user assigned classes, to manage a user's personal music collection. We also discussed how to suggest new music choices from a larger catalog by thresholding the probabilities of membership of new music to favored classes.
+Dan Ellis wrote:
+> This is why I prefer artist identification as a task.  That said,<br>
+> assigning multiple genres seems not much worse, but not much better<br>
+> either.  Allowing for fuzzy, multiple characteristics seems to address<br>
+> some of the problems with genres -- which is good -- but now defining<br>
+> the ground truth is even more arguable and problematic, since we now<br>
+> have that much more ground truth data to define -- degree of<br>
+> membership, and over a larger set of classes.<br>
+I also prefer the artist ID task, but for different reasons: I think we use too few classes to properly evaluate the genre classification task, as some models/techniques fall over if given too many classes to evaluate. Obviously this has come about because of storage, IP and ground-truth constraints. However, if a hierarchy is used (as suggested for the symbolic track), rather than a bag of labels, the ground-truth problem is no bigger, as higher-level labels can be interpolated, and it will be easier both to expand the database to include more pieces and to implement a finer granularity of labels (more sub-genres) in later evaluations. Small taxonomies are the biggest hurdle in the accurate evaluation of genre classification systems. We can probably define around 10 lowest-level classes for this year, but should aim to add the same number again next year and the year after, until we can be confident that we have a database that poses a classification problem as difficult as a real-world application (such as organizing/classifying the whole Epitonic catalog).
+Dan Ellis wrote:
+> One of the reasons I am interested in a parallel evaluation of genre<br>
+> classification and artist ID is that it may provide some objective<br>
+> evidence for my gut bias against genres: if the results of different<br>
+> algorithms under genre classification are more equivocal than artist<br>
+> ID (i.e. they reveal less about the difference between the<br>
+> algorithms), then that's some kind of evidence that the task itself is<br>
+> ill-posed.  My suspicion is that multiple, real-valued genre<br>
+> memberships will be even less revealing.
+I also believe that many classification techniques and feature extractions are vulnerable to a smaller number of examples per class, and I think this is far more likely to show up in the comparison of the performance of an algorithm between the two tracks (my own submission will be modified for the artist ID track). Artist identification is about modeling a natural grouping within the data, whereas genres are not necessarily natural groupings, and I believe the accurate modeling of hierarchical, multi-modal genres is likely to be more complex than modeling an artist's work (although this is alleviated by the additional data available). An artist may work in a number of styles but there is *usually* some dimension along which all the examples are grouped.
+Dan Ellis wrote:
+> The most important thing, I think, is to define the evaluation to support <br>
+>(and encourage) the largest number of participants, meaning that we could <br>
+>include this as an option, but also evaluate a 1-best genre ID to remain <br>
+>accessible to algorithms that intrinsically can only report one class.
+With this in mind I think we should opt for a hierarchical taxonomy, which can support direct comparison of hierarchical classifiers, single label classifiers (by interpolating higher level classifications in evaluation framework) and multiple label classifiers (in a somewhat limited fashion, perhaps with a penalty for additional incorrect labels, which is probably not fair, or by limiting number of labels to match height of taxonomy). I suggest that each correct label scores one point, e.g. rock/pop, rock, indie rock would score 3 if all labels are correct.
+----
+2. Framework issues and algorithm submission
+I don't think it is particularly ambitious to have people submit their code for evaluation at a single location. I have already implemented a basic D2K framework that can run anything that will run from the command line, including Matlab. The only constraint is that a submission will have to conform to a simple text file input and output format. Marsyas-0.1 and Matlab examples have been produced and I am happy for people to take the I/O portions directly from this code if they wish. Having code submitted to a central evaluation site will allow us to perform cross-validation experiments and assess exactly how much variance there is in each system's performance. This would not be so essential if we had a very large data set (min 10,000 examples); however, we are going to get nowhere near that many (maybe in later years...). It was also suggested in the reviews that this would hamstring feature selection techniques (see review 2), but I don't believe this: surely the feature selection code (including any classifier used) would be correctly implemented in the feature extraction phase.
+I could also define an optional simple text file format for descriptors. This would allow the hybridization of any submitted systems using this format and the use of a benchmark classifier to evaluate the power of the descriptors and classifiers independently of each other. I would be happy to provide several benchmark classifiers for this purpose (possibly by creating an interface to Weka). I would also be interested in seeing the performance of a mixture of experts system, built from all the submitted systems, which should, in theory, be able to equal or better the performance of all of the submitted systems.
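As a rough illustration of combining submitted systems, the sketch below takes a plain majority vote over the systems' result files. This is a much cruder combiner than a trained mixture of experts and carries no guarantee of matching the best individual system; it is shown only because it needs nothing beyond the result format. The name `majority_vote` and the tie-breaking rule (the alphabetically first system's label wins among tied labels) are assumptions.

```python
from collections import Counter
from typing import Dict


def majority_vote(system_outputs: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Combine per-system predictions by majority vote.

    system_outputs maps a system name to {example path: predicted genre}.
    Only examples labelled by every system are combined.
    """
    if not system_outputs:
        return {}
    names = sorted(system_outputs)
    shared = set.intersection(*(set(system_outputs[n]) for n in names))
    combined: Dict[str, str] = {}
    for example in sorted(shared):
        votes = Counter(system_outputs[n][example] for n in names)
        # most_common keeps first-seen order among equal counts.
        combined[example] = votes.most_common(1)[0][0]
    return combined
```

A trained combiner would instead need per-system confidence values or the descriptor files themselves, which is where the optional descriptor format would come in.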
+M2K is coming up to its Alpha release and will include a cut-down version of the competition framework so that people can see how it works (External integration itinerary). As D2K can run across X Windows, we could even provide a virtual lab evaluation setup, so that each participant could run their own submission (without violating any IP laws) if they really wanted, and ensure that it ran OK. Anyone can get a license for D2K and the framework will probably be included in a later version of M2K, so anyone can make sure that their submission works ahead of time.
+----
+3. Producing ground truth and answers to Downie's comments
+First, I don't think we need to send out a tuning database; it creates problems and solves none. If data is held at the evaluation site we don't have any IP issues and, as I stated earlier, anyone could launch their submission themselves in D2K across an X Windows session (note that all console output from code external to D2K is collected and forwarded to the D2K console to aid debugging). If we use IP-free databases, we are unlikely to be able to validate ground truth with online services such as http://www.allmediaguide.com/, and it has also been suggested that IP-free databases are not necessarily representative of the whole music community. Several people have said that it doesn't matter if some of the labels are incorrect; however, I'm not afraid to volunteer to validate the labels of a subset of the data (say 200 files; humans can get through them quicker than you'd think), and if there were sufficient volunteers it would go a long way to establishing an IP-free research database with good ground truth (if I don't get any volunteers I won't consider this an option, so email me!).
+Personally I think we should use a large volume of copyrighted material, with labels confirmed by at least two sources (existing dbase label, CDDB and allmediaguide, or a human labeler). The format should be WAV (MP3s would have to be decoded to this anyway) and will be mono unless anyone specifically requests stereo (both can be made available or can be handled by framework).
+Should we rename this the Style classification task?
+----
+4. Who has data?
+Anyone with music (with or without ground-truth) that we can use should make themselves known ASAP. I can provide a fair selection of white label (IP-free) dance music in at least 3 subgenres, with labels defined by 3 expert listeners.