2005:Audio Artist
Proposer
Kris West (Univ. of East Anglia) kw@cmp.uea.ac.uk
Title
Artist or group identification from musical audio.
Description
Automatic identification of the performing artist or group from musical audio.
1) Input data
The input for this task is a set of sound file excerpts adhering to the format, metadata and content requirements given below.
Audio format:
- CD-quality (PCM, 16-bit, 44100 Hz)
- single channel (mono)
- Either whole files or 1 minute excerpts
Audio content:
- Any type of music
- data set should include at least 25 different artists or groups working in any genre
- both live performances and sequenced music are eligible
- Each artist should be represented by a minimum of 10 examples; if possible, each artist should be represented by the same number of examples.
- If possible, a subset of the data (20%) should be given to participants in the contest format. It is not essential that these examples belong to the final database (distribution of which may be constrained by copyright issues), as they should primarily be used for testing correct execution of algorithm submissions.
- It would be good to enforce some sort of cross-album component for the actual contest, to avoid producer detection (see the split sketch after this list)
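A minimal sketch of how such a cross-album constraint might be enforced, assuming each example also carries an album field (the proposal itself only requires artist labels, so the album metadata and the function below are illustrative assumptions): whole albums are held out per artist so that no album contributes to both the training and test sides.

```python
import random
from collections import defaultdict

def cross_album_split(examples, test_albums_per_artist=1, seed=0):
    """Hold out whole albums per artist, so training and test never share an album.

    examples: list of (file_path, artist, album) tuples (album is an assumed field).
    """
    rng = random.Random(seed)
    albums_by_artist = defaultdict(set)
    for _, artist, album in examples:
        albums_by_artist[artist].add(album)

    held_out = set()
    for artist, albums in albums_by_artist.items():
        # Artists with a single album would end up entirely in the test split;
        # such cases would need special handling in a real data set.
        chosen = rng.sample(sorted(albums), min(test_albums_per_artist, len(albums)))
        held_out.update((artist, album) for album in chosen)

    train = [e for e in examples if (e[1], e[2]) not in held_out]
    test = [e for e in examples if (e[1], e[2]) in held_out]
    return train, test
```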
Metadata:
- By definition each example must have an artist or group label corresponding to one of the output classes.
- It is assumed that artist labels will be correct; however, where possible, existing artist labels should be confirmed by two or more non-entrants. Due to IP constraints it is unlikely that we will be allowed to distribute any database for metadata validation by participants. This validation should ensure that each artist or group has a single label applied to all of their examples, and that any conflicts, such as an artist also belonging to a group represented within the data, are resolved or removed for simplicity. Other possibilities include allowing multiple artist labels and requiring submissions to identify each label, with the final score divided evenly among the labels (I doubt there is demand for this).
- The training set should be defined by a text file with one entry per line, in the following format:
<example path and filename>\t<artist label>\n
2) Output results
Results should be output into a text file with one entry per line in the following format: <example path and filename>\t<artist classification>\n
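A minimal sketch of the I/O contract these two formats imply, assuming tab-separated lines as specified above (the file names in the usage comment are hypothetical):

```python
def read_training_list(path):
    """Read a training-set definition: one '<file path>\t<artist label>' entry per line."""
    examples = []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                file_path, artist = line.split("\t")
                examples.append((file_path, artist))
    return examples

def write_results(path, predictions):
    """Write results: one '<file path>\t<predicted artist>' entry per line."""
    with open(path, "w") as f:
        for file_path, predicted_artist in predictions:
            f.write(f"{file_path}\t{predicted_artist}\n")

# Hypothetical usage:
# train = read_training_list("trainset.txt")
# write_results("results.txt", [("audio/example01.wav", "Artist A")])
```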
Potential Participants
- Dan Ellis & Brian Whitman (Columbia University, MIT), dpwe@ee.columbia.edu, Medium
- Elias Pampalk (ÖFAI), elias@oefai.at, Medium
- George Tzanetakis (Univ. of Victoria), gtzan@cs.uvic.ca, Medium
- Kris West (Univ. of East Anglia), kw@cmp.uea.ac.uk, High
- Thomas Lidy & Andreas Rauber (Vienna University of Technology), lidy@ifs.tuwien.ac.at, rauber@ifs.tuwien.ac.at, Medium
- Fabien Gouyon (Universitat Pompeu Fabra), fabien.gouyon@iua.upf.es, Medium
- François Pachet (Sony CSL-Paris), pachet@csl.sony.fr, Medium
Evaluation Procedures
3-fold (or 5-fold, time permitting) cross-validation of all submissions, using an equal proportion of each class in each fold.
Evaluation measures:
- Simple accuracy and standard deviation of results (in the event of uneven class sizes, these should be normalised according to class size).
- Test the significance of differences in the error rates of each system at each iteration using McNemar's test, reporting the mean and standard deviation of p-values (a sketch of these measures follows this list).
- Perhaps specify different numbers of classes (1-in-10, 1-in-50, 1-in-1000) to test the scaling and robustness of different implementations
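A minimal sketch of the proposed measures, assuming each submission's predictions are available as a list of artist labels aligned with the ground truth; the function names and the use of scipy are illustrative assumptions, not part of the proposed framework:

```python
from collections import defaultdict
from scipy.stats import chi2

def stratified_folds(labels, n_folds=3):
    """Assign each example to a fold so that each class is spread evenly across folds."""
    folds = [0] * len(labels)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[i] = j % n_folds
    return folds

def normalised_accuracy(truth, predicted):
    """Mean of per-class accuracies, so uneven class sizes do not bias the score."""
    per_class = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for t, p in zip(truth, predicted):
        per_class[t][1] += 1
        per_class[t][0] += int(t == p)
    return sum(correct / total for correct, total in per_class.values()) / len(per_class)

def mcnemar_p(truth, preds_a, preds_b):
    """McNemar's test (chi-square with continuity correction) on the examples
    that the two systems classify differently."""
    only_a = sum(1 for t, a, b in zip(truth, preds_a, preds_b) if a == t and b != t)
    only_b = sum(1 for t, a, b in zip(truth, preds_a, preds_b) if a != t and b == t)
    if only_a + only_b == 0:
        return 1.0
    stat = (abs(only_a - only_b) - 1) ** 2 / (only_a + only_b)
    return chi2.sf(stat, df=1)
```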
Evaluation framework:
Competition framework to be defined in Data-2-Knowledge, D2K (http://alg.ncsa.uiuc.edu/do/tools/d2k), that will allow submission of contributions both in native D2K (using Music-2-Knowledge, http://www.isrl.uiuc.edu/~music-ir/evaluation/m2k/, first release due 20th Jan 2005) and in Matlab, Python and C++ using the external code integration services provided by M2K. Submissions will be required to read in training set definitions from a text file in the format specified in section 1 above and to output results in the format described in section 2 above. The framework will define the test and training sets for each iteration of cross-validation, evaluate and rank the results, and perform McNemar's testing of the differences between the error rates of each system. An example framework could be made available early February for submission development.
Relevant Test Collections
(Note: potentially significant data overlap between this task and the genre classification competition.)
- Re-use the Magnatune database (???)
- Individual contributions of copyright-free recordings (including white-label vinyl and music DBs under Creative Commons licences)
- Individual contributions of usable but copyright-controlled recordings (including in-house recordings from music departments)
- Solicit contributions from http://creativecommons.org/audio/, http://www.mp3.com/ (offers several free audio streams) and similar sites
Ground truth annotations:
All annotations should be validated by at least two non-participating volunteers (if possible), to ensure homogeneity of artist labels. If copyright restrictions allow, this could be extended to each of the participating groups, with the final classification decided by a majority vote. Any particularly contentious classifications could be removed.
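A minimal sketch of the majority-vote resolution, assuming validators' labels for an example are collected into a list (the function name is illustrative):

```python
from collections import Counter

def resolve_artist_label(votes):
    """Return the majority artist label among validators, or None for a
    contentious case (no strict majority) that should be reviewed or removed."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

# e.g. resolve_artist_label(["Artist A", "Artist A", "Artist B"]) -> "Artist A"
```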
Review 1
Review 2
This proposal is very interesting and it is one of the most well defined. Indeed it seems quite straightforward to establish the ground truth and to evaluate the results.
The participants mentioned really do belong to the field. People working on voice separation, such as Feng, Zhuang & Pan and Tsai & Wang, could be added.
The test data are also relevant and seem easy to obtain. The RWC database could also provide some data. However, I don't think that data synthesized from MIDI can be used (to avoid "MIDI-producer" detection).
My main concern is about the range of genres spanned by the data. Indeed, if most data come from different genres, the problem becomes far easier and less relevant. I believe that artist identification and artist similarity (which is close to genre classification) are very different queries, and that artist identification is relevant only within a given genre. Thus I would like the evaluation to be performed on one or two sets of artists belonging to a single genre (say classical or rock) and containing some very similar artists (say Mozart/Haydn/Gluck or The Beatles/The Rolling Stones/The Who).
Downie's Comments
Review #2 does raise the interesting point of too much spread in the "genre" aspect. I do see how it could turn into a genre task if not thought out. Would be interesting to also add in the idea of "covers": the same pieces but performed by different artists. Maybe, if possible, a mix of "live" and "studio" recordings of the same pieces, if available?
Some questions:
1. Why PCM? Why mono? Why not MP3? Am being a bit of a weeny, but I am interested.
2. Do we *really* need to supply the training set? Being both provocative and pragmatic with this question.