2007:Audio Cover Song Identification

From MIREX Wiki


The Audio Cover Song task was a new task for MIREX 2006. It was closely related to the 2007:Audio Music Similarity and Retrieval (AMS) task, as the cover songs were embedded in the Audio Music Similarity and Retrieval test collection. However, AMS has changed its input format this year, so Audio Cover Song and AMS will not be interlinked tasks this year.

Task Description

Embedded within the 1000 pieces of the Audio Cover Song database are 30 different "cover songs", each represented by 11 different "versions", for a total of 330 audio files (16-bit, monophonic, 22.05 kHz, WAV). The "cover songs" represent a variety of genres (e.g., classical, jazz, gospel, rock, folk-rock, etc.), and the variations span a variety of styles and orchestrations.

Using each of these cover song files in turn as the "seed/query" file, we will examine the returned lists of items for the presence of the other 10 versions of the "seed/query" file. See DPWE's Average Precision comments below in the Evaluation discussion section.

Input Files

Two input list files will be provided:

  1. A list of all 1000 test collection files
  2. A list of 330 cover song files

Output File

The only output will be a distance matrix file that is 330 rows by 1000 columns in the following format:

Example distance matrix 0.1 (replace this line with your system name)
1    path/to/audio/file/1.wav
2    path/to/audio/file/2.wav
3    path/to/audio/file/3.wav
N    path/to/audio/file/N.wav
Q/R    1        2        3        ...        N
1    0.0      1.241    0.2e-4     ...    0.4255934
2    1.241    0.000    0.6264     ...    0.2356447
3    50.2e-4  0.6264   0.0000     ...    0.3800000
...    ...      ...      ...        ...    ...
N    0.42559  0.23567  0.38       ...    0.000

All distances should be zero or positive (0.0+) and should not be infinite or NaN. Values should be separated by a TAB.
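As an illustration, here is a minimal Python sketch that renders a matrix in this layout. The function name, file lists, and distance function are hypothetical placeholders, not part of any required submission interface:

```python
def format_matrix(system_name, queries, collection, dist):
    """Render a query-by-collection distance matrix in the tab-separated
    layout described above. `dist` is any function returning a
    non-negative, finite distance between two file paths."""
    lines = ["Example distance matrix 0.1 (%s)" % system_name]
    # Numbered index of the collection files.
    for i, path in enumerate(collection, 1):
        lines.append("%d\t%s" % (i, path))
    # Header row, then one tab-separated row of distances per query.
    lines.append("Q/R\t" + "\t".join(str(i) for i in range(1, len(collection) + 1)))
    for qi, q in enumerate(queries, 1):
        lines.append("%d\t" % qi + "\t".join("%.6f" % dist(q, c) for c in collection))
    return "\n".join(lines) + "\n"
```

For the actual task the matrix would be 330 queries by 1000 collection files; the sketch works for any sizes.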

Audio Format Poll

<poll>
Use clips from tracks instead of the whole track?
Yes
No (this changes the task significantly)
No (for some other reason)
</poll>

<poll>
What is your preferred audio format for the Cover Song ID task?
22 kHz mono WAV
22 kHz stereo WAV
44 kHz mono WAV
44 kHz stereo WAV
22 kHz mono MP3 128kb
22 kHz stereo MP3 128kb
44 kHz mono MP3 128kb
44 kHz stereo MP3 128kb
</poll>


Evaluation Discussion

We could employ the same measures as were used in 2006:Audio Cover Song.


Evaluation measures: Perhaps the MRR of the 1st correctly classified instance could be changed to the MRR of the whole 10 answers...

dpwe: Average Precision is a popular and well-behaved measure for scoring the ranking of multiple correct answers in a retrieval task. It is calculated from a full list of ranked results as the average of the precisions (proportion of returns that are relevant) calculated when the ranked list is cut off at each true item. So if there are 4 true items, and they occur at ranks 1, 3, 4, and 6, the average precision is 0.25*(1/1 + 2/3 + 3/4 + 4/6) = 0.77. It has the nice properties of not cutting off the return list at some particular length, and of progressively discounting the contribution of items ranked deep in the list. It's also widely used in multimedia retrieval, so people are used to it. http://en.wikipedia.org/wiki/Information_retrieval#Average_precision
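The calculation dpwe describes can be sketched in a few lines of Python (the function name is our own):

```python
def average_precision(true_ranks):
    """Average precision from the 1-based ranks at which the true
    (relevant) items appear in the returned ranked list."""
    ranks = sorted(true_ranks)
    # Precision at each true item: (number of true items so far) / (its rank).
    return sum((i + 1) / rank for i, rank in enumerate(ranks)) / len(ranks)
```

For the example above, `average_precision([1, 3, 4, 6])` gives 0.77 (to two decimals).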

dpwe: In addition to a ranking task, I think it would be interesting to include a detection task i.e. does a cover version of this track exist in this database or not? This is essentially just setting a threshold on the metric used in ranking the returns, but different algorithms may be better or worse at setting such a threshold that is *consistent* between different tracks. This could be computed as a Receiver Operating Characteristic i.e. False Alarms vs. False Reject curve as a threshold is varied on the similarity scores returned by the unmodified algorithms.
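A toy sketch of the threshold sweep dpwe suggests, assuming one best-match distance per query and a flag saying whether a cover truly exists (all names and data here are hypothetical):

```python
def roc_points(distances, has_cover, thresholds):
    """False-alarm and false-reject rates as a detection threshold is
    swept over per-query distance scores. A query "detects" a cover when
    its best distance falls at or below the threshold; `has_cover` says
    whether a cover truly exists for that query."""
    n_neg = max(1, has_cover.count(False))
    n_pos = max(1, has_cover.count(True))
    points = []
    for t in thresholds:
        detected = [d <= t for d in distances]
        false_alarms = sum(det and not truth for det, truth in zip(detected, has_cover))
        false_rejects = sum(truth and not det for det, truth in zip(detected, has_cover))
        points.append((false_alarms / n_neg, false_rejects / n_pos))
    return points
```

Plotting these pairs over a fine grid of thresholds traces the ROC curve dpwe describes; an algorithm with a threshold that is consistent across tracks yields a curve closer to the origin.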

--jserra 11:58, 14 June 2007 (CDT):

- I agree on using AP for ranking instead of MRR.

- Regarding detection, I think a simple recall measure over, say, the first 10 answers would be enough. Or perhaps the mean number of detected cover songs within these first 10 ranked elements (as last year) would be more straightforward and intuitive. All this, of course, taking into account that all cover groups have the same number of items.

- I propose using geometric means (i.e., GMAP) to average results across queries. When averaging over all queries, this penalizes very bad answers to a given query. The geometric mean is commonly used in TREC.

- So, my proposal is: GMAP, and the geometric mean of the number of correctly identified covers in the first 10 retrieved documents.
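For illustration, GMAP as jserra proposes can be computed as the geometric mean of per-query average precisions. The small epsilon offset is a common TREC convention to keep a single zero score from collapsing the mean; the exact epsilon value here is our assumption:

```python
import math

def gmap(ap_scores, eps=1e-5):
    """Geometric mean of per-query average-precision scores (GMAP).
    The epsilon offset (a TREC convention; this particular value is an
    assumption) prevents one zero score from driving the mean to 0."""
    log_sum = sum(math.log(ap + eps) for ap in ap_scores)
    return math.exp(log_sum / len(ap_scores)) - eps
```

Unlike the arithmetic mean, one very poor query pulls the geometric mean down sharply, which is exactly the penalization jserra describes.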

--jserra 03:01, 12 July 2007 (CDT)

Audio length: I think one of the most challenging scenarios for a cover song identification system is dealing with different song structures within a group of covers. Therefore, the whole song should be used for this particular task, instead of the 30-second clips proposed on the Audio Similarity and Retrieval wiki page.


Potential Participants

  • Joan Serrà (jserra at iua dot upf dot edu) and Emilia Gómez (egomez at iua dot upf dot edu)
  • Dan Ellis (dpwe at ee dot columbia dot edu)
  • Juan P. Bello (jpbello at nyu dot edu)
  • Kyogu Lee (kglee at ccrma dot stanford dot edu)
  • Matija Marolt (matija.marolt at fri dot uni-lj dot si)