2013:Audio Music Similarity and Retrieval

From MIREX Wiki


As the size of digitial music collections grow, music similarity has an increasingly important role as an aid to music discovery. A music similarity system can help a music consumer find new music by finding the music that is most musically similar to specific query songs (or is nearest to songs that the consumer already likes).

This page presents the Audio Music Similarity Evaluation, including the submission rules and formats. Additionally background information can be found here that should help explain some of the reasoning behind the approach taken in the evaluation. The intention of the Music Audio Search track is to evaluate music similarity searches (A music search engine that takes a single song as a query aka Query-by-example), not playlist generation or music recommendation.

The Audio Music Similarity and Retrieval task has been run in MIREX 2012, 2011, 2010, 2009, 2007, and 2006.

Audio Music Similarity and Retrieval task in MIREX 2012 || Results

Audio Music Similarity and Retrieval task in MIREX 2011 || Results

Audio Music Similarity and Retrieval task in MIREX 2010 || Results

Audio Music Similarity and Retrieval task in MIREX 2009 || Results

Audio Music Similarity and Retrieval task in MIREX 2007 || Results

Audio Music Similarity and Retrieval task in MIREX 2006 || Results

Task specific mailing list

In continuing with the changes that took place last year, task specific lists have been dropped. Instead we are asking that all discussions take place on the MIREX "EvalFest" list. If you have an question or comment, simply include the task name in the subject heading.


Collection statistics: 7000 30-second audio clips drawn from 10 genres (700 clips from each genre).

The Genres that data was drawn from are:

  • Blues
  • Jazz
  • Country/Western
  • Baroque
  • Classical
  • Romantic
  • Electronica
  • Hip-Hop
  • Rock
  • HardRock/Metal

Audio formats

Participating algorithms will have to read audio in the following format:

  • Sample rate: 22 KHz
  • Sample size: 16 bit
  • Number of channels: 1 (mono)
  • Encoding: WAV
  • clip length: 30 secs from the middle of each file


Two distinct evaluations will be performed

  • Human Evaluation
  • Objective statistics derived from the results lists

Note that at MIREX 2006 particpating algorithms were required to return full distance matrices showing the distance between all tracks, however, in subsequent years we have also supported sparse distance matrix format (detailed below) where only the distances of the top 100 results for each query in the collection are returned.

Human Evaluation

The primary evaluation will involve subjective judgments by human evaluators of the retrieved sets using IMIRSEL's Evalutron 6000 system. This year algorithms will be presented with the same 30 second preview clip that will be reviewed by the human evaluators.

  • Evaluator question: Given a search based on track A, the following set of results was returned by all systems. Please place each returned track into one of three classes (not similar, somewhat similar, very similar) and provide an inidcation on a continuous scale of 0 - 10 of high similar the track is to the query.
  • ~120 randomly selected queries, 5 results per query, 1 set of ears, ~10 participating labs
  • Higher number of queries preferred as IR research indicates variance is in queries
  • The songs by the same artist as the query will be filtered out of each result list (artist-filtering) to avoid colouring an evaluators judgement (a cover song or song by the same artist in a result list is likely to reduce the relative ranking of other similar but independent songs - use of songs by the same artist may allow over-fitting to affect the results)
  • It will be possible for researchers to use this data for other types of system comparisons after MIREX 2013 results have been finalized.
  • Human evaluation to be designed and led by IMIRSEL following a similar format to that used at MIREX 2006 (see: Evalutron Issues in MIREX 2006).
  • Human evaluators will be drawn from the participating labs (and any volunteers from IMIRSEL or on the MIREX lists)

Objective Statistics derived from the distance matrix

Statistics of each distance matrix will be calculated including:

  • Average % of Genre, Artist and Album matches in the top 5, 10, 20 & 50 results - Precision at 5, 10, 20 & 50
  • Average % of Genre matches in the top 5, 10, 20 & 50 results after artist filtering of results
  • Average % of available Genre, Artist and Album matches in the top 5, 10, 20 & 50 results - Recall at 5, 10, 20 & 50 (just normalising scores when less than 20 matches for an artist, album or genre are available in the database)
  • Always similar - Maximum # times a file was in the top 5, 10, 20 & 50 results
  •  % File never similar (never in a top 5, 10, 20 & 50 result list)
  •  % of 'test-able' song triplets where triangular inequality holds
    • Note that as we are not requiring full distance matrices this year we will only be testing triangles that are found in the sparse distance matrix.
  • Plot of the "number of times similar curve" - plot of song number vs. number of times it appeared in a top 20 list with songs sorted according to number times it appeared in a top 20 list (to produce the curve). Systems with a sharp rise at the end of this plot have "hubs", while a long 'zero' tail shows many never similar results.


In addition computation times for feature extraction/Index-building and querying will be measured.

Submission format

Submission to this task will have to conform to a specified format detailed below.

Implementation details

Scratch folders will be provided for all submissions for the storage of feature files and any model or index files to be produced. Executables will have to accept the path to their scratch folder as a command line parameter. Executables will also have to track which feature files correspond to which audio files internally. To facilitate this process, unique filenames will be assigned to each audio track.

The audio files to be used in the task will be specified in a simple ASCII list file. This file will contain one path per line with no header line. Executables will have to accept the path to these list files as a command line parameter. The formats for the list files are specified below.

Multi-processor compute nodes (2, 4 or 8 cores) will be used to run this task. Hence, participants could attempt to use parrallelism. Ideally, the number of threads to use should be specified as a command line parameter. Alternatively, implementations may be provided in hard-coded 2, 4 or 8 thread configurations. Single threaded submissions will, of course, be accepted but may be disadvantaged by time constraints.

Submissions will have to output either a full distance matrix or a search results file with the top 100 search results for each track in the collection. This list of results will be used to extract the artist-filtered results to present to the human evaluators and will facilitate the computation of the objective statistics.

I/O formats

In this section the input and output files used in this task are described as are the command line calling format requirements for submissions.

Audio collection list file (input)

The list file passed for feature extraction and indexing will be a simple ASCII list file. This file will contain one path per line with no header line, all paths will be absolute (full paths).



Distance matrix output files

Participants should return one of two available output file formats, a full distance matrix or a sparse distance matrix. The sparse distance matrix format is preferred (as the dense distance matrices can be very large).

Sparse Distance Matrix

If computation or exhaustive search is a concern or not a normal output of the indexing algorithm employed, the sparse distance matric format detailed below may be used:

A simple ASCII file listing a name for the algorithm and the top 100 search results for every track in the collection.

This file should start with a header line with a name for the algorithm and should be followed by the results for one query per line, prefixed by the filename portion of the query path. This should be followed by a tab character and a tab separated, ordered list of the top 100 search results. Each result should include the result filename (e.g. a034728.wav) and the distance (e.g. 17.1 or 0.23) separated by a a comma.

MyAlgorithm (my.email@address.com)
<example 1 filename>\t<result 1 name>,<result 1 distance>,\t<result 2 name>,<result 2 distance>, ... \t<result 100 name>,<result 100 distance>
<example 2 filename>\t<result 1 name>,<result 1 distance>,\t<result 2 name>,<result 2 distance>, ... \t<result 100 name>,<result 100 distance>

which might look like:

MyAlgorithm (my.email@address.com)
a009342.wav	b229311.wav,0.16	a023821.wav,0.19	a001329,0.24  ... etc.
a009343.wav	a661931.wav,0.12	a043322.wav,0.17	c002346,0.21  ... etc.
a009347.wav	a671239.wav,0.13	c112393.wav,0.20	b083293,0.25  ... etc.

The path to which this list file should be written must be accepted as a parameter on the command line.

Full Distance Matrix

Full distance matrix files should be generated in the the following format:

  • A simple ASCII file listing a name for the algorithm on the first line,
  • Numbered paths for each file appearing in the matrix, these can be in any order (i.e. the files don't have to be i the same order as they appeared in the list file) but should index into the columns/rows of of the distance matrix.
  • A line beginning with 'Q/R' followed by a tab and tab separated list of the numbers 1 to N, where N is the files covered by the matrix.
  • One line per file in the matrix give the distances of that files to each other file in the matrix. All distances should be zero or positive (0.0+) and should not be infinite or NaN. Values should be separated by a single tab character. Obviously the diagonal of the matrix (distance or a track to itself) should be zero.
Distance matrix header text with system name
1\t0.0\t<dist 1 to 2>\t<dist 1 to 3>\t...\t<dist 1 to N>
2\t<dist 2 to 1>\t0.0\t<dist 2 to 3>\t...\t<dist 2 to N>
3\t<dist 3 to 2>\t<dist 3 to 2>\t0.0\t...\t<dist 3 to N>
N\t<dist N to 1>\t<dist N to 2>\t<dist N to 3>\t...\t0.0

which might look like:

Example distance matrix 0.1
1    /path/to/audio/file/1.wav
2    /path/to/audio/file/2.wav
3    /path/to/audio/file/3.wav
4    /path/to/audio/file/4.wav
Q/R   1        2        3        4
1     0.00000  1.24100  0.2e-4   0.42559
2     1.24100  0.00000  0.62640  0.23564
3     50.2e-4  0.62640  0.00000  0.38000
4     0.42559  0.23567  0.38000  0.00000

Example submission calling formats

  extractFeatures.sh /path/to/scratch/folder /path/to/collectionListFile.txt
  Query.sh /path/to/scratch/folder /path/to/collectionListFile.txt /path/to/outputResultsFile.txt


  doAudioSim.sh -numThreads 8 /path/to/scratch/folder /path/to/collectionListFile.txt /path/to/outputResultsFile.txt

Packaging submissions

All submissions should be statically linked to all libraries (the presence of dynamically linked libraries cannot be guarenteed).

All submissions should include a README file including the following the information:

  • Command line calling format for all executables and an example formatted set of commands
  • Number of threads/cores used or whether this should be specified on the command line
  • Expected memory footprint
  • Expected runtime
  • Any required environments (and versions), e.g. python, java, bash, matlab.

Time and hardware limits

Due to the potentially high number of particpants in this and other audio tasks, hard limits on the runtime of submissions are specified.

A hard limit of 72 hours will be imposed on runs (total feature extraction and querying times). Submissions that exceed this runtime may not receive a result.

AMS evaluation software

The legacy software for performing various AMS related function is available here, README file. It maybe used to benchmark systems prior to submission and to check distance matrix file formats.

This tool set supports the following functions:

  • the import of collection metadata from a delimited text file (e.g. TAB or CSV)
  • the selection of a stratified random list of queries from the collection (i.e. an equal number of queries are chosen for each class of a particular metadata field, such as genre).
  • the generation of results from distance matrices based on a list of pre-chosen queries.
  • (pseudo-)objective statistical evaluation of distance matrices by comparing query metadata to the metadata of the top N results retrieved. Supports artist, album, genre and artist-filtered genre (where results form the same artist as query are skipped). Additionally, the number tracks never returned as results for all possible queries (orphans) and the largest hub (track similar to the most other tracks) are measured. Finally, the number of cases where the triangular inequality holds.
  • preparation and post processing of results for the IMIRSEL Evalutron 6k human evaluation interface.

Potential Participants

name / email