2008:Audio Tag Classification

From MIREX Wiki
Revision as of 09:52, 15 July 2008 by Bertinmt (talk | contribs) (Packaging submissions)

Overview

This task will compare various algorithms' abilities to associate tags with 10-second audio clips of songs. The tags come from the MajorMiner game. This task is very much related to the other audio classification tasks. One new twist, however, is that many tags can apply to the same clip, so instead of one N-way classification per clip, this task requires N binary classifications per clip. In order to set a standard baseline of 50% accuracy for the clips in the test set, there will always be the same number of positive and negative examples.

Status

A provisional specification of the tag classification task is detailed below. This proposal may be refined based on feedback from the participants.

Note that audio tag classification is a new task at MIREX 2008.

Please feel free to edit this page.

Discussion Points

It is possible for each tag to be treated as a completely separate classification problem. It is also possible to present the tags "all at once" for training, but then separately for testing. The former is a subset of the latter, and learning separate classifiers can be done inside any "all at once" classifier. The separate approach, however, has the nice property of being almost identical to the other audio classification tasks.

Possible ways of presenting training tags

  • One at a time
  • All at once

TBM: I strongly support all at once. As mentioned above, one at a time is almost identical to genre classification, but I'm not sure that is a positive: tagging is different from finding one genre. Also, the relative performance of the "all at once" versus "one at a time" algorithms is of interest. However, submissions should clearly state whether they learn tags independently, as the "one at a time" approach should be harder.

This task could also be run as a retrieval task using a metric like precision-at-10. It could also be evaluated as a classifier on unbalanced test sets with other metrics like area under the ROC curve or F-measure. The choice of metric would obviously change the types of evaluations that could be performed. The fact that there are no definite negative tags might make an evaluation with many examples more difficult.

Possible evaluation metrics

  • Classification accuracy on a balanced dataset
  • Precision-at-K
  • Area under the ROC curve
  • F-measure
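
The metrics listed above can be sketched in a few lines. This is only an illustration, not the evaluation code for the task: the function names are hypothetical, `scores` are assumed to be per-clip affinities for a single tag (higher means more likely), and `labels` are assumed to be 1 for clips carrying the tag and 0 otherwise.

```python
from typing import Sequence


def precision_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of the k highest-scored clips that truly carry the tag."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


def f_measure(predicted: Sequence[int], labels: Sequence[int]) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Precision-at-K and the area under the ROC curve need real-valued scores, whereas F-measure and classification accuracy need hard yes/no decisions, so the choice of metric constrains the output format.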

TBM: What kind of output should our algorithms produce? A yes/no value (0 or 1) for each song/tag pair, or a more refined value between 0 and 1, for example? With 0 = does not apply and 1 = would apply, we could put a threshold at 0.5 when we need to decide whether to apply a tag, but we could also rank tags for a particular song.

Data

All of the data is browseable via the MajorMiner search page.

Music

The music consists of 2300 clips selected at random from 3900 tracks. Each clip is 10 seconds long. The 2300 clips represent a total of 1400 different tracks on 800 different albums by 500 different artists. To give a sense for the music collection, the following genre tags have been applied to these artists, albums, and tracks on Last.fm: electronica, rock, indie, alternative, pop, britpop, idm, new wave, hip-hop, singer-songwriter, trip-hop, post-punk, ambient, jazz.

Tags

The MajorMiner game has collected a total of about 73000 taggings, 12000 of which have been verified by at least two users. In these verified taggings, there are 43 tags that have been verified at least 35 times, for a total of about 9000 verified uses. These are the tags we will be using in this task.

Note that these data do not include strict negative labels. While many clips are tagged rock, none are tagged not rock. Frequently, however, a clip will be tagged many times without being tagged rock. We take this as an indication that rock does not apply to that clip. More specifically, a negative example of a particular tag is a clip on which another tag has been verified, but the tag in question has not.
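
A minimal sketch of this definition, combined with the balancing of positive and negative examples mentioned in the overview. The `verified` mapping (clip id to the set of tags verified on that clip) and both function names are assumptions made for illustration; they are not part of the task specification.

```python
import random
from typing import Dict, List, Sequence, Set, Tuple


def split_examples(verified: Dict[str, Set[str]], tag: str) -> Tuple[List[str], List[str]]:
    """Return (positives, negatives) for one tag.

    `verified` is a hypothetical in-memory layout: clip id -> verified tags.
    """
    positives = sorted(clip for clip, tags in verified.items() if tag in tags)
    # Negative: some other tag has been verified on the clip, but not this one.
    negatives = sorted(clip for clip, tags in verified.items() if tags and tag not in tags)
    return positives, negatives


def balance(positives: Sequence[str], negatives: Sequence[str],
            seed: int = 0) -> Tuple[List[str], List[str]]:
    """Subsample the larger class so both classes have the same size."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    return rng.sample(list(positives), n), rng.sample(list(negatives), n)
```

Clips with no verified tags at all are excluded from both classes, since nothing can be inferred about them.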

TBM: How many tags will we predict, the top 50? That seems good to me. And what is the lowest number of (verified) tags applied to any one of the 2300 clips?

Here is a list of the top 50 tags along with an approximate number of times each has been verified, how many times it's been used in total, and how many different users have ever used it:

Tag           Verified  Total  Users
drums              962   3223    127
guitar             845   3204    181
male               724   2452     95
rock               658   2619    198
synth              498   1889    105
electronic         490   1878    131
pop                479   1761    151
bass               417   1632     99
vocal              355   1378     99
female             342   1387    100
dance              322   1244    115
techno             246    943    104
piano              179    826    120
electronica        168    686     67
hip hop            166    701    126
voice              160    790     55
slow               157    727     90
beat               154    708     90
rap                151    723    129
jazz               136    735    154
80s                130    601     94
fast               109    494     70
instrumental       103    539     62
drum machine        89    427     35
british             81    383     60
country             74    360    105
distortion          73    366     55
saxophone           70    316     86
house               65    298     66
ambient             61    335     78
soft                61    351     58
silence             57    200     35
r&b                 57    242     59
strings             55    252     62
quiet               54    261     57
solo                53    268     56
keyboard            53    424     41
punk                51    242     76
horns               48    204     38
drum and bass       48    191     50
noise               46    249     61
funk                46    266     90
acoustic            40    193     58
trumpet             39    174     68
end                 38    178     36
loud                37    218     62
organ               35    169     46
metal               35    178     64
folk                33    195     58
trance              33    226     49

Audio Formats

Participating algorithms will have to read audio in the following format:

  • Sample rate: 44.1 kHz
  • Sample size: 16 bit
  • Number of channels: 2 (stereo)
  • Encoding: mp3

Requests for additional audio formats will be considered, if they are submitted a minimum of three weeks before the submission deadline.

Evaluation

Participating algorithms will be evaluated with 3-fold cross-validation. Artist filtering will be applied to the test and training splits, i.e. training and test sets will contain different artists. The raw classification accuracy and standard deviation for each tag and each algorithm will be computed.
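
One way to build artist-filtered folds is sketched below. The greedy assignment strategy, the function name, and the `clip_artist` mapping are all assumptions for illustration; the actual splits used in the evaluation may be constructed differently.

```python
from collections import defaultdict
from typing import Dict, List


def artist_filtered_folds(clip_artist: Dict[str, str], n_folds: int = 3) -> List[List[str]]:
    """Assign clips to folds so that no artist appears in more than one fold.

    Artists are placed greedily, largest first, into the currently
    smallest fold to keep fold sizes roughly even.
    """
    by_artist = defaultdict(list)
    for clip, artist in clip_artist.items():
        by_artist[artist].append(clip)
    folds: List[List[str]] = [[] for _ in range(n_folds)]
    for artist in sorted(by_artist, key=lambda a: (-len(by_artist[a]), a)):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(by_artist[artist])
    return folds
```

Each fold then serves once as the test set while the other two folds form the training set, so all clips by a given artist are either trained on or tested on, never both.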

Beta-Binomial model

In order to make the variance of the accuracy estimates the same for all tags, the same number of test examples must be used for every tag. This unnecessarily reduces the amount of test data, which can be avoided if we use the beta-binomial empirical Bayes estimator of accuracy. The basic idea of the model is that, for each submission, it is possible to intelligently combine the overall performance with the performance on each tag in proportion to the number of examples of each tag. Basically, performance on tags with more examples will matter more, and performance on tags with fewer examples will be "shrunk" towards the mean of all of the tags. The Wikipedia page on the beta-binomial model is a bit sparse, but slightly informative.

More specifically, the beta-binomial model treats performance on each tag as a binomial random variable, with the parameter of that binomial (the probability of success) drawn from a beta distribution. The parameters of the beta distribution will be estimated and will yield a mean and variance that can be used to compare algorithms. See Chapter 5 of Bayesian Data Analysis by Gelman, Carlin, Stern, and Rubin for even more detail.
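
The shrinkage effect can be illustrated with a rough sketch. This is not the estimator that will be used in the evaluation: it fits the beta prior by a crude method of moments on the raw per-tag accuracies (which ignores the binomial sampling noise and so tends to overstate the prior variance), and the function names are hypothetical. `correct[i]` and `total[i]` are assumed to be the number of correctly classified and total test examples for tag i.

```python
from statistics import mean, pvariance
from typing import List, Sequence, Tuple


def fit_beta_prior(correct: Sequence[int], total: Sequence[int]) -> Tuple[float, float]:
    """Crude method-of-moments fit of Beta(alpha, beta) to per-tag accuracies."""
    rates = [k / n for k, n in zip(correct, total)]
    m = mean(rates)
    v = pvariance(rates)
    if v <= 0 or v >= m * (1 - m):
        raise ValueError("moment matching needs 0 < variance < m * (1 - m)")
    strength = m * (1 - m) / v - 1  # alpha + beta, the prior pseudo-count
    return m * strength, (1 - m) * strength


def shrunken_accuracies(correct: Sequence[int], total: Sequence[int]) -> List[float]:
    """Posterior mean accuracy per tag under the fitted beta prior."""
    alpha, beta = fit_beta_prior(correct, total)
    return [(alpha + k) / (alpha + beta + n) for k, n in zip(correct, total)]
```

Each posterior mean is a weighted average of the tag's raw accuracy (weight n) and the prior mean (weight alpha + beta), so a tag with 10 test examples is pulled much further towards the overall mean than a tag with 100.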


Ranking and significance testing

Additionally, more standard tests could be performed on the average classification accuracy, although the cross-tag variance tends to increase each algorithm's variance, interfering with significance tests without further handling.

In addition, computation times for feature extraction and training/classification will be measured.

Submission format

Submissions to this task will have to conform to the format detailed below, which is very similar to that of the audio genre classification task, among others.

Implementation details

Scratch folders will be provided for all submissions for the storage of feature files and any model files to be produced. Executables will have to accept the path to their scratch folder as a command line parameter. Executables will also have to track which feature files correspond to which audio files internally. To facilitate this process, unique filenames will be assigned to each audio track.

The audio files to be used in the task will be specified in a simple ASCII list file. For feature extraction and classification this file will contain one path per line with no header line. For model training this file will contain one path per line, followed by a tab character and the tag label, again with no header line. Executables will have to accept the path to these list files as a command line parameter. The formats for the list files are specified below.

Algorithms should divide their feature extraction and training/classification into separate executables/scripts. This will facilitate a single feature extraction step for the task, while training and classification can be run for each cross-validation fold.

Multi-processor compute nodes (2, 4 or 8 cores) will be used to run this task. Hence, participants should attempt to use parallelism wherever possible. Ideally, the number of threads to use should be specified as a command line parameter. Alternatively, implementations may be provided in hard-coded 2, 4 or 8 thread configurations. Single-threaded submissions will, of course, be accepted but may be disadvantaged by time constraints.
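
As an illustration of accepting the scratch folder, the list file and a thread count on the command line, a submission written in Python might start from something like the sketch below. The `-numThreads` flag follows the example calling formats on this page; the function names and the `extract_one` callback are hypothetical.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def parse_args(argv=None):
    """Parse: [-numThreads N] /path/to/scratch/folder /path/to/listFile.txt"""
    parser = argparse.ArgumentParser(description="Feature extraction (illustrative)")
    parser.add_argument("-numThreads", type=int, default=1, help="number of worker threads")
    parser.add_argument("scratch_folder", help="folder for feature and model files")
    parser.add_argument("list_file", help="ASCII list file, one audio path per line")
    return parser.parse_args(argv)


def extract_all(paths, extract_one, num_threads):
    """Apply extract_one to every path with num_threads workers; results keep input order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(extract_one, paths))
```

In CPython, threads only speed up work that releases the global interpreter lock (I/O or native numerical code); for pure-Python feature extraction, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface with separate processes.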


I/O formats

In this section the input and output files used in this task are described as are the command line calling format requirements for submissions.

Feature extraction list file

The list file passed for feature extraction will be a simple ASCII list file. This file will contain one path per line with no header line.

Training list file

The list file passed for model training will be a simple ASCII list file. This file will contain one path per line, followed by a tab character and the tag label, again with no header line.

E.g. <example path and filename>\t<tag classification>

Depending on the results of the poll above, there might be only one line for each path or there might be multiple lines for each path, one for each tag associated with it.
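
A minimal sketch of reading such a file, assuming the variant with one tag per line (so a path with several tags appears on several lines). The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def read_training_list(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each audio path to the set of tags listed for it."""
    tags_by_path: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines, e.g. at the end of the file
        # Split on the first tab only, so tags containing spaces stay intact.
        path, tag = line.split("\t", 1)
        tags_by_path[path].add(tag)
    return dict(tags_by_path)
```

It can be applied directly to an open file, for example `with open(list_path, encoding="ascii") as handle: tags = read_training_list(handle)`, where `list_path` is the list file path received on the command line.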

Test (classification) list file

The list file passed for testing classification will be a simple ASCII list file identical in format to the Feature extraction list file. This file will contain one path per line with no header line.

Classification output files

Participating algorithms should produce a simple ASCII list file identical in format to the Training list file. This file will contain one path per line, followed by a tab character and the tag label, again with no header line. E.g.:

<example path and filename>\t<tag classification>

Again, depending on the results of the poll above, there might be only one line per path, or one line per (path, tag) pair.

The path to which this list file should be written must be accepted as a parameter on the command line.
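
Writing the output file could look like the following sketch, which assumes the one line per (path, tag) pair variant. Both function names are hypothetical, and `predicted` is assumed to map each test path to the tags the algorithm decided to apply.

```python
from typing import Dict, Iterable


def format_classifications(predicted: Dict[str, Iterable[str]]) -> str:
    """Render one `path<TAB>tag` line per predicted pair, with no header line."""
    lines = [f"{path}\t{tag}" for path in sorted(predicted) for tag in sorted(predicted[path])]
    return "".join(line + "\n" for line in lines)


def write_classifications(output_path: str, predicted: Dict[str, Iterable[str]]) -> None:
    """Write the list file to the path given on the command line."""
    with open(output_path, "w", encoding="ascii", newline="\n") as handle:
        handle.write(format_classifications(predicted))
```

Sorting paths and tags is not required by the format; it only makes the output deterministic and easier to compare across runs.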

Example submission calling formats

 extractFeatures.sh /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 TrainAndClassify.sh /path/to/scratch/folder /path/to/trainListFile.txt /path/to/testListFile.txt /path/to/outputListFile.txt

 extractFeatures.sh /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 TrainAndClassify.sh /path/to/scratch/folder /path/to/hierachy/file /path/to/trainListFile.txt /path/to/testListFile.txt /path/to/outputListFile.txt

 extractFeatures.sh -numThreads 8 /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 TrainAndClassify.sh -numThreads 8 /path/to/scratch/folder /path/to/trainListFile.txt /path/to/testListFile.txt /path/to/outputListFile.txt

 extractFeatures.sh /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 Train.sh /path/to/scratch/folder /path/to/trainListFile.txt 
 Classify.sh /path/to/testListFile.txt /path/to/outputListFile.txt

 myAlgo.sh -extract -numThreads 8 /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 myAlgo.sh -TrainAndClassify -numThreads 8 /path/to/scratch/folder /path/to/trainListFile.txt /path/to/testListFile.txt /path/to/outputListFile.txt

 myAlgo.sh -extract /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 myAlgo.sh -train /path/to/scratch/folder /path/to/trainListFile.txt 
 myAlgo.sh -classify /path/to/testListFile.txt /path/to/outputListFile.txt

Packaging submissions

All submissions should be statically linked to all libraries (the presence of dynamically linked libraries cannot be guaranteed).

All submissions should include a README file containing the following information:

  • Command line calling format for all executables
  • Number of threads/cores used or whether this should be specified on the command line
  • Expected memory footprint
  • Expected runtime
  • Any required environments (and versions) such as Matlab, Java, Python, Bash, Ruby etc.

Time and hardware limits

Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions will be specified.

A hard limit of 24 hours will be imposed on feature extraction times.

A hard limit of 24 hours will be imposed on each training/classification cycle, leading to a total runtime limit of 72 hours across the three cross-validation folds.


Submission opening date

1st August 2008 - provisional

Submission closing date

TBA

Interested participants

If this sounds interesting to you, please leave your name and email. Doing so is not binding in any way.

  • Michael Mandel <mim at ee columbia edu>
  • Thierry Bertin-Mahieux <bertinmt at iro umontreal ca>
  • Grigorios Tsoumakas <greg at csd auth gr>
  • [Your name here]