2009:Audio Tag Classification

Overview

This task will compare various algorithms' abilities to associate tags with 10-second audio clips of songs. The tags come from the MajorMiner game. This task is very much related to the other audio classification tasks. One new twist, however, is that many tags can apply to the same clip, so instead of one N-way classification per clip, this task requires N binary classifications per clip.

Description

The text of this section is copied from the 2008 page. Please add your comments and discussions for 2009. This proposal may be refined based on feedback from the participants.

Audio tag classification was first run at MIREX 2008 2008:Audio_Tag_Classification and as a special MIREX task at 2009 2009:SpecialTagatuneEvaluation .

Below is the task description from MIREX 2008. Please feel free to edit this page.

New Mood Multi Tag Dataset for 2009

This dataset is derived from mood related tags on last.fm. All tags in this set are identified by a general affect lexicon (WordNet-Affect) and by human experts. Similar tags are grouped together to define a mood tag group and each song may belong to multiple mood tag groups.

There are 18 mood tag groups containing 135 unique tags. The dataset contains 3,469 unique songs. The following table lists the tag groups, their member tags and number of songs in each group:

Group id	Tags	num. of tags	num. of songs
G12	calm, comfort, quiet, serene, mellow, chill out, calm down, calming, chillout, comforting, content, cool down, mellow music, mellow rock, peace of mind, quietness, relaxation, serenity, solace, soothe, soothing, still, tranquil, tranquility, tranquility	25	1,680
G15	sad, sadness, unhappy, melancholic, melancholy, feeling sad, mood: sad ΓÇô slightly, sad song	8	1,178
G5	happy, happiness, happy songs, happy music, glad, mood: happy	6	749
G32	romantic, romantic music	2	619
G2	upbeat, gleeful, high spirits, zest, enthusiastic, buoyancy, elation, mood: upbeat	8	543
G16	depressed, blue, dark, depressive, dreary, gloom, darkness, depress, depression, depressing, gloomy	11	471
G28	anger, angry, choleric, fury, outraged, rage, angry music	7	254
G17	grief, heartbreak, mournful, sorrow, sorry, doleful, heartache, heartbreaking, heartsick, lachrymose, mourning, plaintive, regret, sorrowful	14	183
G14	dreamy	1	146
G6	cheerful, cheer up, festive, jolly, jovial, merry, cheer, cheering, cheery, get happy, rejoice, songs that are cheerful, sunny	13	142
G8	brooding, contemplative, meditative, reflective, broody, pensive, pondering, wistful	8	116
G29	aggression, aggressive	2	115
G25	angst, anxiety, anxious, jumpy, nervous, angsty	6	80
G9	confident, encouraging, encouragement, optimism, optimistic	5	61
G7	desire, hope, hopeful, mood: hopeful	4	45
G11	earnest, heartfelt	2	40
G31	pessimism, cynical, pessimistic, weltschmerz, cynical/sarcastic	5	38
G1	excitement, exciting, exhilarating, thrill, ardor, stimulating, thrilling, titillating	8	30
TOTAL		135	6,490

The songs are mostly from the USPOP collection, a detailed breakdown of the songs are listed in the following table:

Collection	num. of songs in the dataset	percentage of songs in the dataset
USPOP	2764	80%
Assorted pop	366	10%
American music	145	4%
Beatles	128	4%
USCRAP	40	1%
Metal music	25	1%
Magnatune	1	0%
TOTAL	3469	100%

Details on how the mood tag groups were derived are described in X. Hu, J. S. Downie, A.Ehmann, Lyric Text Mining in Music Mood Classification, In Proceedings of the 10th International Symposium on Music Information Retrieval (ISMIR), Oct. 2009, Kobe , Japan

Details on how the songs were selected are available in the description.

Discussions for 2009

Your comments here

I don't know where to put this but the Marsyas team plans to participate in this task if it is run this year (George Tzanetakis)

Discussion from 2008

It is possible for each tag to be treated as a completely separate classification problem. It is also possible to present the tags "all at once" for training, but then separately for testing. The former is a subset of the latter, and learning separate classifiers can be done inside any "all at once" classifier. The separate approach, however, has the nice property of being almost identical to the other audio classification tasks.

Possible ways of presenting training tags

One at a time
All at once

Consensus: all at once training

This task could also be run as a retrieval task using a metric like area precision-at-10. It could also be evaluated as a classifier on un-balanced test sets with other metrics like area under the ROC curve or F-measure. The choice of metric would obviously change the types of evaluations that could be performed. The fact that there are no definite negative tags might make an evaluation with many examples more difficult.

Possible evaluation metrics

Classification accuracy on a balanced dataset
Precision-at-K
Area under the ROC curve
F-measure

Consensus: perform both retrieval and classification tasks. See Doug Turnbull's proposal below

Data

All of the data is browseable via the MajorMiner search page.

Music

The music consists of 2300 clips selected at random from 3900 tracks. Each clip is 10 seconds long. The 2300 clips represent a total of 1400 different tracks on 800 different albums by 500 different artists. To give a sense for the music collection, the following genre tags have been applied to these artists, albums, and tracks on Last.fm: electronica, rock, indie, alternative, pop, britpop, idm, new wave, hip-hop, singer-songwriter, trip-hop, post-punk, ambient, jazz.

Evaluation

Participating algorithms will be evaluated with 3-fold cross validation. Artist filtering will be used the test and training splits, I.e. training and test sets will contain different artists. The raw classification accuracy and standard deviation for each tag and each algorithm will be computed.

Beta-Binomial model

In order to make the variance of the accuracy estimates the same for all tags, the same number of test examples must be used. This unnecessarily reduces the amount of test data, a property that can be avoided if we use the beta-binomial empirical Bayes estimator of accuracy. The basic idea of the model is that for each submission, it is possible to intelligently combine the overall performance with performance on each tag in proportion to the number of examples of each tag. Basically, performance on tags with more examples will matter more, and performance on tags with fewer examples will be "shrunk" towards the mean of all of the tags. The wikipedia page is a bit sparse, but slightly informative.

More specifically, the beta-binomial model treats performance on each tag as a binomial random variable, with the parameter of that binomial (the probability of success) drawn from a beta distribution. The parameters of the beta distribution will be estimated and will yield a mean and variance that can be used to compare algorithms. See Chapter 5 of Bayesian Data Analysis by Gelman, Carlin, Stern, and Rubin for even more detail.

[Turnbull] When considering tag-based performance, if we use Area Under the ROC Curve (AUC) instead of classification accuracy, we would not need to use the same number of positive and negative test examples since AUC is a metric is not related to the prior probability of a tag. That is, we can average over AUC value.

A good reference for ROC curves and AUC is: 'ROC Graphs: Notes and Practical Considerations for Data Mining Researchers' Tom Fawcett, HP Labs --Kriswest 08:52, 18 August 2008 (CDT)

[Turnbull] "performance on tags with more examples will matter more" - This is a concern since, in our experience, we have found the tags that are most useful for tag-based retrieval are those that are neither the most common (e.g., generic) tags nor the very rare (e.g., obscure) tags.

[TBM] answer to Turnbull, if we use the top 50 tags from MajorMinor, all tags are kinda common, and there's no very rare ones. The problem comes from machine learning, I would not trust a model trained on less than 50 positive examples, therefore the model's performance on tags with more examples should matter more.

Ranking and significance testing

Additionally, more standard tests could be performed on the average classification accuracy, although the cross-tag variance tends to increase each algorithm's variance, interfering with significance tests without further handling. One such test is Friedman's ANOVA.

We wish to compare a number of treatments/systems (the submissions) over a number of blocks/rows. We can either compute average classification accuracy and/or precision metrics over all the tags and use the cross validation folds as the blocks/rows - which will handle variance between different folds. However, we are more interested in considering each tag (averaged over all folds) or (perhaps better) each tag on each fold as a separate block.

The Friedman test should handle the variance between tags (caused by different difficulties of modeling each tag and different numbers of positive and negative examples per tag) by replacing the actual scores achieved by each system on each block (tag) with the rank achieved by that system on that tag amongst all the systems. Hence, we make the assumption that each tag (or combination of tag and fold) is of equal importance in the evaluation. This is an oft used approach at TREC when considering retrieval results (where each query is of equal importance, but unequal variance/difficulty).

Tukey-Kramer Honestly Significant Difference multiple comparisons are made over the results of Friedman's ANOVA as this (and other tests, such as multiply applied Student's T-tests) can only safely tell you if one system is statistically significantly different from the rest. If you try to do the full NxN comparisons with such tests then the experiment wide alpha value is cumulative over all the tests. E.g. if we compared 12 systems at an alpha level of 0.05, a total of 66 pairwise comparisons are made and the chance of incorrectly rejecting the hypothesis of no difference in error rates is: 1 ΓÇô (0.95^66) = 0.97 = 97%. This explanation is lifted from a paper by Tague-Sutcliffe and Blustein:

 @article{taguesutcliffe1995sat,
   title={A Statistical Analysis of the TREC-3 Data},
   author={Tague-Sutcliffe, J. and Blustein, J.},
   journal={Overview of the Third Text Retrieval Conference (Trec-3)},
   year={1995},
   publisher={DIANE Publishing}
 }

The use of Friedman's ANOVA and Tukey-Kramer HSD was originally proposed by Stephen Downie for the evaluation of the Audio Sim queries - and I've used for the same in my thesis and evaluation of classification results (where it tends to give similar results with other tests such as McNemar's Test). It also comes with nice easy to interpret column rank plots and can be used to put the submissions into equivalence groups, see Downie/M.C.Jones/IMIRSEL's papers on the evaluation of Audio Sim at MIREX 2006:

 @InProceedings{jones2007hsj,
   title={"Human Similarity Judgements: Implications for the Design of Formal Evaluations"},
   author="M.C. Jones and J.S. Downie and A.F. Ehmann",
   BOOKTITLE ="Proceedings of ISMIR  2007 International Society of Music Information Retrieval", 
   year="2007"
 }

--Kriswest 17:54, 23 July 2008 (CDT)

Proposal for three Annotation and Retrieval Sub-tasks

There seem to be to three tasks that are worth considering:

Tag Classification (Annotation) - T binary classification problems (where T is the number of tags.) Given a tag, identify all relevant songs.
Tag Ranking (Annotation) - S ranking problems (where S is the number of songs.) Given a song, find the most salient tags. However, we do not want our system to simply output the most common tags (i.e., the tags with the large prior probability.)
Song Ranking (Retrieval) - T ranking problems. Given a tag, rank order songs based on relevance.

Our research to date has focused on #2 and #3 because we tend to think of a song-tag pair in terms of an affinity score rather than binary class labels. (While some of the more objective tags (e.g., instruments) can be regarded as all-or-nothing, music is subjective, and as such, tends to resist absolutes. See the genre classification literature for more on this topic - Tzanetakis, McKay, etc.) However, for systems that use binary classifiers (e.g., boosted decision stumps, SVMs), #1 and #3 seem like the most natural tasks.

Our proposal for evaluation would be propose to split this into these three separate tasks.

Task #1: For each tag, we fix the proportion of positive and negative test set. Each participant outputs a binary labeling of songs for that tag which assumes an equal prior. The metric for #1 could be based on classification accuracy: a vector of 3 folds x 50 tags = 150-dimensions vector of accuracies. We then use Kris' ANOVA suggestion to determine statistical significance, as well as report mean classification accuracy by taking the arithmetic mean of the 150-dimensional vector.

Task #2: the participant provides an affinity score for each song-tag pair (e.g., a SxT real-valued matrix). For each song, we examine the top N tags (e.g., N = 1, 10, or 25). For each tag that is relevant, we get an "Annotation Score" of 1 / Pr(tag) and 0 otherwise . That is, if the prior probability of a tag is 1/2, the Annotation Score is 2. If the prior probability is 1/10, the Annotation Score is 10. We then average over the the top N tags. The expected value of (average) Annotation Score is 1. If an Annotation Score is greater than 1, it is better than random. We then get a S-dimensional vector of average Annotation Scores which can be used for an ANOVA test and we can report the mean average Annotation Score.

Task #3: the participant provides an affinity socre for each song-tag pair (e.g., a SxT real-valued matrix). For each tag, we rank order the songs and compute the Area under the ROC curve (AUC). The ROC curve is a plot of the true positive rate as a function of the false positive rate as we move down this ranked list of songs. AUC is a good metric because we can average AUC across all the tags since it is not related to the prior probability of a tag. Like tasks #1 & #2, we can report mean AUC and do an ANOVA test.

-- Dougturnbull 18:58, 24 July 2008 (CDT)

Runtime performance

In addition computation times for feature extraction and training/classification will be measured.

Submission format

Submission to this task will have to conform to a specified format detailed below, which is very similar to the audio genre classification task, among others.

Audio formats

Participating algorithms will have to read audio in the following format:

Sample rate: 44 KHz
Sample size: 16 bit
Number of channels: 2 (stereo)
Encoding: WAV (decoded from MP3 files by IMIRSEL)
Duration: 10 second clips

Implementation details

Scratch folders will be provided for all submissions for the storage of feature files and any model files to be produced. Executables will have to accept the path to their scratch folder as a command line parameter. Executables will also have to track which feature files correspond to which audio files internally. To facilitate this process, unique filenames will be assigned to each audio track.

The audio files to be used in the task will be specified in a simple ASCII list file. For feature extraction and classification this file will contain one path per line with no header line. For model training this file will contain one path per line, followed by a tab character and the tag label, again with no header line. Executables will have to accept the path to these list files as a command line parameter. The formats for the list files are specified below.

Algorithms should divide their feature extraction and training/classification into separate executables/scripts. This will facilitate a single feature extraction step for the task, while training and classification can be run for each cross-validation fold.

Multi-processor compute nodes (2, 4 or 8 cores) will be used to run this task. Hence, participants should attempt to use parallelism where-ever possible. Ideally, the number of threads to use should be specified as a command line parameter. Alternatively, implementations may be provided in hard-coded 2, 4 or 8 thread configurations. Single threaded submissions will, of course, be accepted but may be disadvantaged by time constraints.

I/O formats

In this section the input and output files used in this task are described as are the command line calling format requirements for submissions.

Feature extraction list file

The list file passed for feature extraction will be a simple ASCII list file. This file will contain one path per line with no header line.

Training list file

The list file passed for model training will be a simple ASCII list file. This file will contain one path per line, followed by a tab character and a tag label, again with no header line.

E.g.

 <example path and filename>\t<tag classification>\n

In this way, the input file will represent the sparse ground truth matrix. While no line will be duplicated, multiple lines may contain the same path, one for each tag associated with that clip. Any tag that is not specified as applying to a clip does not apply to that clip. The ordering of the lines is arbitrary and should not be depended upon.

Test (classification) list file

The list file passed for testing classification will be a simple ASCII list file identical in format to the Feature extraction list file. This file will contain one path per line with no header line.

Classification output files

Participating algorithms should produce two simple ASCII list files similar in format to the Training list file. The path to which each list file should be written must be accepted as a parameter on the command line.

Tag Affinity file

The first file will contain one path per line, followed by a tab character and the tag label, followed by another tab character and the affinity of that tag for that file, again with no header line.

I.e.:

 <example path and filename>\t<tag classification>\t<affinity>\n

E.g.:

 /data/file1.wav    rock      0.9
 /data/file1.wav    guitar    0.7
 /data/file1.wav    vocal     0.3
 /data/file2.wav    rock      0.5
 ...

In this way, the output file will represent the sparse classification matrix. A path should be repeated on a separate line for each tag that the submission deems applies to it. If a (path, tag) pair is not specified, it will be assumed to have an affinity of 0. The ordering of the lines is not important and can be arbitrary.

The affinity will be used for retrieval evaluation metrics, and its only specification is that for a given tag, larger (closer to +infinity) numbers indicate that the tag is more appropriate to a clip than smaller (closer to -infinity) numbers. As submissions are asked to also return a binary relevance listing, submissions that do not compute an affinity should provide only the binary relevance listing file.

Binary relevance file

The second file to be produced is a binary version of the tag classifications, where a tag must be marked as relevant or not relevant to a track. This file will contain one path per line, followed by a tab character and the tag label, followed by another tab character and either a 1 or a 0 indicating the relevance of that tag for that file, again with no header line.

I.e.:

 <example path and filename>\t<tag classification>\t<relevant? [0 | 1]>\n

E.g.:

 /data/file1.wav    rock      1
 /data/file1.wav    guitar    1
 /data/file1.wav    vocal     0
 /data/file2.wav    rock      1
 ...

If a (path, tag) pair is not specified, it will be assumed to be non-relevant (0). Any line with path but no numerical value will be assumed to be relevant (1).

Hence, the following is equivalent to the example above:

 /data/file1.wav    rock
 /data/file1.wav    guitar
 /data/file2.wav    rock

The ordering of the lines is not important and can be arbitrary.

Example submission calling formats

 extractFeatures.sh /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 TrainAndClassify.sh /path/to/scratch/folder /path/to/trainListFile.txt /path/to/testListFile.txt /path/to/outputAffinityFile.txt /path/to/outputBinaryRelevanceFile.txt

 extractFeatures.sh -numThreads 8 /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 TrainAndClassify.sh -numThreads 8 /path/to/scratch/folder /path/to/trainListFile.txt /path/to/testListFile.txt /path/to/outputAffinityFile.txt /path/to/outputBinaryRelevanceFile.txt

 extractFeatures.sh /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 Train.sh /path/to/scratch/folder /path/to/trainListFile.txt 
 Classify.sh /path/to/scratch/folder /path/to/testListFile.txt /path/to/outputAffinityFile.txt /path/to/outputBinaryRelevanceFile.txt

 myAlgo.sh -extract -numThreads 8 /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 myAlgo.sh -TrainAndClassify -numThreads 8 /path/to/scratch/folder /path/to/trainListFile.txt /path/to/testListFile.txt /path/to/outputAffinityFile.txt /path/to/outputBinaryRelevanceFile.txt

 myAlgo.sh -extract /path/to/scratch/folder /path/to/featureExtractionListFile.txt
 myAlgo.sh -train /path/to/scratch/folder /path/to/trainListFile.txt 
 myAlgo.sh -classify /path/to/scratch/folder /path/to/testListFile.txt /path/to/outputAffinityFile.txt /path/to/outputBinaryRelevanceFile.txt

Packaging submissions

All submissions should be statically linked to all libraries (the presence of dynamically linked libraries cannot be guaranteed).

All submissions should include a README file including the following the information:

Command line calling format for all executables
Number of threads/cores used or whether this should be specified on the command line
Expected memory footprint
Expected runtime
Any required environments (and versions) such as Matlab, Java, Python, Bash, Ruby etc.

Time and hardware limits

Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions will be specified.

A hard limit of 24 hours will be imposed on feature extraction times.

A hard limit of 24 hours will be imposed on each training/classificaiton cycle. Leading to a total runtime limit of 72 hours.

Submission opening date

TBA

Submission closing date

TBA

Interested participants

If this sounds interesting to you, please leave your name and email. Doing so is not binding in any way.

1. Hung-Yi Lo, Ju-Chiang Wang, Hsin-Min Wang (IIS, Academia Sinica, Taiwan), hungyi[at]iis[dot]sinica[dot]edu[dot]tw

2. James Bergstra bergstrj[at]iro[dot]umontreal[dot]ca

3. The Marsyas team. Their comments provided in the 2009 discussion part : I don't know where to put this but the Marsyas team plans to participate in this task if it is run this year (George Tzanetakis)

4. Michael Mandel, Columbia University, mim[at]ee[dot]columbia[dot]edu

5. Matt Hoffma, Princeton University, mdhoffma[at]cs[dot]princeton[dot]edu

6. Juan Jose Burred, Geoffroy Peeters (IRCAM), burred[at]ircam[dot]fr

Tag	Verified	Total	Users
drums	962	3223	127
guitar	845	3204	181
male	724	2452	95
rock	658	2619	198
synth	498	1889	105
electronic	490	1878	131
pop	479	1761	151
bass	417	1632	99
vocal	355	1378	99
female	342	1387	100
dance	322	1244	115
techno	246	943	104
piano	179	826	120
electronica	168	686	67
hip hop	166	701	126
voice	160	790	55
slow	157	727	90
beat	154	708	90
rap	151	723	129
jazz	136	735	154
80s	130	601	94
fast	109	494	70
instrumental	103	539	62
drum machine	89	427	35
british	81	383	60
country	74	360	105
distortion	73	366	55
saxophone	70	316	86
house	65	298	66
ambient	61	335	78
soft	61	351	58
silence	57	200	35
r&b	57	242	59
strings	55	252	62
quiet	54	261	57
solo	53	268	56
keyboard	53	424	41
punk	51	242	76
horns	48	204	38
drum and bass	48	191	50
noise	46	249	61
funk	46	266	90
acoustic	40	193	58
trumpet	39	174	68
end	38	178	36
loud	37	218	62
organ	35	169	46
metal	35	178	64
folk	33	195	58
trance	33	226	49