2009:Audio Tag Classification Tagatune Results

Introduction

This task compares various algorithms' abilities to associate tags with 29-second audio clips of songs. The tags used were collected by the Tagatune game, and algorithms were evaluated both against the previously collected tags (using the same statistical procedures as the other MIREX 2009 tag classification tasks) and in the Tagatune game itself (the Tagatune metric).

What is Tagatune?

Tagatune is a two-player game designed to extract information about music. In each round of the game, the two players are each shown a song; they may be shown the same song or two different songs. Each player describes his given song by typing in any number of tags, which are immediately revealed to the partner. After reviewing each other's tags, the players must each decide whether they have been given the same piece of music as their partner. After both players have voted, the game reveals the true answer (whether the songs given to the pair of players are the same or different) and prepares the next round. Tagatune is live at www.gwap.com

http://www.cs.cmu.edu/~elaw/tagatune.jpg

Since Tagatune is a two-player game, when no partner is available for a player, a bot (a computer program or algorithm) is brought in to play against that player. In each round of the game, the bot generates a set of appropriate tags for a song and reveals these tags to the player. The player then votes same or different by comparing what he is listening to against the tags revealed by his bot partner. If the songs given to the bot and the player are identical, and the tags generated by the bot are accurate for the song, then the player will have a high probability of correctly guessing that the songs are the same. Otherwise, we would expect the player to make more mistakes in this judgment. In short, the hypothesis is that better algorithms generate tags that are more fitting descriptions of songs, which in turn allows players to guess correctly more often.


What is the goal of the MIREX Special Tagatune Evaluation?

The goal of the MIREX Special Tagatune Evaluation competition is to investigate a new method of evaluating music tagging algorithms: using them as bots in Tagatune and measuring the number of mistakes players make in guessing whether they are listening to the same or different songs (we will call this the Tagatune metric) when paired against different algorithm bots. We are particularly interested in whether there is a statistical correlation between the ranking of the algorithms induced by the Tagatune metric and the ranking induced by the classical metrics used in MIREX. For the motivation behind this evaluation, see this paper.

There are three main steps to this evaluation.

Step 1: Algorithm to Tags

All submitted algorithms are trained using the Tagatune training set and tested on the Tagatune test set. Artist filtering was used in the production of the test/train split, i.e. the training and test sets contain different artists. The trained algorithm must generate a set of tags for each of the songs in the test set, and rank the tags in a particular order (e.g. by confidence, saliency, relevance, etc.). This part of the evaluation is very similar, if not identical, to the MIREX 2009 Audio Tag Classification tasks, where two outputs are produced by each algorithm:

  • a set of binary classifications indicating which tags are relevant to each example,
  • a set of 'affinity' scores which indicate the degree to which each tag applies to each track.

These different outputs allow the algorithms to be evaluated both on tag 'classification' and tag 'ranking' (where the tags may be ranked for each track and tracks ranked for each tag).
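To make the two outputs concrete, here is a minimal sketch with toy clip IDs, tags and scores (hypothetical data; the actual submission format is defined by the task runner):

    # Toy illustration of the two outputs each algorithm produces.
    tags = ["classical", "guitar", "no piano", "rock"]

    # Output 1: binary classifications -- which tags apply to each clip.
    binary_output = {
        "clip_001": {"classical": 1, "guitar": 0, "no piano": 1, "rock": 0},
        "clip_002": {"classical": 0, "guitar": 1, "no piano": 0, "rock": 1},
    }

    # Output 2: affinity scores -- the degree to which each tag applies;
    # these allow tags to be ranked per track and tracks per tag.
    affinity_output = {
        "clip_001": {"classical": 0.92, "guitar": 0.10, "no piano": 0.85, "rock": 0.05},
        "clip_002": {"classical": 0.08, "guitar": 0.76, "no piano": 0.20, "rock": 0.66},
    }

    # Ranking the tags for one track by affinity:
    ranked = sorted(affinity_output["clip_001"].items(), key=lambda kv: -kv[1])
    print([t for t, _ in ranked])  # ['classical', 'no piano', 'guitar', 'rock']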

Step 2: Tagatune Experiments

The tags returned as 'relevant' by each algorithm were subsequently displayed to players of Tagatune in an internet-wide experiment. The number of mistakes players made in guessing whether the songs were the same or different was recorded for each algorithm.
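As an illustration of how a per-algorithm score can be derived from the recorded rounds, here is a minimal sketch over hypothetical round records (not the actual experiment code):

    from collections import defaultdict

    # Hypothetical rounds: (algorithm bot, player guessed "same", songs truly the same)
    rounds = [
        ("Mandel", True, True),
        ("Mandel", False, True),
        ("Marsyas", True, True),
        ("Marsyas", False, False),
    ]

    correct = defaultdict(int)
    total = defaultdict(int)
    for algo, guessed_same, truly_same in rounds:
        total[algo] += 1
        correct[algo] += (guessed_same == truly_same)

    # Per-algorithm correctness: fraction of rounds the human partner judged correctly.
    for algo in total:
        print(algo, correct[algo] / total[algo])  # Mandel 0.5, Marsyas 1.0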

Step 3: Ranking

The submitted algorithms were then evaluated by two methods (a sketch comparing the two resulting rankings follows):

(1) ranking using the MIREX metrics

(2) ranking using the Tagatune metric
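One way to quantify the agreement between the two rankings is a rank correlation; a minimal sketch using Spearman's rho over made-up per-algorithm scores (illustrative numbers, not actual results):

    from scipy.stats import spearmanr

    # Hypothetical per-algorithm scores under the two evaluations.
    mirex_f_measure = {"LabX": 0.26, "Mandel": 0.45, "Manzagol": 0.31,
                       "Marsyas": 0.44, "Zhi": 0.29}
    tagatune_correctness = {"LabX": 0.70, "Mandel": 0.93, "Manzagol": 0.79,
                            "Marsyas": 0.91, "Zhi": 0.75}

    algos = sorted(mirex_f_measure)
    rho, p = spearmanr([mirex_f_measure[a] for a in algos],
                       [tagatune_correctness[a] for a in algos])
    print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")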


The Tagatune Dataset

The Tagatune training and test sets consist of 29-second music clips drawn from 6622 tracks, 517 albums and 270 artists. The genres include classical, new age, electronica, rock, pop, world, jazz, blues, metal, punk, etc. Each tag used in the experiments is associated with more than fifty songs, and each song is associated with a tag by more than two players independently. The following table shows the minimum, maximum and average number of songs associated with any tag in the training set, test set and the complete set used in this evaluation.


        Training Set   Test Set   Complete Set
MIN               18         15             50
MAX             2103       3767           5870
AVG              212        288            502


Number of samples in training set: 9598

Number of samples in test set: 13194
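A minimal sketch of how such per-tag counts (the MIN/MAX/AVG above) can be derived from (clip, tag) annotation pairs, using toy data rather than the real annotations:

    from collections import Counter

    # Hypothetical (clip_id, tag) annotation pairs.
    annotations = [("clip_001", "classical"), ("clip_002", "classical"),
                   ("clip_002", "guitar"), ("clip_003", "guitar"),
                   ("clip_004", "guitar")]

    songs_per_tag = Counter(tag for _, tag in annotations)
    counts = list(songs_per_tag.values())
    print("MIN:", min(counts))                # 2
    print("MAX:", max(counts))                # 3
    print("AVG:", sum(counts) / len(counts))  # 2.5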


The following is a list of 160 tags found in the Tagatune dataset.

no voice, singer, duet, hard rock
world, harpsichord, sitar, chorus
female opera, male vocal, vocals, clarinet
heavy, silence, beats, funky
no strings, chimes, foreign, no piano
horns, classical, female, spacey
jazz, guitar, quiet, no beat
banjo, electric, solo, violins
folk, female voice, wind, ambient
new age, synth, funk, no singing
middle eastern, trumpet, percussion, drum
airy, voice, repetitive, birds
strings, bass, harpsicord, medieval
male voice, girl, acoustic, loud
classic, string, drums, electronic
not classical, chanting, no violin, not rock
no guitar, organ, no vocal, talking
choral, weird, opera, fast
electric guitar, male singer, man singing, classical guitar
country, violin, electro, tribal
dark, male opera, no vocals, irish
electronica, horn, operatic, arabic
low, instrumental, trance, chant
strange, heavy metal, modern, bells
man, deep, fast beat, hard
harp, no flute, pop, lute
female vocal, oboe, mellow, orchestral
light, piano, celtic, male vocals
orchestra, eastern, old, flutes
punk, spanish, sad, sax
slow, male, blues, vocal
indian, india, woman, woman singing
rock, dance, piano solo, guitars
no drums, jazzy, singing, cello
calm, female vocals, voices, techno
clapping, house, flute, not opera
not english, oriental, beat, upbeat
soft, noise, choir, female singer
rap, metal, hip hop, water
baroque, women, fiddle, english


NOTE: An interesting effect of Tagatune is that we have collected many negative tags, which indicate either the absence of an instrument (e.g. no piano, no guitar) or a genre the song does not belong to (e.g. not classical, not rock). Participants in this evaluation might want to tailor their algorithms to take advantage of these negative tags, which are not available in the MIREX 2008/2009 datasets.
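For instance, a simple (hypothetical) post-processing rule an algorithm could apply is to keep only the stronger side of each tag/negated-tag pair:

    # Hypothetical pairs of contradictory tags from the vocabulary above.
    NEGATIONS = {"piano": "no piano", "guitar": "no guitar",
                 "classical": "not classical", "rock": "not rock"}

    def resolve_contradictions(scores):
        """Given tag -> affinity scores, drop the weaker side of each
        tag/negated-tag pair (a naive consistency rule, for illustration)."""
        out = dict(scores)
        for pos, neg in NEGATIONS.items():
            if pos in out and neg in out:
                weaker = neg if out[pos] >= out[neg] else pos
                del out[weaker]
        return out

    print(resolve_contradictions({"piano": 0.8, "no piano": 0.3, "rock": 0.1}))
    # {'piano': 0.8, 'rock': 0.1}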


MIREX Statistical Evaluation

Participating algorithms were evaluated over a single artist-filtered test/train split, using both the full test set and the 100 query subset used in the Tagatune evaluation.

Binary (Classification) Evaluation

Algorithms are evaluated on their performance at tag classification using the F-measure. Results are also reported for simple accuracy; however, as this statistic is dominated by negative-example accuracy, it is not a reliable indicator of performance (a system that returns no tags for any example will still achieve a high score on it). Accuracies are nevertheless reported for positive and negative examples separately, as these can help elucidate the behaviour of an algorithm (for example, demonstrating whether a system is under- or over-predicting).
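A minimal sketch of these four statistics for a single example over a toy tag vocabulary (not the official evaluation code), showing how an empty prediction still earns a high simple accuracy:

    def binary_stats(predicted, truth, vocab):
        """predicted, truth: sets of tags drawn from vocab. Returns
        (F-measure, accuracy, positive-example acc., negative-example acc.)."""
        tp = len(predicted & truth)
        fp = len(predicted - truth)
        fn = len(truth - predicted)
        tn = len(vocab) - tp - fp - fn
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0     # positive-example accuracy
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        accuracy = (tp + tn) / len(vocab)
        neg_acc = tn / (tn + fp) if tn + fp else 0.0    # negative-example accuracy
        return f, accuracy, recall, neg_acc

    vocab = {"classical", "guitar", "piano", "rock", "jazz"}  # toy vocabulary
    # A system that returns no tags at all still scores 0.8 simple accuracy here:
    print(binary_stats(set(), {"classical"}, vocab))  # (0.0, 0.8, 0.0, 1.0)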

Affinity (Ranking) Evaluation

Algorithms are evaluated on their performance at tag ranking using the Area Under the Receiver Operating Characteristic Curve (AUC-ROC). The affinity scores for each tag to be applied to a track are sorted prior to the computation of the AUC-ROC statistic, which gives higher scores to ranked tag sets where the correct tags appear towards the top of the set.
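A minimal per-track sketch using scikit-learn's roc_auc_score (assuming scikit-learn is available; the official evaluation used its own implementation):

    from sklearn.metrics import roc_auc_score

    # Hypothetical affinities for one track over a 5-tag vocabulary, with
    # ground-truth relevance (1 = tag applies to the track).
    truth    = [1,    0,    1,    0,    0]     # classical, guitar, no piano, rock, jazz
    affinity = [0.92, 0.10, 0.85, 0.05, 0.30]

    # AUC-ROC rewards rankings that place the correct tags near the top.
    print(roc_auc_score(truth, affinity))      # 1.0 -- both true tags ranked highest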

General Legend

Team ID

Mandel = Michael Mandel
Manzagol = Pierre-Antoine Manzagol
Marsyas = George Tzanetakis
Zhi = Zhi-Sheng Chen
LabX = Anonymous


Results

The following sections detail the evaluation statistics computed. The results of the task are also detailed in the paper Evaluation of Algorithms Using Games: The Case of Music Tagging.

Overall Summary Results (Tagatune)

https://music-ir.org/mirex/2009/results/tagatune/summary_tagatune.csv

Friedman's Test Results

The following table and plot show the results of Friedman's ANOVA with Tukey-Kramer multiple comparisons, computed over the Tagatune metric for each track in the test set. The tags generated by the algorithms were pre-processed to remove redundant or contradictory tags, which is important to maintain a minimum quality for the algorithm bots. This pre-processing was not applied to the data over which the other metrics were computed.
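For reference, the omnibus test is available in scipy (the Tukey-Kramer multiple-comparison step shown in the plots is a separate post-hoc procedure, as in MATLAB's multcompare); a minimal sketch over hypothetical per-track scores:

    from scipy.stats import friedmanchisquare

    # Hypothetical per-track scores: one list per algorithm, aligned by track.
    labx    = [0.20, 0.25, 0.18, 0.30, 0.22]
    mandel  = [0.45, 0.50, 0.40, 0.55, 0.48]
    marsyas = [0.44, 0.47, 0.42, 0.52, 0.46]

    stat, p = friedmanchisquare(labx, mandel, marsyas)
    print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")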

https://music-ir.org/mirex/2009/results/tagatune/tagatune_correctness.friedman.tukeyKramerHSD.csv

https://music-ir.org/mirex/2009/results/tagatune/tagatune_correctness.friedman.tukeyKramerHSD.png

Tagatune Correctness

https://music-ir.org/mirex/2009/results/tagatune/correctness.csv


Overall Summary Results (MIREX Statistical evaluation - Binary)

Full dataset

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/summary_binary.csv

100 query subset used in Tagatune evaluation

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/summary_binary.csv


Binary Relevance F-Measure

Full dataset

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/binary_avg_Fmeasure.csv

100 query subset used in Tagatune evaluation

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/binary_avg_Fmeasure.csv

Binary Accuracy

Full dataset

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/binary_avg_Accuracy.csv

100 query subset used in Tagatune evaluation

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/binary_avg_Accuracy.csv

Positive Example Accuracy

Full dataset

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/binary_avg_positive_example_Accuracy.csv

100 query subset used in Tagatune evaluation

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/binary_avg_positive_example_Accuracy.csv

Negative Example Accuracy

Full dataset

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/binary_avg_negative_example_Accuracy.csv

100 query subset used in Tagatune evaluation

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/binary_avg_negative_example_Accuracy.csv


Overall Summary Results (MIREX Statistical evaluation - Affinity)

Full dataset

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/summary_affinity.csv

100 query subset used in Tagatune evaluation

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/summary_affinity.csv


AUC-ROC Tag

Full dataset

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/affinity_tag_AUC_ROC.csv

100 query subset used in Tagatune evaluation

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/affinity_tag_AUC_ROC.csv


Select Friedman's Test Results

Tag F-measure (Binary) Friedman Test

The following table and plot show the results of Friedman's ANOVA with Tukey-Kramer multiple comparisons, computed over the F-measure for each tag in the test set, averaged over all folds.

Full dataset

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/binary_FMeasure.friedman.tukeyKramerHSD.csv

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/binary_FMeasure.friedman.tukeyKramerHSD.png


100 query subset used in Tagatune evaluation

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/binary_FMeasure.friedman.tukeyKramerHSD.csv

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/binary_FMeasure.friedman.tukeyKramerHSD.png


Per Track F-measure (Binary) Friedman Test

The following table and plot show the results of Friedman's ANOVA with Tukey-Kramer multiple comparisons, computed over the F-measure for each track in the test set, averaged over all folds.

Full dataset

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/binary_FMeasure_per_track.friedman.tukeyKramerHSD.csv

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/binary_FMeasure_per_track.friedman.tukeyKramerHSD.png


100 query subset used in Tagatune evaluation

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/binary_FMeasure_per_track.friedman.tukeyKramerHSD.csv

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/binary_FMeasure_per_track.friedman.tukeyKramerHSD.png


Tag AUC-ROC (Affinity) Friedman Test

The following table and plot show the results of Friedman's ANOVA with Tukey-Kramer multiple comparisons, computed over the Area Under the ROC Curve (AUC-ROC) for each tag in the test set, averaged over all folds.

Full dataset

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/affinity.AUC_ROC_TAG.friedman.tukeyKramerHSD.csv

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/affinity.AUC_ROC_TAG.friedman.tukeyKramerHSD.png


100 query subset used in Tagatune evaluation

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/affinity.AUC_ROC_TAG.friedman.tukeyKramerHSD.csv

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/affinity.AUC_ROC_TAG.friedman.tukeyKramerHSD.png


Per Track AUC-ROC (Affinity) Friedman Test

The following table and plot show the results of Friedman's ANOVA with Tukey-Kramer multiple comparisons, computed over the Area Under the ROC Curve (AUC-ROC) for each track/clip in the test set, averaged over all folds.

Full dataset

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/affinity.AUC_ROC_TRACK.friedman.tukeyKramerHSD.csv

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/affinity.AUC_ROC_TRACK.friedman.tukeyKramerHSD.png


100 query subset used in Tagatune evaluation

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/affinity.AUC_ROC_TRACK.friedman.tukeyKramerHSD.csv

https://music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/affinity.AUC_ROC_TRACK.friedman.tukeyKramerHSD.png


Assorted Results Files for Download

MIREX Statistical Evaluation Results

Full dataset

affinity_tag_fold_AUC_ROC.csv
affinity_clip_AUC_ROC.csv
binary_per_fold_Accuracy.csv
binary_per_fold_Fmeasure.csv
binary_per_fold_negative_example_Accuracy.csv
binary_per_fold_per_track_Accuracy.csv
binary_per_fold_per_track_Fmeasure.csv
binary_per_fold_per_track_negative_example_Accuracy.csv
binary_per_fold_per_track_positive_example_Accuracy.csv
binary_per_fold_positive_example_Accuracy.csv
affinity.PrecisionAt3.friedman.tukeyKramerHSD.csv
affinity.PrecisionAt6.friedman.tukeyKramerHSD.csv
affinity.PrecisionAt9.friedman.tukeyKramerHSD.csv
affinity.PrecisionAt12.friedman.tukeyKramerHSD.csv
affinity.PrecisionAt15.friedman.tukeyKramerHSD.csv

100 query subset used in Tagatune evaluation

affinity_tag_fold_AUC_ROC.csv
affinity_clip_AUC_ROC.csv
binary_per_fold_Accuracy.csv
binary_per_fold_Fmeasure.csv
binary_per_fold_negative_example_Accuracy.csv
binary_per_fold_per_track_Accuracy.csv
binary_per_fold_per_track_Fmeasure.csv
binary_per_fold_per_track_negative_example_Accuracy.csv
binary_per_fold_per_track_positive_example_Accuracy.csv
binary_per_fold_positive_example_Accuracy.csv
affinity.PrecisionAt3.friedman.tukeyKramerHSD.csv
affinity.PrecisionAt6.friedman.tukeyKramerHSD.csv
affinity.PrecisionAt9.friedman.tukeyKramerHSD.csv
affinity.PrecisionAt12.friedman.tukeyKramerHSD.csv
affinity.PrecisionAt15.friedman.tukeyKramerHSD.csv


Friedman's Test Results

Full dataset

eval_T_on_T/affinity.AUC_ROC_TAG.friedman.tukeyKramerHSD.csv
eval_T_on_T/affinity.AUC_ROC_TAG.friedman.tukeyKramerHSD.png
eval_T_on_T/affinity.AUC_ROC_TRACK.friedman.tukeyKramerHSD.csv
eval_T_on_T/affinity.AUC_ROC_TRACK.friedman.tukeyKramerHSD.png
affinity.PrecisionAt3.friedman.tukeyKramerHSD.csv
affinity.PrecisionAt3.friedman.tukeyKramerHSD.png
affinity.PrecisionAt6.friedman.tukeyKramerHSD.csv
affinity.PrecisionAt6.friedman.tukeyKramerHSD.png
affinity.PrecisionAt9.friedman.tukeyKramerHSD.csv
affinity.PrecisionAt9.friedman.tukeyKramerHSD.png
affinity.PrecisionAt12.friedman.tukeyKramerHSD.csv
affinity.PrecisionAt12.friedman.tukeyKramerHSD.png
affinity.PrecisionAt15.friedman.tukeyKramerHSD.csv
affinity.PrecisionAt15.friedman.tukeyKramerHSD.png
binary_Accuracy.friedman.tukeyKramerHSD.csv
binary_Accuracy.friedman.tukeyKramerHSD.png
binary_FMeasure.friedman.tukeyKramerHSD.csv
binary_FMeasure.friedman.tukeyKramerHSD.png
binary_FMeasure_per_track.friedman.tukeyKramerHSD.csv
binary_FMeasure_per_track.friedman.tukeyKramerHSD.png

100 query subset used in Tagatune evaluation

eval_T_on_T_subset/affinity.AUC_ROC_TAG.friedman.tukeyKramerHSD.csv
eval_T_on_T_subset/affinity.AUC_ROC_TAG.friedman.tukeyKramerHSD.png
eval_T_on_T_subset/affinity.AUC_ROC_TRACK.friedman.tukeyKramerHSD.csv
eval_T_on_T_subset/affinity.AUC_ROC_TRACK.friedman.tukeyKramerHSD.png
affinity.PrecisionAt3.friedman.tukeyKramerHSD.csv
affinity.PrecisionAt3.friedman.tukeyKramerHSD.png
affinity.PrecisionAt6.friedman.tukeyKramerHSD.csv
affinity.PrecisionAt6.friedman.tukeyKramerHSD.png
affinity.PrecisionAt9.friedman.tukeyKramerHSD.csv
affinity.PrecisionAt9.friedman.tukeyKramerHSD.png
affinity.PrecisionAt12.friedman.tukeyKramerHSD.csv
affinity.PrecisionAt12.friedman.tukeyKramerHSD.png
affinity.PrecisionAt15.friedman.tukeyKramerHSD.csv
affinity.PrecisionAt15.friedman.tukeyKramerHSD.png
binary_Accuracy.friedman.tukeyKramerHSD.csv
binary_Accuracy.friedman.tukeyKramerHSD.png
binary_FMeasure.friedman.tukeyKramerHSD.csv
binary_FMeasure.friedman.tukeyKramerHSD.png
binary_FMeasure_per_track.friedman.tukeyKramerHSD.csv
binary_FMeasure_per_track.friedman.tukeyKramerHSD.png

Results By Algorithm

(.tgz format)

Full dataset

LabX = Anonymous (https://www.music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/LabX.tgz)
Mandel = Michael Mandel (https://www.music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/Mandel.tgz)
Manzagol = Pierre-Antoine Manzagol (https://www.music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/Manzagol.tgz)
Marsyas = George Tzanetakis (https://www.music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/Marsyas.tgz)
Zhi = Zhi-Sheng Chen (https://www.music-ir.org/mirex/2009/results/tagatune/eval_T_on_T/Zhi.tgz)


100 query subset used in Tagatune evaluation

LabX = Anonymous (https://www.music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/LabX.tgz)
Mandel = Michael Mandel (https://www.music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/Mandel.tgz)
Manzagol = Pierre-Antoine Manzagol (https://www.music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/Manzagol.tgz)
Marsyas = George Tzanetakis (https://www.music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/Marsyas.tgz)
Zhi = Zhi-Sheng Chen (https://www.music-ir.org/mirex/2009/results/tagatune/eval_T_on_T_subset/Zhi.tgz)