2013:Symbolic Melodic Similarity Results

From MIREX Wiki

Revision as of 11:27, 28 October 2013

Introduction

These are the results for the 2013 running of the Symbolic Melodic Similarity task set. For background information about this task set, please refer to the 2013:Symbolic Melodic Similarity page.

Each system was given a query and returned the 10 most melodically similar songs from the Essen Collection (5274 pieces in MIDI format; see the ESAC Data Homepage for more information). For each query, we made four classes of error mutations, so the set comprises the following query classes:

  • 0. No errors
  • 1. One note deleted
  • 2. One note inserted
  • 3. One interval enlarged
  • 4. One interval compressed
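The four mutation classes can be illustrated with a minimal sketch that treats a melody as a plain list of MIDI pitch numbers. This is an assumption: the real queries are MIDI files with rhythm, which the sketch ignores, and all helper names are hypothetical. Changing exactly one interval is modelled by transposing every note after the affected interval, so that the following intervals stay intact; the step size actually used for the 2013 queries is not stated on this page.

```python
from typing import List

# Hypothetical helpers: a melody is a list of MIDI pitch numbers (durations ignored).

def delete_note(melody: List[int], i: int) -> List[int]:
    """Class 1: remove the note at index i."""
    return melody[:i] + melody[i + 1:]

def insert_note(melody: List[int], i: int, pitch: int) -> List[int]:
    """Class 2: insert a new note so that it ends up at index i."""
    return melody[:i] + [pitch] + melody[i:]

def _shift_interval(melody: List[int], i: int, delta: int) -> List[int]:
    """Change the interval between notes i and i+1 by `delta` semitones measured in
    the interval's own direction, transposing all later notes so that no other
    interval changes."""
    interval = melody[i + 1] - melody[i]
    direction = 1 if interval >= 0 else -1  # a repeated note is treated as ascending
    shift = direction * delta
    return melody[:i + 1] + [p + shift for p in melody[i + 1:]]

def enlarge_interval(melody: List[int], i: int, step: int = 1) -> List[int]:
    """Class 3: make the interval starting at note i larger by `step` semitones."""
    return _shift_interval(melody, i, step)

def compress_interval(melody: List[int], i: int, step: int = 1) -> List[int]:
    """Class 4: make the interval starting at note i smaller by `step` semitones."""
    if abs(melody[i + 1] - melody[i]) < step:
        raise ValueError("interval too small to compress by this step")
    return _shift_interval(melody, i, -step)

query = [60, 62, 64, 65, 67]  # C D E F G
print(delete_note(query, 2))        # [60, 62, 65, 67]
print(insert_note(query, 1, 61))    # [60, 61, 62, 64, 65, 67]
print(enlarge_interval(query, 1))   # [60, 62, 65, 66, 68]
print(compress_interval(query, 0))  # [60, 61, 63, 64, 66]
```

Transposing the tail rather than moving a single note keeps each mutation to exactly one changed interval; moving one note alone would alter two adjacent intervals.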

For each query (and its 4 mutations), the returned results (candidates) from all systems were then grouped together (query set) for evaluation by the human graders. The graders heard only the perfect version of the query, against which they evaluated the candidates, and did not know whether a candidate came from a perfect or a mutated query. Each query/candidate set was evaluated by 1 individual grader. Using the Evalutron 6000 system, the graders gave each query/candidate pair two types of scores: 1 categorical score with 3 categories (NS, SS, VS, as explained below) and one fine score (in the range from 0 to 100).
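Grouping the candidates into a query set amounts to pooling: for one query, the candidate lists returned by every system for the perfect version and its four mutations are merged, and each distinct candidate is graded once against the perfect query. A minimal sketch of that step, with hypothetical system names and song IDs:

```python
from typing import Dict, List, Set

def pool_query_set(results: Dict[str, Dict[str, List[str]]]) -> Set[str]:
    """results[system][variant] is one system's ranked list for one variant
    (perfect or mutated) of a single query. Returns the distinct candidates a
    grader sees for this query set; duplicates across systems and variants
    are judged only once."""
    pooled: Set[str] = set()
    for per_variant in results.values():
        for candidates in per_variant.values():
            pooled.update(candidates)
    return pooled

# Hypothetical example: two systems, perfect query q01 and mutation q01_1
# (lists shortened to three items instead of ten).
results = {
    "SYS_A": {"q01": ["s1", "s2", "s3"], "q01_1": ["s1", "s4", "s2"]},
    "SYS_B": {"q01": ["s2", "s5", "s1"], "q01_1": ["s5", "s6", "s1"]},
}
print(sorted(pool_query_set(results)))  # ['s1', 's2', 's3', 's4', 's5', 's6']
```

Pooling of this kind helps explain why the number of unique query/candidate pairs graded is far below 6 queries × 5 variants × 6 systems × 10 candidates = 1800 returned slots.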

Evalutron 6000 Summary Data

Number of evaluators = 6
Number of evaluations per query/candidate pair = 1
Number of queries per grader = 1
Total number of unique query/candidate pairs graded = 388
Average number of query/candidate pairs evaluated per grader: 67
Number of queries = 6 (perfect), each error-mutated 4 different ways, giving 6 × (1 + 4) = 30

General Legend

Sub code Submission name Abstract Contributors
JU1 ShapeH PDF Julián Urbano, Juan Lloréns, Jorge Morato, Sonia Sánchez-Cuadrado
JU2 ShapeL PDF Julián Urbano, Juan Lloréns, Jorge Morato, Sonia Sánchez-Cuadrado
JU3 ShapeG PDF Julián Urbano, Juan Lloréns, Jorge Morato, Sonia Sánchez-Cuadrado
RTBB1 ShapeTime PDF Julián Urbano, Juan Lloréns, Jorge Morato, Sonia Sánchez-Cuadrado
YHKH1 Time PDF Julián Urbano, Juan Lloréns, Jorge Morato, Sonia Sánchez-Cuadrado

Broad Categories

NS = Not Similar
SS = Somewhat Similar
VS = Very Similar

Table Headings

ADR = Average Dynamic Recall
NRGB = Normalized Recall at Group Boundaries
AP = Average Precision (non-interpolated)
PND = Precision at N Documents
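AP, PND and ADR can be made concrete with a short sketch. For AP and PND it assumes a returned list with binary relevance (for example one of the Greater0/Greater1 mappings described under Calculating Summary Measures); for ADR it assumes a ground truth given as an ordered list of groups of equally similar items, following our reading of the measure proposed by Typke et al. How MIREX turned the graders' judgments into these inputs, including the total number of relevant items used for AP, is not stated on this page, so `total_relevant` and the group lists are assumptions. NRGB is left out because its exact definition is not given here.

```python
from typing import List, Sequence, Set

def precision_at_n(relevant: Sequence[bool], n: int) -> float:
    """PND: fraction of the top-n returned candidates that were judged relevant."""
    return sum(relevant[:n]) / n

def average_precision(relevant: Sequence[bool], total_relevant: int) -> float:
    """Non-interpolated AP: precision@k summed over every rank k holding a relevant
    candidate, divided by the total number of relevant items (missed ones add 0)."""
    if total_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for k, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / total_relevant

def average_dynamic_recall(ranking: Sequence[str], groups: List[Set[str]]) -> float:
    """ADR for a ground truth of ordered groups (order inside a group is free).
    n = number of ground-truth items. At each position i <= n the 'allowed' set is
    the union of the first c groups, c being the fewest groups holding >= i items;
    r_i = |ranking[:i] & allowed| / i and ADR is the mean of r_1..r_n."""
    n = sum(len(g) for g in groups)
    if n == 0:
        return 0.0
    allowed: Set[str] = set()
    covered = 0  # ground-truth items in the groups merged into `allowed`
    next_group = 0
    total = 0.0
    for i in range(1, n + 1):
        while covered < i:  # add groups until an ideal ranking could fill position i
            allowed |= groups[next_group]
            covered += len(groups[next_group])
            next_group += 1
        total += len(set(ranking[:i]) & allowed) / i
    return total / n

rel = [True, False, True, False]
print(precision_at_n(rel, 4))     # 0.5
print(average_precision(rel, 2))  # (1/1 + 2/3) / 2, about 0.8333
print(average_dynamic_recall(["b", "a", "c"], [{"a"}, {"b", "c"}]))  # (0 + 1 + 1) / 3
```

ADR differs from AP in that swapping two items inside the same group costs nothing, while placing an item from a later group too early is penalized.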

Calculating Summary Measures

Fine(1) = Sum of fine-grained human similarity decisions (0-100).
PSum(1) = Sum of human broad similarity decisions: NS=0, SS=1, VS=2.
WCSum(1) = 'World Cup' scoring: NS=0, SS=1, VS=3 (rewards Very Similar).
SDSum(1) = 'Stephen Downie' scoring: NS=0, SS=1, VS=4 (strongly rewards Very Similar).
Greater0(1) = NS=0, SS=1, VS=1 (binary relevance judgment).
Greater1(1) = NS=0, SS=0, VS=1 (binary relevance judgment using only Very Similar).

(1) Normalized to the range 0 to 1.
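The mappings above can be applied to one system's graded candidates with a minimal sketch. It assumes the judgments are available as hypothetical (broad category, fine score) pairs. Values in the results tables such as PSum = 1.37 and Fine = 63.5 look like per-candidate averages rather than sums normalized to 0 to 1, so the sketch reports averages by default and divides by each scheme's maximum only when `normalize=True`; which of the two MIREX applied is an assumption here.

```python
from typing import Dict, List, Tuple

# Category weights from "Calculating Summary Measures" (NS, SS, VS).
SCHEMES: Dict[str, Dict[str, int]] = {
    "PSum": {"NS": 0, "SS": 1, "VS": 2},
    "WCSum": {"NS": 0, "SS": 1, "VS": 3},
    "SDSum": {"NS": 0, "SS": 1, "VS": 4},
    "Greater0": {"NS": 0, "SS": 1, "VS": 1},
    "Greater1": {"NS": 0, "SS": 0, "VS": 1},
}

def summary_measures(judgments: List[Tuple[str, float]],
                     normalize: bool = False) -> Dict[str, float]:
    """judgments: (broad category, fine score 0-100) for each graded candidate of
    one system. Returns the per-candidate mean of every measure; with
    normalize=True each mean is divided by its largest possible value."""
    n = len(judgments)
    fine = sum(score for _, score in judgments) / n
    out = {"Fine": fine / 100 if normalize else fine}
    for name, weights in SCHEMES.items():
        mean = sum(weights[cat] for cat, _ in judgments) / n
        out[name] = mean / max(weights.values()) if normalize else mean
    return out

graded = [("VS", 90), ("SS", 60), ("NS", 10), ("VS", 80)]  # hypothetical judgments
print(summary_measures(graded))
# {'Fine': 60.0, 'PSum': 1.25, 'WCSum': 1.75, 'SDSum': 2.25, 'Greater0': 0.75, 'Greater1': 0.5}
```

The schemes differ only in how much a VS judgment outweighs an SS one, which is why systems with many VS candidates gain most under SDSum.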

Summary Results

Overall Scores (Includes Perfect and Error Candidates)

SCORE DB1 ULMS1 ULMS2 ULMS3 ULMS4 ULMS5
ADR 0.0033 0.6085 0.4830 0.5416 0.6706 0.6567
NRGB 0.0049 0.5339 0.4275 0.4707 0.5788 0.5670
AP 0.0014 0.5316 0.2728 0.4175 0.5414 0.4872
PND 0.0067 0.5243 0.3271 0.4464 0.5161 0.4865
Fine 25.1533 62.9367 49.56 54.5733 63.5167 62.6133
PSum 0.32667 1.36 0.93333 1.1633 1.37 1.3267
WCSum 0.34667 1.8867 1.1733 1.5967 1.9067 1.8267
SDSum 0.36667 2.4133 1.4133 2.03 2.4433 2.3267
Greater0 0.30667 0.83333 0.69333 0.73 0.83333 0.82667
Greater1 0.02 0.52667 0.24 0.43333 0.53667 0.5

Scores by Query Error Types

No Errors

SCORE DB1 ULMS1 ULMS2 ULMS3 ULMS4 ULMS5
ADR 0.0000 0.6201 0.5108 0.5760 0.6555 0.6464
NRGB 0.0000 0.5177 0.4471 0.4843 0.5389 0.5422
AP 0.0023 0.5165 0.2701 0.4372 0.5199 0.4814
PND 0.0000 0.5357 0.3196 0.4899 0.5077 0.4786
Fine 25.9833 65.85 50.2167 56.8833 66.7667 63.9167
PSum 0.33333 1.45 0.95 1.2167 1.4667 1.3667
WCSum 0.35 2.0167 1.2 1.6833 2.05 1.8833
SDSum 0.36667 2.5833 1.45 2.15 2.6333 2.4
Greater0 0.31667 0.88333 0.7 0.75 0.88333 0.85
Greater1 0.016667 0.56667 0.25 0.46667 0.58333 0.51667

Note Deletions

SCORE DB1 ULMS1 ULMS2 ULMS3 ULMS4 ULMS5
ADR 0.0000 0.6258 0.5599 0.5526 0.7231 0.6998
NRGB 0.0000 0.5605 0.5031 0.4754 0.6467 0.6068
AP 0.0000 0.5943 0.3646 0.3955 0.6309 0.5168
PND 0.0000 0.5841 0.3865 0.4238 0.5867 0.4756
Fine 24.05 67.2667 50.3167 52.5167 68.2667 63.45
PSum 0.26667 1.45 0.95 1.1167 1.4833 1.35
WCSum 0.26667 2.0167 1.2 1.5333 2.0833 1.85
SDSum 0.26667 2.5833 1.45 1.95 2.6833 2.35
Greater0 0.26667 0.88333 0.7 0.7 0.88333 0.85
Greater1 0 0.56667 0.25 0.41667 0.6 0.5

Note Insertions

SCORE DB1 ULMS1 ULMS2 ULMS3 ULMS4 ULMS5
ADR 0.0000 0.6066 0.4451 0.5154 0.6623 0.6439
NRGB 0.0000 0.5332 0.3777 0.4751 0.5687 0.5476
AP 0.0000 0.5314 0.2225 0.4254 0.5281 0.4953
PND 0.0000 0.4946 0.2714 0.4780 0.4917 0.4679
Fine 24.3667 63.8167 47.35 57.8833 65.0833 62.75
PSum 0.31667 1.3833 0.86667 1.2667 1.4 1.3167
WCSum 0.33333 1.9167 1.0833 1.7333 1.9333 1.8333
SDSum 0.35 2.45 1.3 2.2 2.4667 2.35
Greater0 0.3 0.85 0.65 0.8 0.86667 0.8
Greater1 0.016667 0.53333 0.21667 0.46667 0.53333 0.51667

Enlarged Intervals

SCORE DB1 ULMS1 ULMS2 ULMS3 ULMS4 ULMS5
ADR 0.0000 0.5970 0.4818 0.5452 0.6584 0.6576
NRGB 0.0000 0.5390 0.4286 0.4778 0.5622 0.5783
AP 0.0000 0.5270 0.2676 0.3973 0.5244 0.4769
PND 0.0000 0.5212 0.3450 0.3937 0.4878 0.5265
Fine 25.75 57.1 49.55 51.1667 55.9167 59.75
PSum 0.33333 1.2167 0.93333 1.05 1.1833 1.25
WCSum 0.35 1.6833 1.1667 1.4333 1.6333 1.7167
SDSum 0.36667 2.15 1.4 1.8167 2.0833 2.1833
Greater0 0.31667 0.75 0.7 0.66667 0.73333 0.78333
Greater1 0.016667 0.46667 0.23333 0.38333 0.45 0.46667

Compressed Intervals

SCORE DB1 ULMS1 ULMS2 ULMS3 ULMS4 ULMS5
ADR 0.0164 0.5929 0.4172 0.5188 0.6539 0.6357
NRGB 0.0243 0.5192 0.3809 0.4407 0.5776 0.5600
AP 0.0049 0.4887 0.2392 0.4319 0.5035 0.4655
PND 0.0333 0.4857 0.3127 0.4468 0.5063 0.4841
Fine 25.6167 60.65 50.3667 54.4167 61.55 63.2
PSum 0.38333 1.3 0.96667 1.1667 1.3167 1.35
WCSum 0.43333 1.8 1.2167 1.6 1.8333 1.85
SDSum 0.48333 2.3 1.4667 2.0333 2.35 2.35
Greater0 0.33333 0.8 0.71667 0.73333 0.8 0.85
Greater1 0.05 0.5 0.25 0.43333 0.51667 0.5

Friedman Test with Multiple Comparisons Results (p=0.05)

The Friedman test was run in MATLAB against the Fine summary data over the 30 queries.
Command: [c,m,h,gnames] = multcompare(stats, 'ctype', 'tukey-kramer','estimate', 'friedman', 'alpha', 0.05);

Query DB1 ULMS1 ULMS2 ULMS3 ULMS4 ULMS5
q01 40.2 72.9 60.1 60.1 72.9 57.2
q01_1 29.1 72.9 54.2 57.5 72.9 57.2
q01_2 34.3 62 56.1 60.4 55.4 54.6
q01_3 35.5 57.1 56.1 55.8 50 55.2
q01_4 43 55.9 56.1 51.3 55.8 47.9
q02 7.7 48.4 44 55.2 48.4 59.8
q02_1 8.5 54.2 50.5 40.3 54.2 56.6
q02_2 7.5 49.6 35.9 52.8 49.6 56.4
q02_3 8 51.1 44 52.4 51.1 53.5
q02_4 7.8 48.4 44 54.6 48.4 59.8
q03 22 60.5 42.5 54 60.5 55.5
q03_1 25 63.5 42.5 48 63.5 50.5
q03_2 26 60.5 39.5 56 60.5 55.5
q03_3 21 60.5 42.5 54 60.5 55.5
q03_4 19.3 60.5 42.5 54 60.5 55.5
q04 21.2 65.1 37.3 35.1 65.1 65.8
q04_1 19.5 62.9 37.3 30.3 62.9 65.8
q04_2 20 57.1 37.3 40.1 65.8 65.8
q04_3 26.6 56.9 37.3 21 56.9 51.5
q04_4 16.8 60.4 39.2 34.9 60.4 65.8
q05 31.3 77.2 45.9 67.9 82.7 82.7
q05_1 32.2 79.1 45.9 69 85.1 85.1
q05_2 31.9 77.2 43.8 69 82.7 82.7
q05_3 28.9 46 45.9 59.8 46 80.8
q05_4 33.3 77.2 48.9 64.7 82.7 82.7
q06 33.5 71 71.5 69 71 62.5
q06_1 30 71 71.5 70 71 65.5
q06_2 26.5 76.5 71.5 69 76.5 61.5
q06_3 34.5 71 71.5 64 71 62
q06_4 33.5 61.5 71.5 67 61.5 67.5
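To see where the Mean column of the multiple-comparison table comes from, the sketch below re-ranks the Fine scores above within each query (1 = lowest score, tied scores share their average rank), averages the ranks per system, and computes the Friedman statistic. It is a plain-Python approximation, not the MIREX MATLAB script: the statistic is the textbook version without a tie correction, and the Tukey-Kramer intervals are not reproduced. With this data the mean-rank difference between ULMS4 and DB1 comes out at 3.55, the value reported in the table.

```python
from typing import Dict, List

SYSTEMS = ["DB1", "ULMS1", "ULMS2", "ULMS3", "ULMS4", "ULMS5"]

# Fine scores per query, copied from the table above (columns follow SYSTEMS).
FINE: Dict[str, List[float]] = {
    "q01": [40.2, 72.9, 60.1, 60.1, 72.9, 57.2], "q01_1": [29.1, 72.9, 54.2, 57.5, 72.9, 57.2],
    "q01_2": [34.3, 62, 56.1, 60.4, 55.4, 54.6], "q01_3": [35.5, 57.1, 56.1, 55.8, 50, 55.2],
    "q01_4": [43, 55.9, 56.1, 51.3, 55.8, 47.9], "q02": [7.7, 48.4, 44, 55.2, 48.4, 59.8],
    "q02_1": [8.5, 54.2, 50.5, 40.3, 54.2, 56.6], "q02_2": [7.5, 49.6, 35.9, 52.8, 49.6, 56.4],
    "q02_3": [8, 51.1, 44, 52.4, 51.1, 53.5], "q02_4": [7.8, 48.4, 44, 54.6, 48.4, 59.8],
    "q03": [22, 60.5, 42.5, 54, 60.5, 55.5], "q03_1": [25, 63.5, 42.5, 48, 63.5, 50.5],
    "q03_2": [26, 60.5, 39.5, 56, 60.5, 55.5], "q03_3": [21, 60.5, 42.5, 54, 60.5, 55.5],
    "q03_4": [19.3, 60.5, 42.5, 54, 60.5, 55.5], "q04": [21.2, 65.1, 37.3, 35.1, 65.1, 65.8],
    "q04_1": [19.5, 62.9, 37.3, 30.3, 62.9, 65.8], "q04_2": [20, 57.1, 37.3, 40.1, 65.8, 65.8],
    "q04_3": [26.6, 56.9, 37.3, 21, 56.9, 51.5], "q04_4": [16.8, 60.4, 39.2, 34.9, 60.4, 65.8],
    "q05": [31.3, 77.2, 45.9, 67.9, 82.7, 82.7], "q05_1": [32.2, 79.1, 45.9, 69, 85.1, 85.1],
    "q05_2": [31.9, 77.2, 43.8, 69, 82.7, 82.7], "q05_3": [28.9, 46, 45.9, 59.8, 46, 80.8],
    "q05_4": [33.3, 77.2, 48.9, 64.7, 82.7, 82.7], "q06": [33.5, 71, 71.5, 69, 71, 62.5],
    "q06_1": [30, 71, 71.5, 70, 71, 65.5], "q06_2": [26.5, 76.5, 71.5, 69, 76.5, 61.5],
    "q06_3": [34.5, 71, 71.5, 64, 71, 62], "q06_4": [33.5, 61.5, 71.5, 67, 61.5, 67.5],
}

def average_ranks(row: List[float]) -> List[float]:
    """Rank one query's scores (1 = lowest); tied scores share their average rank."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and row[order[end + 1]] == row[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1  # 0-based positions -> 1-based rank
        start = end + 1
    return ranks

def mean_ranks(table: Dict[str, List[float]]) -> Dict[str, float]:
    """Average within-query rank of each system over all queries."""
    ranked = [average_ranks(row) for row in table.values()]
    return {s: sum(r[j] for r in ranked) / len(ranked) for j, s in enumerate(SYSTEMS)}

def friedman_statistic(table: Dict[str, List[float]]) -> float:
    """Textbook Friedman chi-square, no tie correction, k-1 degrees of freedom:
    12n / (k(k+1)) * sum_j (mean_rank_j - (k+1)/2)^2."""
    n, k = len(table), len(SYSTEMS)
    return 12 * n / (k * (k + 1)) * sum(
        (r - (k + 1) / 2) ** 2 for r in mean_ranks(table).values())

mr = mean_ranks(FINE)
for system in sorted(mr, key=mr.get, reverse=True):
    print(f"{system}: {mr[system]:.4f}")
print(f"ULMS4 - DB1: {mr['ULMS4'] - mr['DB1']:.4f}")  # 3.5500, as in the table
print(f"Friedman chi-square: {friedman_statistic(FINE):.2f} (5% critical value, 5 df: 11.07)")
```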

TeamID TeamID Lowerbound Mean Upperbound Significance
ULMS4 ULMS1 -1.3747 -0.0167 1.3414 FALSE
ULMS4 ULMS5 -1.0914 0.2667 1.6247 FALSE
ULMS4 ULMS3 -0.1247 1.2333 2.5914 FALSE
ULMS4 ULMS2 0.1086 1.4667 2.8247 TRUE
ULMS4 DB1 2.1919 3.5500 4.9081 TRUE
ULMS1 ULMS5 -1.0747 0.2833 1.6414 FALSE
ULMS1 ULMS3 -0.1081 1.2500 2.6081 FALSE
ULMS1 ULMS2 0.1253 1.4833 2.8414 TRUE
ULMS1 DB1 2.2086 3.5667 4.9247 TRUE
ULMS5 ULMS3 -0.3914 0.9667 2.3247 FALSE
ULMS5 ULMS2 -0.1581 1.2000 2.5581 FALSE
ULMS5 DB1 1.9253 3.2833 4.6414 TRUE
ULMS3 ULMS2 -1.1247 0.2333 1.5914 FALSE
ULMS3 DB1 0.9586 2.3167 3.6747 TRUE
ULMS2 DB1 0.7253 2.0833 3.4414 TRUE
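In the table above, each row compares two systems: Mean is the difference in mean Friedman rank (first TeamID minus second), and Lowerbound/Upperbound are the Tukey-Kramer interval around it. A pair is marked significant exactly when that interval excludes zero. A small check of that rule against the rows above (the tuple layout is only for illustration):

```python
from typing import List, Tuple

# (team1, team2, lower, mean_rank_diff, upper, significant), copied from the table above.
ROWS: List[Tuple[str, str, float, float, float, bool]] = [
    ("ULMS4", "ULMS1", -1.3747, -0.0167, 1.3414, False),
    ("ULMS4", "ULMS5", -1.0914, 0.2667, 1.6247, False),
    ("ULMS4", "ULMS3", -0.1247, 1.2333, 2.5914, False),
    ("ULMS4", "ULMS2", 0.1086, 1.4667, 2.8247, True),
    ("ULMS4", "DB1", 2.1919, 3.5500, 4.9081, True),
    ("ULMS1", "ULMS5", -1.0747, 0.2833, 1.6414, False),
    ("ULMS1", "ULMS3", -0.1081, 1.2500, 2.6081, False),
    ("ULMS1", "ULMS2", 0.1253, 1.4833, 2.8414, True),
    ("ULMS1", "DB1", 2.2086, 3.5667, 4.9247, True),
    ("ULMS5", "ULMS3", -0.3914, 0.9667, 2.3247, False),
    ("ULMS5", "ULMS2", -0.1581, 1.2000, 2.5581, False),
    ("ULMS5", "DB1", 1.9253, 3.2833, 4.6414, True),
    ("ULMS3", "ULMS2", -1.1247, 0.2333, 1.5914, False),
    ("ULMS3", "DB1", 0.9586, 2.3167, 3.6747, True),
    ("ULMS2", "DB1", 0.7253, 2.0833, 3.4414, True),
]

def interval_excludes_zero(lower: float, upper: float) -> bool:
    """Significant at the chosen alpha when the whole interval lies on one side of 0."""
    return lower > 0 or upper < 0

significant = [(a, b) for a, b, lo, _, hi, _ in ROWS if interval_excludes_zero(lo, hi)]
print(significant)
```

Because every pair is compared on the same 30 queries, all fifteen intervals share the same half-width (about 1.358 rank units), so here a pair is significant exactly when its mean-rank difference exceeds that amount.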

[Figure: Friedman test multiple comparisons of the Fine scores]