University of Illinois Graduate School of Library and Information Science ISRL

2005 MIREX Contest Results - Audio Melody Extraction (Contest wiki)


Goal: To extract melodic content from polyphonic audio.

Dataset: 25 phrase excerpts of 10-40 sec from the following genres: Rock, R&B, Pop, Jazz, Solo classical piano

Rank  Participant           A. Voicing  B. Voicing   C. Voicing  D. Raw Pitch  E. Raw Chroma  F. Overall  Runtime  Machine
                            Detection   False Alarm  d-prime     Accuracy      Accuracy       Accuracy    (s)
----  --------------------  ----------  -----------  ----------  ------------  -------------  ----------  -------  -------
  1   Dressler, K.          81.8%       17.3%        1.85        68.1%         71.4%          71.4%       32       R
  2   Ryynänen & Klapuri    90.3%       39.5%        1.56        68.6%         74.1%          64.3%       10970    L
  3   Poliner & Ellis       91.6%       42.7%        1.56        67.3%         73.4%          61.1%       5471     B0
  3   Paiva, R. 2           68.8%       23.2%        1.22        58.5%         62.0%          61.1%       45618    Y
  5   Marolt, M.            72.7%       32.4%        1.06        60.1%         67.1%          59.5%       12461    F
  6   Paiva, R. 1           83.4%       55.8%        0.83        62.7%         66.7%          57.8%       44312    G
  7   Goto, M.              99.9% *     99.4% *      0.59 *      65.8%         71.8%          49.9% *     211      F
  8   Vincent & Plumbley 1  96.1% *     93.7% *      0.23 *      59.8%         67.6%          47.9% *     ?        G
  9   Vincent & Plumbley 2  99.6% *     96.4% *      0.86 *      59.6%         71.1%          46.4% *     251      G
 10   Brossier, P.          99.2% *†    98.8% *†     0.14 *†     3.9% †        8.1% †         3.2% *†     41       B0

Notes:
Bold numbers are the best in each column
* Goto, Vincent, and Brossier did not perform voiced/unvoiced detection, so the starred results cannot be meaningfully compared to those of the other systems. Their voicing rates are not exactly 100% because, for a few files, the pitch tracking terminated early and the remainder of the file was padded with zeros, producing a small number of no-voicing frames.
† Scores for Brossier are artificially low due to an unresolved algorithmic issue.


Explanation of Statistics

The task consists of two parts: Voicing detection (deciding whether a particular time frame contains a "melody pitch" or not), and pitch detection (deciding the most likely melody pitch for each time frame). We structured the submission to allow these parts to be done independently, i.e. it was possible (via a negative pitch value) to guess a pitch even for frames that were being judged unvoiced.

So consider a matrix of the per-frame voiced (Ground Truth or Detected values != 0) and unvoiced (GT, Det == 0) results, where the counts are:

                            Detected
                       unvx    vx    sum
                     ---------------------
 Ground  unvoiced  |   TN   |  FP  |  GU
 Truth   voiced    |   FN   |  TP  |  GV
                     ---------------------
              sum      DU      DV     TO

TP ("true positives", frames where the voicing was correctly detected) further breaks down into pitch correct and pitch incorrect, say TP = TPC + TPI

Similarly, the ability to record pitch guesses even for frames judged unvoiced breaks down FN ("false negatives", frames which were actually pitched but detected as unpitched) into pitch correct and pitch incorrect, say FN = FNC + FNI

In both these cases, we can also count the number of times the chroma was correct, i.e. ignoring octave errors, say TP = TPCch + TPIch and FN = FNCch + FNIch.
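As a concrete illustration of these definitions, here is a minimal sketch (not the official MIREX scoring code) of how the counts could be tallied from per-frame ground-truth and detected frequencies in Hz, where 0 means unvoiced and a negative detected value is a pitch guess for a frame the system judged unvoiced. The quarter-tone tolerance is taken as ±50 cents, and chroma comparison folds the error to within one octave.

```python
import math

def frame_counts(gt, det):
    """Tally the voicing/pitch count matrix from per-frame frequencies (Hz)."""
    c = dict(TN=0, FP=0, FN=0, TP=0, TPC=0, FNC=0, TPCch=0, FNCch=0)
    for g, d in zip(gt, det):
        if g == 0:                          # truly unvoiced frame
            c["TN" if d <= 0 else "FP"] += 1
            continue
        voiced = d > 0                      # system's voicing decision
        c["TP" if voiced else "FN"] += 1
        if d == 0:
            continue                        # no pitch guess recorded
        cents = 1200 * math.log2(abs(d) / g)           # signed error in cents
        pitch_ok = abs(cents) <= 50                    # within +/- 1/4 tone
        chroma_ok = abs((cents + 600) % 1200 - 600) <= 50  # ignore octaves
        key = "TPC" if voiced else "FNC"
        if pitch_ok:
            c[key] += 1
        if chroma_ok:
            c[key + "ch"] += 1
    c["GV"] = c["TP"] + c["FN"]
    c["GU"] = c["TN"] + c["FP"]
    c["TO"] = c["GV"] + c["GU"]
    return c
```

For example, a truly voiced frame at 220 Hz detected as 440 Hz counts toward TP and TPCch (an octave error) but not TPC.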

To assess the voicing detection portion, we use the standard tools of detection theory. Statistic A, Voicing Detection, is the probability that a frame which is truly voiced is labeled as voiced, i.e. TP/GV (also known as the "hit rate").

Statistic B, Voicing False Alarm, is the probability that a frame which is not actually voiced is nonetheless labeled as voiced, i.e. FP/GU.

Statistic C, Voicing d-prime, is a measure of the sensitivity of the detector that attempts to factor out the overall bias towards labeling any frame as voiced (which can move both hit rate and false alarm rate up and down in tandem). It converts the hit rate and false alarm into standard deviations away from the mean of an equivalent Gaussian distribution, and reports the difference between them. A larger value indicates a detection scheme with better discrimination between the two classes.
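In code, d-prime is simply the difference of the inverse Gaussian CDF applied to the two rates. A minimal sketch using the standard library (Python 3.8+):

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """d' = Z(hit rate) - Z(false alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Dressler's reported rates (A = 81.8%, B = 17.3%) reproduce the table value:
print(round(d_prime(0.818, 0.173), 2))  # → 1.85
```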

For the voicing detection, we pooled the frames from all excerpts to get an overall frame-level voicing detection performance. Because some excerpts had no unvoiced frames, averaging over the excerpts gave some misleading results.

Now we move on to the actual pitch detection. Statistic D, Raw Pitch Accuracy is the probability of a correct pitch value (to within ± ¼ tone) given that the frame is indeed pitched. This includes the pitch guesses for frames that were judged unvoiced i.e. (TPC + FNC)/GV.

Similarly, Statistic E, Raw Chroma Accuracy, is the probability that the chroma (i.e. the note name) is correct over the voiced frames. This ignores errors where the pitch is wrong by an exact multiple of an octave (octave errors). It is (TPCch + FNCch)/GV.

Finally, Statistic F, Overall Accuracy, combines both the voicing detection and the pitch estimation to give the proportion of frames that were correctly labeled with both pitch and voicing, i.e. (TPC + TN)/TO.

When averaging the pitch statistics, we calculated the performance for each of the 25 excerpts individually, then report the average of these measures. This helps increase the effective weight of some of the minority genres, which had shorter excerpts.
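Putting statistics A through F together, here is a hedged sketch computing all six from a dict of counts using the names defined above (the voicing statistics would use counts pooled over all excerpts, while the pitch statistics would be averaged over per-excerpt values, as described):

```python
from statistics import NormalDist

def melody_statistics(c):
    """Statistics A-F from a dict of counts (TN, FP, FN, TP, TPC, ...)."""
    z = NormalDist().inv_cdf
    A = c["TP"] / c["GV"]                     # voicing detection (hit rate)
    B = c["FP"] / c["GU"]                     # voicing false alarm
    C = z(A) - z(B)                           # voicing d-prime
    D = (c["TPC"] + c["FNC"]) / c["GV"]       # raw pitch accuracy
    E = (c["TPCch"] + c["FNCch"]) / c["GV"]   # raw chroma accuracy
    F = (c["TPC"] + c["TN"]) / c["TO"]        # overall accuracy
    return dict(A=A, B=B, C=C, D=D, E=E, F=F)
```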

Statistical Significance

In a comparative evaluation of this kind, a critical question is whether an observed difference in the performance of two systems is statistically significant, or whether two systems with equal underlying error rates would show this level of difference due only to random variations in successive measurements. One typical approach is to adopt a binomial model, where each system is assumed to have a fixed probability of making an error on each trial, estimated by the proportion of errors made on the test set. Under the binomial model, the variance of any count is proportional to the size of the count, thus the standard error declines as the square root of the number of (independent) trials.

A more precise approach is McNemar's test, which considers only trials in which two systems make different predictions (since no information on their difference is available from trials in which they report the same outcome). Some results from McNemar's test are presented below.
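As a sketch of the idea (not the exact implementation used here), McNemar's test can be computed from the two discordant counts alone, with b the number of trials only system 1 gets right and c the number only system 2 gets right, using the continuity-corrected chi-square statistic and the one-degree-of-freedom tail probability:

```python
import math

def mcnemar_p(b, c):
    """Two-sided p-value of McNemar's test from discordant counts b and c."""
    if b + c == 0:
        return 1.0                      # systems never disagree
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # P(X >= chi2) for a chi-square with 1 d.o.f., via the erfc identity
    return math.erfc(math.sqrt(chi2 / 2))
```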

The problem with these significance tests is that they rely on the trials being independent. Our counts are based on 10 ms frames, which are likely to be highly dependent on their neighbors. One way to ensure trials which are more independent would be to space them more widely in time, for instance by recording the result from a 10 ms frame only every 250 ms. (This number comes from the idea that an average musical note might last around 250 ms, so this level of subsampling would tend to pick samples from different notes, which are closer to independent trials.) If we actually measured results only on these subselected samples, we would expect to see the same overall error rates, but the total count of trials would be many fewer.

Very roughly speaking, the 25 test signals of around 30 s each might contain around 2000 independent trials. Under this number of trials, at an accuracy of 60-70%, systems whose performance differs by less than about 2.5% would not be statistically different at the 5% level (using a simplistic, one-tailed binomial test). Thus, in the above table, the top two systems are significantly different from the others, but the next three are in a statistical dead heat.
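That 2.5% figure can be checked back-of-envelope: with n ≈ 2000 independent trials and per-frame accuracy p ≈ 0.65, a one-tailed test at the 5% level (z = 1.645) can only distinguish differences in proportion larger than z·sqrt(2·p·(1−p)/n):

```python
import math

n, p, z = 2000, 0.65, 1.645   # trials, accuracy, one-tailed 5% critical value
threshold = z * math.sqrt(2 * p * (1 - p) / n)
print(f"{threshold:.1%}")  # → 2.5%
```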

Raw data

For flexibility in calculating other results, we are including the raw values that we used to calculate the statistics above. Because the data are broken down by track, there is a lot of it, so we are making it available as a Microsoft Excel workbook, with one sheet for each entry.

Original statistics results

Below are the original statistics reported for the task, for historical purposes. The important numbers are the same as the table above.

Columns 1-9 are, in order, the nine statistics defined below (from "Average correctly transcribed voiced and unvoiced portions" through "Average correctly transcribed voiced and unvoiced portions mapped to one octave").

Rank  Participant              1       2       3       4       5       6       7       8       9       Runtime (s)  Machine
----  -----------------------  ------  ------  ------  ------  ------  ------  ------  ------  ------  -----------  -------
  1   Dressler, K.             70.78%  67.48%  70.83%  80.20%  78.34%  81.48%  67.48%  70.83%  73.61%  32           R
  2   Ryynänen & Klapuri       63.94%  65.99%  69.85%  81.87%  70.22%  73.85%  68.20%  73.70%  67.33%  10970        L
  3   Paiva, R. 2              60.70%  58.16%  61.56%  70.92%  70.23%  74.06%  58.16%  61.56%  63.79%  45618        Y
  4   Poliner & Ellis          60.61%  62.47%  67.95%  83.02%  66.18%  71.10%  66.75%  72.86%  65.14%  5471         B0
  5   Marolt, M.               59.18%  58.06%  63.13%  73.35%  67.53%  72.24%  59.76%  66.73%  63.25%  12461        F
  6   Paiva, R. 1              57.32%  62.23%  66.20%  74.76%  65.42%  69.32%  62.23%  66.20%  60.76%  44312        G
  7   Goto, M.                 49.68%  65.58%  71.47%  77.18%  56.46%  61.99%  65.58%  71.47%  54.89%  211          F
  8   Vincent, E.              45.98%  59.17%  70.64%  77.61%  51.98%  62.36%  59.17%  70.64%  55.52%  251          G
  9   Brossier, P. (see note)  3.2%    3.93%   8.06%   76.61%  3.63%   7.27%   3.93%   8.06%   6.43%   41           B0

The statistics are:

1. Average Correctly transcribed voiced and unvoiced portions
= (TPC + TN) / TO (i.e. proportion of frames with voicing and pitch right)

2. Averaged correctly transcribed voiced instants
= TPC / GV (proportion of truly-voiced frames with vx, pitch right)

3. Averaged correctly transcribed voiced instants mapped to one octave
= TPCch / GV (statistic 2 without octave errors)

4. Average estimation of temporal boundaries of the melodic segments
= (TP + TN) / TO (overall voicing detection frame accuracy)

5. Average F-measure
Kris defined PRECIS = (TPC + TN)/(TP + TN + FP)
and RECALL = (TPC + TN)/(TP + TN + FN), thus
= 2*PRECIS*RECALL/(PRECIS + RECALL) (a hybrid of pitch and voicing)

6. Average F-measure mapped to one octave
[as above with TPC replaced by TPCch]

7. Average Correctly transcribed instants (ignoring segmentation errors)
= (TPC + FNC) / GV (pitch accuracy on pitched frames ignoring voicing)

8. Average Correctly transcribed instants mapped to one octave (ignoring
segmentation errors)
= (TPCch + FNCch) / GV (statistic 7 without octave errors)

9. Average Correctly transcribed voiced and unvoiced portions mapped to
one Octave
= (TPCch + TN) / TO (statistic 1 without octave errors)
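For the less obvious hybrid F-measure (statistic 5), a minimal sketch following Kris's definitions exactly, using a dict of counts named as in the text:

```python
def f_measure(c):
    """Statistic 5: hybrid pitch/voicing F-measure from Kris's PRECIS/RECALL."""
    precis = (c["TPC"] + c["TN"]) / (c["TP"] + c["TN"] + c["FP"])
    recall = (c["TPC"] + c["TN"]) / (c["TP"] + c["TN"] + c["FN"])
    return 2 * precis * recall / (precis + recall)
```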



McNemar's Test Results

The values below are the probability that the two algorithms have the same underlying error rate. Note: results below 5% indicate significant differences; results of 1% or less indicate highly significant differences.


(Column headings abbreviate the row labels, in the same order.)

                      R&K   Goto  Bros  Vinc  P&E   Pv.2  Pv.1  Mar.  Dre.
Ryynänen & Klapuri    n/a
Goto, M.              0%    n/a
Brossier, P.          0%    0%    n/a
Vincent, E.           0%    0%    0%    n/a
Poliner & Ellis       0%    0%    0%    0%    n/a
Paiva, R. (2)         0%    0%    0%    0%    0%    n/a
Paiva, R. (1)         0%    0%    0%    0%    0%    0%    n/a
Marolt, M.            0%    0%    0%    0%    0%    2.71% 0%    n/a
Dressler, K.          0%    0%    0%    0%    0%    0%    0%    0%    n/a


Content prepared by Dan Ellis based on information provided by Kris West.
Maintained by :J Stephen Downie
Comments to : jdownie at uiuc dot edu
Last modified: 28 September 2005