2005:Audio Melody Extraction Results

From MIREX Wiki


These are the results for the 2005 running of the Audio Melody Extraction task set.


The goal of the task was to extract the melodic content from polyphonic audio.


The test data consisted of 25 phrase excerpts of 10-40 s each, drawn from the following genres: Rock, R&B, Pop, Jazz, and solo classical piano.


Rank  Participant           A. Voicing  B. Voicing   C. Voicing  D. Raw Pitch  E. Raw Chroma  F. Overall  Runtime (s)  Machine
                            Detection   False Alarm  d-prime     Accuracy      Accuracy       Accuracy
   1  Dressler, K.          81.8%       17.3%        1.85        68.1%         71.4%          71.4%             32     R
   2  Ryynänen & Klapuri    90.3%       39.5%        1.56        68.6%         74.1%          64.3%          10970     L
   3  Poliner & Ellis       91.6%       42.7%        1.56        67.3%         73.4%          61.1%           5471     B 0
   3  Paiva, R. 2           68.8%       23.2%        1.22        58.5%         62.0%          61.1%          45618     Y
   5  Marolt, M.            72.7%       32.4%        1.06        60.1%         67.1%          59.5%          12461     F
   6  Paiva, R. 1           83.4%       55.8%        0.83        62.7%         66.7%          57.8%          44312     G
   7  Goto, M.              99.9% *     99.4% *      0.59 *      65.8%         71.8%          49.9% *          211     F
   8  Vincent & Plumbley 1  96.1% *     93.7% *      0.23 *      59.8%         67.6%          47.9% *            ?     G
   9  Vincent & Plumbley 2  99.6% *     96.4% *      0.86 *      59.6%         71.1%          46.4% *          251     G
  10  Brossier, P.          99.2% *†    98.8% *†     0.14 *†      3.9% †        8.1% †         3.2% *†          41     B 0

Notes: Bold numbers are the best in each column

  • Goto, Vincent, and Brossier did not perform voiced/unvoiced detection, so the starred results cannot be meaningfully compared to those of the other systems. Their voicing rates are not exactly 100% because, for a few files, pitch tracking terminated early and the remainder of the excerpt was padded with zeros, producing a small number of no-voicing frames.

† Scores for Brossier are artificially low due to an unresolved algorithmic issue.

[Image: 2006 melody.jpg]

Explanation of Statistics

The task consists of two parts: Voicing detection (deciding whether a particular time frame contains a "melody pitch" or not), and pitch detection (deciding the most likely melody pitch for each time frame). We structured the submission to allow these parts to be done independently, i.e. it was possible (via a negative pitch value) to guess a pitch even for frames that were being judged unvoiced.
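A minimal sketch of how that submission convention can be decoded per frame, assuming one numeric value per time frame (the function name is illustrative; the sign convention follows the text):

```python
# Sketch of the negative-pitch convention described above:
#   value > 0  ->  frame judged voiced, pitch is value (Hz)
#   value < 0  ->  frame judged unvoiced, but the pitch guess is |value| (Hz)
#   value == 0 ->  frame judged unvoiced, no pitch guess
def decode_frame(value):
    """Return (voiced_flag, pitch_guess_hz_or_None) for one frame value."""
    if value > 0:
        return True, value
    if value < 0:
        return False, -value
    return False, None
```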

So consider a matrix of the per-frame voiced (Ground Truth or Detected values != 0) and unvoiced (GT, Det == 0) results, where the counts are:

                      Detected
                      unvoiced   voiced   sum
  Ground   unvoiced   TN         FP       GU
  Truth    voiced     FN         TP       GV

TP ("true positives", frames where the voicing was correctly detected) further breaks down into pitch correct and pitch incorrect, say TP = TPC + TPI.

Similarly, because pitch guesses are recorded even for frames judged unvoiced, FN ("false negatives", frames which were actually pitched but detected as unpitched) breaks down into pitch correct and pitch incorrect, say FN = FNC + FNI.

In both cases, we can also count the number of times the chroma was correct, i.e. ignoring octave errors, say TP = TPCch + TPIch and FN = FNCch + FNIch.
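The "pitch correct" and "chroma correct" judgments can be sketched as follows, assuming pitches are reported in Hz and that "correct" means within ± a quarter tone (0.5 semitone); the folding-to-one-octave step is what forgives octave errors (function names are illustrative):

```python
import math

def semitone_error(est_hz, ref_hz):
    """Signed distance in semitones between estimated and reference pitch."""
    return 12.0 * math.log2(est_hz / ref_hz)

def pitch_correct(est_hz, ref_hz, tol=0.5):
    """Raw pitch: within +/- a quarter tone (0.5 semitone) of the reference."""
    return abs(semitone_error(est_hz, ref_hz)) <= tol

def chroma_correct(est_hz, ref_hz, tol=0.5):
    """Chroma: as above, but errors of an exact number of octaves are forgiven."""
    err = semitone_error(est_hz, ref_hz)
    folded = (err + 6.0) % 12.0 - 6.0   # fold the error into [-6, +6) semitones
    return abs(folded) <= tol
```

For example, an estimate one octave above the reference fails the raw pitch test but passes the chroma test.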

To assess the voicing detection portion, we use the standard tools of detection theory. Statistic A, Voicing Detection is the probability that a frame which is truly voiced is labeled as voiced i.e. TP/GV (also known as "hit rate").

Statistic B, Voicing False Alarm, is the probability that a frame which is not actually voiced is nonetheless labeled as voiced, i.e. FP/GU.

Statistic C, Voicing d-prime, is a measure of the sensitivity of the detector that attempts to factor out the overall bias towards labeling any frame as voiced (which can move both hit rate and false alarm rate up and down in tandem). It converts the hit rate and false alarm into standard deviations away from the mean of an equivalent Gaussian distribution, and reports the difference between them. A larger value indicates a detection scheme with better discrimination between the two classes.
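Statistics A-C can be sketched with the standard library, assuming pooled frame counts as defined in the matrix above (the function name is illustrative):

```python
from statistics import NormalDist

def voicing_stats(TP, FP, FN, TN):
    """Statistics A-C from pooled per-frame counts (names as in the text)."""
    GV = TP + FN             # truly voiced frames
    GU = TN + FP             # truly unvoiced frames
    hit_rate = TP / GV       # Statistic A, Voicing Detection
    false_alarm = FP / GU    # Statistic B, Voicing False Alarm
    z = NormalDist().inv_cdf # convert probabilities to Gaussian z-scores
    d_prime = z(hit_rate) - z(false_alarm)   # Statistic C, Voicing d-prime
    return hit_rate, false_alarm, d_prime
```

With counts matching Dressler's rates (81.8% hit, 17.3% false alarm), this reproduces the table's d-prime of about 1.85.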

For the voicing detection, we pooled the frames from all excerpts to get an overall frame-level voicing detection performance. Because some excerpts had no unvoiced frames, averaging over the excerpts gave some misleading results.

Now we move on to the actual pitch detection. Statistic D, Raw Pitch Accuracy is the probability of a correct pitch value (to within ± ¼ tone) given that the frame is indeed pitched. This includes the pitch guesses for frames that were judged unvoiced i.e. (TPC + FNC)/GV.

Similarly, Statistic E, Raw Chroma Accuracy, is the probability that the chroma (i.e. the note name) is correct over the voiced frames. This ignores errors where the pitch is wrong by an exact multiple of an octave (octave errors). It is (TPCch + FNCch)/GV.

Finally, Statistic F, Overall Accuracy, combines both the voicing detection and the pitch estimation to give the proportion of frames that were correctly labeled with both pitch and voicing, i.e. (TPC + TN)/TO.
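Statistics D-F follow directly from the counts above; this sketch assumes the per-frame counts have already been tallied (the function name is illustrative):

```python
def pitch_stats(TPC, TPI, FNC, FNI, TPCch, FNCch, TN, FP):
    """Statistics D-F from the per-frame counts defined in the text."""
    TP = TPC + TPI
    FN = FNC + FNI
    GV = TP + FN               # truly voiced frames
    TO = GV + TN + FP          # all frames
    raw_pitch = (TPC + FNC) / GV        # Statistic D, Raw Pitch Accuracy
    raw_chroma = (TPCch + FNCch) / GV   # Statistic E, Raw Chroma Accuracy
    overall = (TPC + TN) / TO           # Statistic F, Overall Accuracy
    return raw_pitch, raw_chroma, overall
```

Note that D and E credit correct pitch guesses even on frames the system judged unvoiced (the FNC and FNCch terms), while F requires both voicing and pitch to be right.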

When averaging the pitch statistics, we calculated the performance for each of the 25 excerpts individually, then reported the average of these measures. This increases the effective weight of the minority genres, which had shorter excerpts.

Statistical Significance

In a comparative evaluation of this kind, a critical question is whether an observed difference in the performance of two systems is statistically significant, or whether two systems with equal underlying error rates would show this level of difference due only to random variations in successive measurements. One typical approach is to adopt a binomial model, where each system is assumed to have a fixed probability of making an error on each trial, estimated by the proportion of errors made on the test set. Under the binomial model, the variance of any count is proportional to the size of the count, thus the standard error declines as the square root of the number of (independent) trials.

A more precise approach is McNemar's test, which considers only trials in which two systems make different predictions (since no information on their difference is available from trials in which they report the same outcome). Some results from McNemar's test are presented below.
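McNemar's test can be sketched in its exact binomial form (the original evaluation may have used the chi-square approximation; the function below is illustrative). Under the null hypothesis that the two systems have the same error rate, the trials where they disagree split 50/50 between them:

```python
from math import comb

def mcnemar_exact_p(b, c):
    """Exact two-sided McNemar p-value.

    b: trials where system 1 was wrong but system 2 was right;
    c: trials where system 2 was wrong but system 1 was right.
    Trials where both systems agree carry no information and are ignored.
    """
    n = b + c
    k = min(b, c)
    # Probability of a split at least this lopsided under Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

An even 5/5 split gives no evidence of a difference, while a 0/20 split is highly significant.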

The problem with these significance tests is that they rely on the trials being independent. Our counts are based on 10 ms frames, which are likely to be highly dependent on their neighbors. One way to ensure trials which are more independent would be to space them more widely in time, for instance by recording the result from a 10 ms frame only every 250 ms. (This number comes from the idea that an average musical note might last around 250 ms, so this level of subsampling would tend to pick samples from different notes, which are closer to independent trials.) If we actually measured results only on these subselected samples, we would expect to see the same overall error rates, but the total count of trials would be many fewer.

Very roughly speaking, the 25 test signals of around 30 s each might contain around 2000 independent trials. Under this number of trials, at an accuracy of 60-70%, systems whose performance differs by less than about 2.5% would not be statistically different at the 5% level (using a simplistic, one-tailed binomial test). Thus, in the above table, the top two systems are significantly different from the others, but the next three are in a statistical dead heat.
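The ~2.5% figure can be reproduced as a back-of-envelope calculation under the stated assumptions (about 2000 independent trials, accuracy around 65%, one-tailed test at the 5% level):

```python
from math import sqrt

# Standard error of the difference of two independent proportions, each
# estimated from n trials at success probability p: sqrt(2 * p * (1-p) / n).
n, p = 2000, 0.65
se_diff = sqrt(2 * p * (1 - p) / n)

# A one-tailed test at the 5% level rejects at about 1.645 standard errors,
# so differences smaller than this are not statistically distinguishable.
threshold = 1.645 * se_diff
```

This gives a threshold of roughly 0.025, i.e. about 2.5 percentage points.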

Raw data

For flexibility in calculating other results, we are including the raw values that we used to calculate the statistics above. Because the data are broken down by track, it's a lot of data. Thus we are making it available as a Microsoft Excel Workbook, with one sheet for each entry.

Original statistics results

Below are the original statistics reported for the task, for historical purposes. The important numbers are the same as the table above.

The column headings (1)-(9) correspond to the numbered statistics defined below.

Rank          Participant         (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)     Runtime (s)  Machine
1             Dressler, K.        70.78%  67.48%  70.83%  80.20%  78.34%  81.48%  67.48%  70.83%  73.61%      32       R
2             Ryynänen & Klapuri  63.94%  65.99%  69.85%  81.87%  70.22%  73.85%  68.20%  73.70%  67.33%   10970       L
3             Paiva, R. 2         60.70%  58.16%  61.56%  70.92%  70.23%  74.06%  58.16%  61.56%  63.79%   45618       Y
4             Poliner & Ellis     60.61%  62.47%  67.95%  83.02%  66.18%  71.10%  66.75%  72.86%  65.14%    5471       B 0
5             Marolt, M.          59.18%  58.06%  63.13%  73.35%  67.53%  72.24%  59.76%  66.73%  63.25%   12461       F
6             Paiva, R. 1         57.32%  62.23%  66.20%  74.76%  65.42%  69.32%  62.23%  66.20%  60.76%   44312       G
7             Goto, M.            49.68%  65.58%  71.47%  77.18%  56.46%  61.99%  65.58%  71.47%  54.89%     211       F
8             Vincent, E.         45.98%  59.17%  70.64%  77.61%  51.98%  62.36%  59.17%  70.64%  55.52%     251       G
9 (see note)  Brossier, P.         3.2%    3.93%   8.06%  76.61%   3.63%   7.27%   3.93%   8.06%   6.43%      41       B 0

The statistics are:

1. Average Correctly transcribed voiced and unvoiced portions

  = (TPC + TN) / TO   (i.e. proportion of frames with voicing and pitch right)

2. Averaged correctly transcribed voiced instants

  = TPC / GV          (proportion of truly-voiced frames with vx, pitch right)

3. Averaged correctly transcribed voiced instants mapped to one octave

  = TPCch / GV        (statistic 2 without octave errors)

4. Average estimation of temporal boundaries of the melodic segments

  = (TP + TN) / TO    (overall voicing detection frame accuracy) 

5. Average F-measure

  Kris defined PRECIS = (TPC + TN)/(TP + TN + FP)
           and RECALL = (TPC + TN)/(TP + TN + FN), thus 
  = 2*PRECIS*RECALL/(PRECIS + RECALL)   (a hybrid of pitch and voicing)

6. Average F-measure mapped to one octave

  [as above with TPC replaced by TPCch]

7. Average Correctly transcribed instants (ignoring segmentation errors)

  = (TPC + FNC) / GV  (pitch accuracy on pitched frames ignoring voicing)

8. Average Correctly transcribed instants mapped to one octave (ignoring segmentation errors)

  = (TPCch + FNCch) / GV  (statistic 7 without octave errors)

9. Average Correctly transcribed voiced and unvoiced portions mapped to one octave

  = (TPCch + TN) / TO     (statistic 1 without octave errors)
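The nine statistics above can be collected into a single sketch, assuming the per-frame counts defined earlier have been tallied (the function name is illustrative):

```python
def original_stats(TPC, TPI, FNC, FNI, TPCch, FNCch, FP, TN):
    """Statistics 1-9 above, keyed by number, from the per-frame counts."""
    TP, FN = TPC + TPI, FNC + FNI
    GV = TP + FN            # truly voiced frames
    TO = GV + FP + TN       # all frames

    def fmeasure(tpc):
        # Precision/recall as defined for statistic 5 (a hybrid of pitch
        # and voicing); statistic 6 substitutes the chroma-correct count.
        precis = (tpc + TN) / (TP + TN + FP)
        recall = (tpc + TN) / (TP + TN + FN)
        return 2 * precis * recall / (precis + recall)

    return {
        1: (TPC + TN) / TO,
        2: TPC / GV,
        3: TPCch / GV,
        4: (TP + TN) / TO,
        5: fmeasure(TPC),
        6: fmeasure(TPCch),
        7: (TPC + FNC) / GV,
        8: (TPCch + FNCch) / GV,
        9: (TPCch + TN) / TO,
    }
```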

McNemar's Test Results

Each cell gives the probability, under McNemar's test, that the two algorithms have the same underlying error rate. Note: results below 5% indicate significant differences; results of 1% or less indicate highly significant differences.

(Columns follow the same order as the rows, abbreviated.)

                    R&K    Goto   Bross  Vinc   P&E    Pai.2  Pai.1  Maro   Dres
Ryynänen & Klapuri  n/a
Goto, M.            0%     n/a
Brossier, P.        0%     0%     n/a
Vincent, E.         0%     0%     0%     n/a
Poliner & Ellis     0%     0%     0%     0%     n/a
Paiva, R. (2)       0%     0%     0%     0%     0%     n/a
Paiva, R. (1)       0%     0%     0%     0%     0%     0%     n/a
Marolt, M.          0%     0%     0%     0%     0%     2.71%  0%     n/a
Dressler, K.        0%     0%     0%     0%     0%     0%     0%     0%     n/a