2014:Audio Chord Estimation Results

Introduction

This page contains the results of the new evaluations described below for the Isophonics dataset, also known as the MIREX 2009 dataset. It comprises the collected Beatles, Queen, and Zweieck datasets from Queen Mary, University of London, and has been used for audio chord estimation in MIREX for many years.

Why evaluate differently?

Researchers interested in automatic chord estimation have been dissatisfied with the traditional evaluation techniques used for this task at MIREX.

Numerous alternatives have been proposed in the literature (Harte, 2010; Mauch, 2010; Pauwels & Peeters, 2013).

At ISMIR 2010 in Utrecht, a group discussed alternatives and developed the Utrecht Agreement for updating the task, but until this year, nobody had implemented any of the suggestions.

What’s new?

More precise recall estimation

MIREX typically uses chord symbol recall (CSR) to estimate how well the predicted chords match the ground truth: the total duration of segments where the predictions match the ground truth divided by the total duration of the song.

In previous years, MIREX has used an approximate CSR by sampling both the ground-truth and the automatic annotations every 10 ms.

Following Harte (2010), we instead view the ground-truth and estimated annotations as continuous segmentations of the audio, because this is both (1) more precise and (2) more computationally efficient.
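As an illustration of the segment-based view, the following minimal Python sketch computes CSR directly from interval overlaps rather than from 10 ms samples. It is not the MIREX evaluation code: the `Segment` tuple layout and the function name are assumptions made for this example, labels are compared by plain string equality (the real evaluation compares labels after mapping them to a chord vocabulary), and the song duration is taken to be the total length of the reference annotation.

```python
from typing import List, Tuple

# Hypothetical layout: (start_seconds, end_seconds, chord_label)
Segment = Tuple[float, float, str]


def chord_symbol_recall(reference: List[Segment], estimate: List[Segment]) -> float:
    """Duration where the labels agree, divided by the reference duration."""
    matched = 0.0
    for ref_start, ref_end, ref_label in reference:
        for est_start, est_end, est_label in estimate:
            # Length of the intersection of the two intervals (<= 0 if disjoint).
            overlap = min(ref_end, est_end) - max(ref_start, est_start)
            if overlap > 0 and ref_label == est_label:
                matched += overlap
    total = sum(end - start for start, end, _ in reference)
    return matched / total if total > 0 else 0.0


reference = [(0.0, 2.0, "C:maj"), (2.0, 4.0, "G:maj")]
estimate = [(0.0, 1.5, "C:maj"), (1.5, 4.0, "G:maj")]
print(chord_symbol_recall(reference, estimate))  # 0.875
```

The estimate changes chord half a second early, so 3.5 of the 4 seconds match. No sampling grid is involved, which is why the result does not depend on a frame size.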

Moreover, because pieces of music come in a wide variety of lengths, we believe it is better to weight each song's CSR by the length of the song when averaging over a dataset. This final number is referred to as the weighted chord symbol recall (WCSR).
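The effect of the weighting can be seen in a small Python sketch. The function name and the `(csr, duration)` pair layout are assumptions made for this example, and the numbers are invented purely for illustration.

```python
from typing import List, Tuple


def weighted_csr(songs: List[Tuple[float, float]]) -> float:
    """songs holds (csr, duration_seconds) pairs; returns the length-weighted mean CSR."""
    total_duration = sum(duration for _, duration in songs)
    if total_duration == 0:
        return 0.0
    return sum(csr * duration for csr, duration in songs) / total_duration


# Illustrative values only: a short song transcribed well, a long one transcribed poorly.
songs = [(0.9, 120.0), (0.5, 360.0)]
unweighted = sum(csr for csr, _ in songs) / len(songs)  # about 0.7
weighted = weighted_csr(songs)                          # about 0.6
print(unweighted, weighted)
```

Because each CSR is a matched duration divided by a song duration, the weighted mean equals the total matched duration divided by the total duration of the dataset.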

Advanced chord vocabularies

We computed WCSR with five different chord vocabulary mappings:

Chord root note only;
Major and minor;
Seventh chords;
Major and minor with inversions; and
Seventh chords with inversions.

With the exception of no-chords, calculating the vocabulary mapping involves examining the root note, the bass note, and the relative interval structure of the chord labels.

A mapping exists if both the root notes and bass notes match, and the structure of the output label is the largest possible subset of the input label given the vocabulary.

For instance, in the major and minor case, G:7(#9) is mapped to G:maj because the interval set of G:maj, {1, 3, 5}, is a subset of the interval set of G:7(#9), {1, 3, 5, b7, #9}. In the seventh-chord case, G:7(#9) is mapped to G:7 instead, because the interval set of G:7, {1, 3, 5, b7}, is also a subset of that of G:7(#9) but is larger than that of G:maj.
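The largest-subset rule can be sketched in Python as follows. This is only an illustration of the rule as described above, not the evaluation framework's implementation: the vocabulary dictionaries and the function name are hypothetical, the seventh-chord vocabulary shown here is an assumed one, and the matching of root and bass notes is left out.

```python
from typing import Dict, FrozenSet, Optional

# Hypothetical vocabularies: quality name -> interval set relative to the root.
MAJMIN: Dict[str, FrozenSet[str]] = {
    "maj": frozenset({"1", "3", "5"}),
    "min": frozenset({"1", "b3", "5"}),
}
SEVENTHS: Dict[str, FrozenSet[str]] = {
    **MAJMIN,
    "7": frozenset({"1", "3", "5", "b7"}),
    "maj7": frozenset({"1", "3", "5", "7"}),
    "min7": frozenset({"1", "b3", "5", "b7"}),
}


def map_quality(intervals: FrozenSet[str],
                vocabulary: Dict[str, FrozenSet[str]]) -> Optional[str]:
    """Return the vocabulary quality whose interval set is the largest subset
    of `intervals`, or None if no quality in the vocabulary fits."""
    candidates = [(name, ivs) for name, ivs in vocabulary.items() if ivs <= intervals]
    if not candidates:
        return None
    return max(candidates, key=lambda item: len(item[1]))[0]


g7_sharp9 = frozenset({"1", "3", "5", "b7", "#9"})
print(map_quality(g7_sharp9, MAJMIN))    # maj
print(map_quality(g7_sharp9, SEVENTHS))  # 7
```

A label such as a suspended chord, whose intervals contain neither the major nor the minor triad, returns `None` here, meaning it has no mapping in that vocabulary.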

Our recommendations are motivated by the frequencies of chord qualities in the Billboard corpus of American popular music (Burgoyne et al., 2011).

Most Frequent Chord Qualities in the *Billboard* Corpus
Quality	Freq. (%)	Cum. Freq. (%)
maj	52	52
min	13	65
7	10	75
min7	8	83
maj7	3	86

Evaluation of segmentation

The chord transcription literature includes several other evaluation metrics, which mainly focus on the segmentation of the transcription.

We propose to include the directional Hamming distance in the evaluation. The directional Hamming distance is calculated by finding for each annotated segment the maximally overlapping segment in the other annotation, and then summing the differences (Abdallah et al., 2005; Mauch, 2010).

Depending on the order of application, the directional Hamming distance yields a measure of over- or under-segmentation. To keep the scaling consistent with WCSR values (1.0 is best and 0.0 is worst), we report 1 – over-segmentation and 1 – under-segmentation, as well as the harmonic mean of these values (cf. Harte, 2010).
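A minimal Python sketch of this measure is given below, assuming each annotation is a list of `(start, end)` intervals in seconds and that the distance is normalized by the total duration of the first annotation. The function name is hypothetical, and which argument order corresponds to over-segmentation and which to under-segmentation is an assumption of this sketch rather than something taken from the evaluation code.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def directional_hamming(a: List[Interval], b: List[Interval]) -> float:
    """For each segment in `a`, find its best-overlapping segment in `b` and
    sum the uncovered remainder; the sum is normalized by the duration of `a`."""
    total = sum(end - start for start, end in a)
    if total <= 0:
        return 0.0
    missed = 0.0
    for a_start, a_end in a:
        best = max(
            (min(a_end, b_end) - max(a_start, b_start) for b_start, b_end in b),
            default=0.0,
        )
        missed += (a_end - a_start) - max(best, 0.0)
    return missed / total


reference = [(0.0, 4.0), (4.0, 8.0)]
estimate = [(0.0, 2.0), (2.0, 4.0), (4.0, 8.0)]  # splits the first segment in two

# Assumed direction: iterating over reference segments penalizes extra boundaries.
over = 1 - directional_hamming(reference, estimate)   # 0.75
under = 1 - directional_hamming(estimate, reference)  # 1.0
combined = 2 * over * under / (over + under)          # harmonic mean, about 0.857
print(over, under, combined)
```

In this toy case the estimate has one boundary too many, so only the first score drops, while the second stays at 1.0 because every estimated segment lies entirely inside a single reference segment.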

Software

All software used for the evaluation has been made open-source. The evaluation framework is described by Pauwels and Peeters (2013). The corresponding code repository can be found on GitHub, and the measures used here are available as presets. The raw algorithmic output provided below also makes it possible to calculate the other measures from the paper (separate results for tetrads, etc.) that are not reported on this page. More help can be found in the readme.

The statistical comparison between the different submissions is explained in Burgoyne et al. (2014). The software is available at BitBucket. It uses the detailed results provided below as input.

Submissions

Submission	Abstract	Contributors
KO1 (shineChords)	PDF	Maksim Khadkevich, Maurizio Omologo
CM3 (Chordino)	PDF	Chris Cannam, Matthias Mauch
JR2	PDF	Jean-Baptiste Rolland

Results

Summary

All figures can be interpreted as percentages and range from 0 (worst) to 100 (best). The table is sorted on WCSR for the major-minor vocabulary. Algorithms that conducted training are marked with an asterisk; all others were submitted pre-trained.

MIREX Chord 2009

Algorithm	Root	MajMin	MajMinBass	Sevenths	SeventhsBass	Average combined Hamming measure	Average under-segmentation	Average over-segmentation
CM3	78.56	75.41	72.48	54.67	52.26	85.9	87.17	86.09
JR2	68.0	63.31	51.11	56.42	45.39	76.69	81.59	75.51
KO1	82.93	82.19	79.61	76.04	73.43	87.69	85.66	91.24


Billboard 2012

Algorithm	Root	MajMin	MajMinBass	Sevenths	SeventhsBass	Average combined Hamming measure	Average under-segmentation	Average over-segmentation
CM3	74.15	72.22	70.21	55.35	53.39	83.63	85.31	83.39
JR2	64.3	60.37	48.72	45.74	36.56	75.14	81.53	72.37
KO1	77.45	75.58	73.51	57.68	55.82	84.16	82.8	87.44


Billboard 2013

Algorithm	Root	MajMin	MajMinBass	Sevenths	SeventhsBass	Average combined Hamming measure	Average under-segmentation	Average over-segmentation
CM3	71.16	67.28	65.2	48.99	47.17	81.54	83.11	82.63
JR2	62.95	57.95	47.56	43.71	35.59	74.12	80.65	72.33
KO1	75.36	71.39	69.43	53.57	51.78	81.63	79.61	87.75


Comparative Statistics

An analysis of the statistical difference between all submissions can be found on the following pages:

Detailed Results

More details about the performance of the algorithms, including per-song performance and supplementary statistics, are available from these archives:

Algorithmic Output

The raw output of the algorithms is available in the archives below. It can be used to experiment with alternative evaluation measures and statistics.

Notes

This year's evaluation procedure was exactly the same as last year's, so the results can be compared directly. Moreover, two of the three submissions were resubmissions from last year, KO1 (2014) = KO1 (2013) and CM3 (2014) = CF2 (2013), and consequently have the same scores.
