Difference between revisions of "2009:Audio Chord Detection"

From MIREX Wiki

Revision as of 14:36, 17 August 2009

Description

The text of this section is copied from the 2008 page. This task was first run in 2008. Please add your comments and discussions for 2009.

For many applications in music information retrieval, extracting the harmonic structure is very desirable, for example for segmenting pieces into characteristic segments, for finding similar pieces, or for semantic analysis of music.

The extraction of the harmonic structure requires the detection of as many chords as possible in a piece. That includes the characterisation of chords with a key and type as well as a chronological order with onset and duration of the chords.

Although some publications are available on this topic [1,2,3,4,5], comparison of the results is difficult, because different measures are used to assess the performance. To overcome this problem an accurately defined methodology is needed. This includes a repertory of the findable chords, a defined test set along with ground truth and unambiguous calculation rules to measure the performance.

With this in mind, we suggest introducing the new evaluation task Audio Chord Detection.

The deadline for this task is TBA.

Discussions for 2009

Data

As this is intended for music information retrieval, the analysis should be performed on real-world audio, not resynthesized MIDI or special renditions of single chords. We suggest the test bed consist of WAV files in CD quality (with a sampling rate of 44.1 kHz and a resolution of 16 bits). A representative test bed should consist of more than 50 songs of different genres such as pop, rock, jazz and so on.

For each song in the test bed, a ground truth is needed. This should comprise all detectable chords in this piece with their tonic, type and temporal position (onset and duration) in a machine readable format that is still to be specified.

To define the ground truth, a set of detectable chords has to be identified. We propose to use the following set of chords built upon each of the twelve semitones.

Triads: major, minor, diminished, augmented, suspended4
 Quads: major-major 7, major-minor 7, major add9, major maj7/#5 
        minor-major 7, minor-minor 7, minor add9, minor 7/b5
        maj7/sus4, 7/sus4


Christopher Harte's Beatles dataset is used for the evaluations. This dataset consists of 12 Beatles albums [6]. An approach for text annotation of musical chords is presented in [6].

Evaluation

Two common measures from the field of information retrieval are recall and precision. They can be used to evaluate a chord detection system.

Recall: number of time units where the chords have been correctly identified by the algorithm divided by the number of time units which contain detectable chords in the ground truth.

Precision: number of time units where the chords have been correctly identified by the algorithm divided by the total number of time units where the algorithm detected a chord event.
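
As a minimal sketch of the two definitions above, the following Python snippet computes frame-based recall and precision from two equally long label sequences. The function name `frame_recall_precision`, the use of `None` to mark a time unit without a chord, and the example labels are assumptions made for illustration; they are not part of the task specification.

```python
from typing import Optional, Sequence, Tuple

NO_CHORD = None  # assumption: None marks a time unit without a chord label


def frame_recall_precision(
    truth: Sequence[Optional[str]],
    estimate: Sequence[Optional[str]],
) -> Tuple[float, float]:
    """Return (recall, precision) over equally spaced time units."""
    if len(truth) != len(estimate):
        raise ValueError("both sequences must cover the same time units")
    # Time units where the algorithm reports the chord given in the ground truth.
    correct = sum(
        1 for t, e in zip(truth, estimate) if t is not NO_CHORD and t == e
    )
    # Time units that contain a detectable chord in the ground truth.
    truth_units = sum(1 for t in truth if t is not NO_CHORD)
    # Time units where the algorithm detected a chord event.
    detected_units = sum(1 for e in estimate if e is not NO_CHORD)
    recall = correct / truth_units if truth_units else 0.0
    precision = correct / detected_units if detected_units else 0.0
    return recall, precision


# Hypothetical six-frame example.
truth = ["C", "C", "G", "G", None, "A:min"]
estimate = ["C", "G", "G", None, None, "A:min"]
recall, precision = frame_recall_precision(truth, estimate)
print(recall, precision)
```

In this example three of the five ground-truth chord frames are matched, and the algorithm reports a chord in four frames, so the two measures differ. If every time unit carries a label in both sequences, the two denominators coincide and precision equals recall.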



Points to discuss:

  • The Precision measure has not been used last year, and I believe it should not because (unlike in beat extraction) we can assume a contiguous sequence of chords, i.e. all time units should feature a chord label. --Matthias 11:04, 27 June 2009 (UTC)
  • I would like to disagree on Matthias' previous point: I think we cannot assume that there is a chord present in every frame. One can think, for instance, of a drum solo, an a cappella break, ethnic music or simply the beginning and ending of a file. In melody extraction or beat detection, there also isn't a continuity assumption. I must say that at the moment, our system isn't able to generate a no-chord either, so it is not in my personal interest to add this to the evaluation, but I feel this should be part of a general chord extraction system. I've also learned from some premature experiments that with the current frame-based evaluation, it is actually not even beneficial to include such a no-chord generator, because of the inequality of prior chances between a chord and a no-chord (14% for a N.C. in our little dataset, I suspect it to be even less for the Beatles set). The consequence is that the chord/no-chord distinction must be very accurate in order to increase the performance.

A related, minor topic is the naming of this task. Why isn't it "audio chord extraction", just like "melody extraction"? For me, "chord detection" is making the distinction between chords and no-chords, and "chord extraction" is naming detected chords. Anyway, just nitpicking on that one. --Johan 15:43, 16 July 2009 (CET)

  • I think we can assume a contiguous sequence of chords if we treat "no chord" as a chord. --Matthias 16:01, 7 August 2009 (UTC)
  • I believe we should move forward in two ways to get a more meaningful evaluation:
    1. evaluate separate recall measures for several chord classes, my proposal is major, minor, diminished, augmented, dominant (meaning major chords with a minor seventh). A final recall score can then be calculated as a (weighted) average of the recall on different chords. --Matthias 11:04, 27 June 2009 (UTC)
    2. We use just triads major, minor, diminished, augmented, dominant which I think is a more sensible distinction. Once you start using quads, why limit yourself to dominant 7 and not use minor 7, major 7, full diminished, etc. So I'm more in favour of just triads (maybe add sus too) or more quads. --Johan 15:48, 16 July 2009 (CET)
    3. Segmentation should be considered. For example, a chord extraction algorithm that has reasonable recall may still be heavily fragmented thus producing an output difficult to read for humans. One measure to check for similarity in segmentation is directional Hamming distance (or divergence). --Matthias 11:04, 27 June 2009 (UTC)
    4. Agree, while the frame-based evaluation is certainly easy, it is not the most musically sensible. An evaluation on note-level or chord-segment basis might be a little too complicated for now, but this is a start. --Johan 15:51, 16 July 2009 (CET)
    5. Do you think we could consider several evaluation cases for chord detection with various chord dictionaries (for instance one with major/minor triads, one with major/minor/diminished/augmented/dominant, etc.), so that each participant can choose a case that can be handled by his/her algorithm? --Helene (IRCAM)
    6. to Helene: Several (maybe two) evaluation cases /could/ be good, but I think everyone's algorithm should be tested on every task, I think choosing the one you want would mean you can't compare it to other people's method.
    7. to Johan: in your chord list, did you mean to give the same list as I did? Anyway, I like to add "dominant" because it is used often, and musically (Jazz harmony theory) there's a functional difference between "dominant" and "major" (not between "minor" and "minor 7", and not between "major" and "major 7" or "major 6"). --Matthias 16:01, 7 August 2009 (UTC)
    8. to Matthias: I think that using the label "dominant" as you suggest here is not the correct use of that musical term in this context. In music theory (both in classical and jazz harmony) it is true that a dominant-seventh chord is always the major triad + minor 7th chord shape i.e. (1,3,5,b7). However, a (1,3,5,b7) chord does not always function as a dominant. Whether such a chord can be labelled dominant or not is entirely based on the chord's position in a given chord sequence relative to other chords which define its function in the progression. It is precisely for that reason that I argue against the use of the term 'dominant' for context-free chord labelling in the ISMIR05 chord labels paper [6]. --Chrish 17:00, 10 August 2009 (UTC)
    9. It seems counter-intuitive to have an evaluation that includes some quads but not all. It is fair to evaluate across the triad shapes major, minor, augmented, diminished and suspended (although sus2 and sus4 are inversions of each other) as these are the naturally occurring triad shapes in western harmony. It would seem sensible to include all the other quads that are labelled "7th" chords if including the (1,3,5,b7) shape in an evaluation. Given this problem, one possible way to compare algorithms that detect different sets of chord labels would be to split the results between triad recognition and quad/quint etc recognition. All algorithms can be tested on the triad evaluation - any algorithm that can detect quads can be compared directly against an algorithm that deals only with triads simply by taking the equivalent first three intervals of each chord label in the transcription as a triad. Only those algorithms that can recognise quads need to be evaluated in results that include quad chords. To make this process easy, it might be sensible for the labels that chord recognition algorithms produce to be given in terms of the intervals in the chords themselves rather than a chord name - i.e. a C major could be "C:(1,3,5)" and C major seventh "C:(1,3,5,7)" - both would evaluate as C major in a triad evaluation but the second one could also be evaluated in a test that looked at quads as well. This would also take away problems with possible labelling ambiguities that chord name labels could introduce. A list of the triads and quads that are acceptable in each level of evaluation would need to be drawn up but there is a list something like that at the top of this page already. --Chrish 17:00, 10 August 2009 (UTC)
    10. I think keeping the MIREX2008 evaluation procedure would make sense. I like the idea of a "merged maj/min" score in order to evaluate the root precision, and then maybe another score taking into account only the root and mode (maj/min). Then we could progressively extend the chord dictionary by adding triads and quads (as suggested by Chris and Johan) and calculating new scores based on these extensions. --Thomas 12:45, 12 August 2009 (UTC)
    11. I like the idea of using a segmentation measure based on directional Hamming distance. We must make sure that the measure captures both over-segmentation (fragmentation) and under-segmentation though. The fragmentation measure (1-f) based on the inverse directional Hamming distance described by Abdallah et al. [8] only measures fragmentation by measuring the distance d_MG between the estimated sequence (M) and the ground-truth annotation (G). They also propose an under-segmentation measure (1-m) which uses the forward directional Hamming distance d_GM. Using the fragmentation measure alone would mean that a chord recognition algorithm that output one chord label for the entire piece would score 100% correct for fragmentation. Both of these measures give a value between 0 and 1, so we could combine them to give an overall "chord segmentation" measure: 1-max(f,m) (f and m are not independent, so it's probably best to just use the worst one rather than combine them geometrically). I think this measure could complement the frame-based recall quite well. --Chrish 10:00, 11 August 2009 (UTC)
    12. anyone know how to make the LaTeX maths work on the wiki? --Chrish 17:01, 10 August 2009 (UTC)
  • Something to consider when broadening the scope of used chords is the inequality in prior chance of different chords (much like the problem with chords/no-chords I mentioned above). When looking for augmented and diminished triads in the Beatles set in addition to major and minor, I'm quite positive the (or at least my) overall performance will decrease. Some processing/selecting could level the priors, but just limiting the data set to the duration of the least frequent chord won't leave us with much data, I'm afraid. The thing is of course that the inequality is also there in reality, so I'm not really convinced myself that this should be done. Another option is not changing the data, but letting the evaluation take it into account. --Johan 16:23, 16 July 2009 (CET)
  • Should chord data be expressed in absolute (e.g. "F major-minor 7") or relative (e.g. "C: IV major-minor 7") terms?
  • Should different inversions of chords be considered in the evaluation process?
  • What temporal resolution should be used for ground truth and results?
  • How should enharmonic and other confusions of chords be handled?
  • How will Ground Truth be determined?
  • What degree of chordal/tonal complexity will the music contain?
  • Will we include any atonal or polytonal music in the Ground Truth dataset?
  • What is the maximal acceptable onset deviation between ground truth and result?
  • What file format should be used for ground truth and output?
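
The combined segmentation measure 1-max(f,m) proposed in item 11 of the discussion above can be sketched as follows. This is only one reading of that proposal: it assumes each transcription is a list of (onset, offset) pairs in seconds that tile the same time span, and that chord labels are ignored when comparing boundaries. The names `directional_hamming` and `segmentation_score` are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (onset, offset) in seconds


def overlap(a: Segment, b: Segment) -> float:
    """Length of the time interval shared by two segments."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def directional_hamming(outer: List[Segment], inner: List[Segment]) -> float:
    """For each segment in `outer`, add the time covered by all `inner`
    segments except the one that overlaps it most."""
    total = 0.0
    for seg in outer:
        overlaps = [overlap(seg, other) for other in inner]
        total += sum(overlaps) - max(overlaps, default=0.0)
    return total


def segmentation_score(truth: List[Segment], estimate: List[Segment]) -> float:
    """Return 1 - max(f, m), where f penalises fragmentation and m
    penalises under-segmentation (both normalised by the duration)."""
    duration = truth[-1][1] - truth[0][0]
    if duration <= 0:
        raise ValueError("ground truth must span a positive duration")
    f = directional_hamming(truth, estimate) / duration  # fragmentation
    m = directional_hamming(estimate, truth) / duration  # under-segmentation
    return 1.0 - max(f, m)


# Hypothetical 8-second piece with two ground-truth chords.
truth = [(0.0, 4.0), (4.0, 8.0)]
one_label = [(0.0, 8.0)]
fragmented = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 8.0)]

print(segmentation_score(truth, truth))       # 1.0
print(segmentation_score(truth, one_label))   # 0.5
print(segmentation_score(truth, fragmented))  # 0.625
```

The single-label output has f = 0, so it would score perfectly on fragmentation alone; taking the worse of f and m is what penalises it here.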

Comments by MB 08.12

First I want to make clear that we are using Christopher Harte's Beatles dataset, which includes quad chords. Last year the evaluations were based only on major, minor and no-chord labels. This year, if there are enough participants, we can extend the evaluations to the rest of the triads (diminished, augmented, suspended) and to quads. Please vote:

<poll>
How would you like the evaluations to be performed?
Same as last year: evaluate on major, minor and no chord
All triads: major, minor, diminished, augmented, suspended + no chord
All triads + quads + no chord
</poll>

If we decide to extend this task, we have to change the I/O format, for example to C(1,3,5,7), so that we can easily evaluate the same results against triads or quads. In terms of evaluation, we performed a really simple one last year. This year we welcome evaluation scripts written by the community.

A possible simplification of the data output could be to use chromatic numeric notation for the intervals. For example, C(1,3,5,7) would be C(0,4,7,11) or to be a bit more pure, something like 0(0,4,7,11) is cool but a bit redundant, leading us to 0,4,7,11. Dr. Downie prefers the chromatic numeric notation as it instantaneously gets rid of the enharmonic spelling problem.
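
To illustrate the chromatic numeric notation, here is a rough sketch that converts an interval-style label such as C(1,3,5,7) into a root pitch class and a list of semitone intervals, with an optional reduction to the first three intervals for a triad-level evaluation. The label grammar accepted here (an upper-case root with optional b/# accidentals, an optional colon, and a parenthesised degree list) is an assumption based on the examples on this page, not an agreed format, and all function names are hypothetical.

```python
import re
from typing import List, Tuple

MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]  # semitones of scale degrees 1..7
NOTE_TO_PC = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def degree_to_semitones(degree: str) -> int:
    """Map a degree such as '3', 'b7' or '#5' to semitones above the root."""
    match = re.fullmatch(r"([b#]*)([1-9]\d*)", degree.strip())
    if match is None:
        raise ValueError(f"unrecognised degree: {degree!r}")
    accidentals, number = match.group(1), int(match.group(2))
    octave, step = divmod(number - 1, 7)
    semitones = 12 * octave + MAJOR_SCALE[step]
    semitones += accidentals.count("#") - accidentals.count("b")
    return semitones % 12  # fold compound intervals (e.g. 9) into one octave


def root_to_pitch_class(root: str) -> int:
    """Map a root such as 'C', 'F#' or 'Bb' to a pitch class 0..11."""
    pitch_class = NOTE_TO_PC[root[0]]
    pitch_class += root[1:].count("#") - root[1:].count("b")
    return pitch_class % 12


def to_chromatic(label: str, as_triad: bool = False) -> Tuple[int, List[int]]:
    """Return (root pitch class, semitone intervals) for one chord label."""
    match = re.fullmatch(r"([A-G][b#]*):?\(([^()]*)\)", label.strip())
    if match is None:
        raise ValueError(f"unrecognised chord label: {label!r}")
    degrees = match.group(2).split(",")
    if as_triad:
        degrees = degrees[:3]  # keep only the first three intervals
    return root_to_pitch_class(match.group(1)), [
        degree_to_semitones(d) for d in degrees
    ]


print(to_chromatic("C(1,3,5,7)"))                  # (0, [0, 4, 7, 11])
print(to_chromatic("C:(1,3,5,7)", as_triad=True))  # (0, [0, 4, 7])
print(to_chromatic("F(1,b3,5,b7)"))                # (5, [0, 3, 7, 10])
```

Because both the root and the intervals end up as integers, enharmonically equivalent spellings such as C# and Db map to the same values.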



Looking at last year's results for Chord Detection (https://www.music-ir.org/mirex/2008/index.php/Audio_Chord_Detection_Results), the performance increase gained from training seems to be insignificant. Would you consider dropping the train/test part of the task this year?

I/O Format

The I/O format described in [5] will be used.



Submission Format

Submissions have to conform to the specified format below:

extractFeaturesAndTrain  "/path/to/trainFileList.txt"  "/path/to/scratch/dir"  

where trainFileList.txt contains the path to each WAV file. The features extracted at this stage can be stored under "/path/to/scratch/dir". The ground truth files for the supervised learning will be in the same path, with a ".txt" extension at the end. For example, for "/path/to/trainFile1.wav" there will be a corresponding ground truth file called "/path/to/trainFile1.wav.txt".

For testing:

doChordID.sh "/path/to/testFileList.txt"  "/path/to/scratch/dir" "/path/to/results/dir"  

If there is no training, you can ignore the second argument here. In the results directory, there should be one file for each test file, with the same name as the test file plus ".txt". The results file should be structured as described below by Matti.


Programs can use their working directory if they need to keep temporary cache files or internal debugging info. Stdout and stderr will be logged.
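
The file naming conventions in this section could be handled along the following lines. This is a sketch only: the helper names are hypothetical, and it assumes that a result file keeps the full test file name (including ".wav") before the added ".txt", mirroring the ground truth convention.

```python
from pathlib import Path
from typing import List


def read_file_list(list_path: str) -> List[str]:
    """Read one WAV path per line, skipping blank lines."""
    with open(list_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def ground_truth_path(wav_path: str) -> Path:
    """Ground truth sits next to the audio file, with '.txt' appended."""
    return Path(wav_path + ".txt")


def result_path(wav_path: str, results_dir: str) -> Path:
    """One result file per test file: same file name plus '.txt'."""
    return Path(results_dir) / (Path(wav_path).name + ".txt")


print(ground_truth_path("/path/to/trainFile1.wav").as_posix())
print(result_path("/path/to/testFile1.wav", "/path/to/results/dir").as_posix())
```

A submission script would typically call `read_file_list` on its first argument and write one output file per entry at the location given by `result_path`.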

Potential Participants

  • Johan Pauwels/Ghent University, Belgium (firstname.lastname@elis.ugent.be) (still interested on Aug, 5th but also on holiday from 8-20 Aug)
  • Matthias Mauch, Centre for Digital Music, Queen Mary, University of London --Matthias 10:33, 27 June 2009 (UTC)
  • Laurent Oudre, TELECOM ParisTech, France (firstname.lastname@telecom-paristech.fr) (still interested, probably 2 algorithms)
  • Maksim Khadkevich, Fondazione Bruno Kessler, Italy (lastname_at_fbk_dot_eu) (still interested, 1 algorithm)
  • Thomas Rocher, LaBRI Université Bordeaux 1, France (firstname.lastname@labri.fr)
  • Yushi Ueda, The University of Tokyo, Japan (lastname@hil.t.u-tokyo.ac.jp)
  • Christopher Harte, Centre for Digital Music, Queen Mary, University of London (firstname_dot_lastname_at_elec_dot_qmul_dot_ac_dot_uk)
  • Helene Papadopoulos, IRCAM (firstname_dot_lastname_at_ircam.fr)
  • Adrian Weller and Daniel Ellis, Columbia University, NY, USA (aw2506@columbia.edu)
  • Your name here

Bibliography

1. Harte, C.A. and Sandler, M.B. (2005). Automatic chord identification using a quantised chromagram. Proceedings of 118th Audio Engineering Society's Convention.

2. Sailer, C. and Rosenbauer, K. (2006). A bottom-up approach to chord detection. Proceedings of International Computer Music Conference 2006.

3. Shenoy, A. and Wang, Y. (2005). Key, chord, and rhythm tracking of popular music recordings. Computer Music Journal 29(3), 75-86.

4. Sheh, A. and Ellis, D.P.W. (2003). Chord segmentation and recognition using EM-trained hidden Markov models. Proceedings of 4th International Conference on Music Information Retrieval.

5. Yoshioka, T. et al. (2004). Automatic chord transcription with concurrent recognition of chord symbols and boundaries. Proceedings of 5th International Conference on Music Information Retrieval.

6. Harte, C., Sandler, M., Abdallah, S. and Gómez, E. (2005). Symbolic representation of musical chords: a proposed syntax for text annotations. Proceedings of 6th International Conference on Music Information Retrieval.

7. Papadopoulos, H. and Peeters, G. (2007). Large-scale study of chord estimation algorithms based on chroma representation and HMM. Proceedings of 5th International Conference on Content-Based Multimedia Indexing.

8. Abdallah, S., Noland, K., Sandler, M., Casey, M. and Rhodes, C. (2005). Theory and evaluation of a Bayesian music structure extractor. Proceedings of 6th International Conference on Music Information Retrieval, pp. 420-425.