2005:Audio Tempo Extraction
Martin F. McKinney (Philips) email@example.com
Automatic tempo extraction
This contest will compare current methods for the extraction of tempo from musical audio. We distinguish between notated tempo and perceptual tempo and will test for the extraction of perceptual tempo. We will also test for tempo following if there is enough interest.
We differentiate between notated tempo and perceived tempo. If you have the notated tempo (e.g., from the score), it is straightforward to attach a tempo annotation to an excerpt and run a contest for algorithms to predict the notated tempo. For excerpts for which we have no "official" tempo annotation, we can also annotate the *perceived* tempo. This is not a straightforward task and needs to be done carefully. If you ask a group of listeners (including skilled musicians) to annotate the tempo of music excerpts, they can give you different answers (they tap at different metrical levels) if they are unfamiliar with the piece. For some excerpts the perceived pulse or tempo is less ambiguous and everyone taps at the same metrical level, but for other excerpts the tempo can be quite ambiguous and you get a complete split across listeners.
The annotation of perceptual tempo can take several forms: a probability density function as a function of tempo; a series of tempos, ranked by their respective perceptual salience; etc. These measures of perceptual tempo can be used as a ground truth on which to test algorithms for tempo extraction. The dominant perceived tempo is sometimes the same as the notated tempo but not always. A piece of music can "feel" faster or slower than its notated tempo in that the dominant perceived pulse can be a metrical level higher or lower than the notated tempo.
There are several reasons to examine the perceptual tempo, either in place of or in addition to the notated tempo. For many applications of automatic tempo extractors, the perceived tempo of the music is more relevant than the notated tempo. An automatic playlist generator or music navigator, for instance, might allow listeners to select or filter music by its (automatically extracted) tempo. In this case, the "feel", or perceptual tempo may be more relevant than the notated tempo. An automatic DJ apparatus might also perform better with a representation of perceived tempo rather than notated tempo.
A more pragmatic reason for using perceptual tempo rather than notated tempo as a ground truth for our contest is that we simply do not have the notated tempo of our test set. If we notate it by having a panel of expert listeners tap along and label the excerpts, we are by default dealing with the perceived tempo. The handling of this data as ground truth must be done with care.
Last year's participants and organizers (unconfirmed!):
- Fabien Gouyon (firstname.lastname@example.org)
- Miguel Alonso (email@example.com)
- Simon Dixon (firstname.lastname@example.org)
- Christian Uhle (email@example.com)
- George Tzanetakis (firstname.lastname@example.org)
- Anssi Klapuri (email@example.com)
This section focuses on the mechanics of the method while we discuss the data (music excerpts and perceptual data) in the next section. There are two general steps to the method: 1) collection of perceptual tempo annotations; and 2) evaluation of tempo extraction algorithms.
1) Perceptual tempo data collection. The following procedure is described in more detail in McKinney and Moelants (2004) and Moelants and McKinney (2004). Listeners will be asked to tap to the beat of a series of musical excerpts. Responses will be collected and each listener's perceived tempo will be calculated. For each excerpt, a distribution of perceived tempo will be generated. A relatively simple form of perceived tempo is proposed for this contest: the two highest peaks in the perceived tempo distribution for each excerpt will be taken, along with their respective heights (normalized to sum to 1.0), as the two tempo candidates for that particular excerpt. The height of a peak in the distribution is assumed to represent the perceptual salience of that tempo. In addition to tempo, the phase and tapping times of listeners will also be recorded for possible evaluation of phase-locking and tempo following of tempo-extraction algorithms.
- McKinney, M.F. and Moelants, D. (2004), Deviations from the resonance theory of tempo induction, Conference on Interdisciplinary Musicology, Graz. URL: http://gewi.kfunigraz.ac.at/~cim04/CIM04_paper_pdf/McKinney_Moelants_CIM04_proceedings_t.pdf
- Moelants, D. and McKinney, M.F. (2004), Tempo perception and musical content: What makes a piece slow, fast, or temporally ambiguous? International Conference on Music Perception & Cognition, Evanston, IL. URL: http://www.northwestern.edu/icmpc/proceedings/ICMPC8/PDF/AUTHOR/MP040237.PDF
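As a rough illustration of the data-collection step, the sketch below turns tap times into two tempo candidates with normalized saliences. It is not the procedure from the cited papers: the use of the median inter-tap interval, the histogram bin width of 4 BPM, the crude peak picking (the two fullest non-adjacent bins), and the function names `listener_tempo` and `tempo_candidates` are all assumptions made for this example.

```python
from collections import Counter
from statistics import median


def listener_tempo(tap_times):
    """Tempo in BPM for one listener, from tap times in seconds.

    Assumption: the median inter-tap interval represents the tapped pulse.
    """
    intervals = [b - a for a, b in zip(tap_times, tap_times[1:])]
    return 60.0 / median(intervals)


def tempo_candidates(tempos_bpm, bin_width=4.0):
    """Return up to two (tempo_bpm, salience) pairs for one excerpt.

    The tempo distribution is approximated by a histogram (bin_width is a
    hypothetical choice). The fullest bin is the primary candidate; the
    fullest bin that is not adjacent to it is the secondary candidate.
    Heights are normalized so that the returned saliences sum to 1.0.
    """
    counts = Counter(round(t / bin_width) for t in tempos_bpm)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    peaks = [ranked[0]]
    for bin_index, count in ranked[1:]:
        if abs(bin_index - peaks[0][0]) > 1:  # skip the shoulder of the main peak
            peaks.append((bin_index, count))
            break
    total = sum(count for _, count in peaks)
    return [(bin_index * bin_width, count / total) for bin_index, count in peaks]
```

For example, if three listeners tap every 0.5 s and one listener taps every 1.0 s, `tempo_candidates` returns `[(120.0, 0.75), (60.0, 0.25)]`. When all listeners agree, this sketch returns a single candidate with salience 1.0, which is only one possible way to handle that case.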
2) Evaluation of tempo extraction algorithms. Algorithms will process musical excerpts and be rated on the following tasks:
- Ability to identify the most salient (primary) tempo (to within 3%)
- Ability to identify the 2nd most salient (secondary) tempo (to within 3%)
- Ability to identify an integer multiple of the primary tempo (to within 3%) (this task is a given if task 1 is performed correctly)
- Ability to identify an integer multiple of the secondary tempo (to within 3%) (this task is a given if task 2 is performed correctly)
- (optional) Ability to correctly identify phase of tempo
- (optional) Ability to follow tempo on excerpts with varying tempo
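The first four tasks can be sketched as simple checks on an algorithm's output. This is only one reading of the criteria: it assumes the 3% tolerance is taken relative to the annotated tempo, that an algorithm may report one or more tempo estimates per excerpt, and that integer multiples are checked up to a hypothetical limit `max_multiple`. The function names are made up for this example.

```python
def within_tolerance(estimate_bpm, reference_bpm, tolerance=0.03):
    """True if the estimate lies within +/- tolerance of the reference tempo."""
    return abs(estimate_bpm - reference_bpm) <= tolerance * reference_bpm


def matches_integer_multiple(estimate_bpm, reference_bpm,
                             max_multiple=4, tolerance=0.03):
    """True if the estimate matches k * reference for some k in 1..max_multiple.

    max_multiple is an assumption; the proposal does not state an upper limit.
    """
    return any(
        within_tolerance(estimate_bpm, k * reference_bpm, tolerance)
        for k in range(1, max_multiple + 1)
    )


def score_excerpt(estimates_bpm, primary_bpm, secondary_bpm):
    """Score one excerpt on tasks 1 to 4; each task passes if any estimate matches."""
    return {
        "primary": any(within_tolerance(e, primary_bpm) for e in estimates_bpm),
        "secondary": any(within_tolerance(e, secondary_bpm) for e in estimates_bpm),
        "primary_multiple": any(
            matches_integer_multiple(e, primary_bpm) for e in estimates_bpm
        ),
        "secondary_multiple": any(
            matches_integer_multiple(e, secondary_bpm) for e in estimates_bpm
        ),
    }
```

Because the multiple check includes k = 1, tasks 3 and 4 pass automatically whenever tasks 1 and 2 pass, as noted in the task list.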
Relevant Test Collections
From previous studies on tempo perception (see references) we have 3 sets of annotated musical excerpts:
- 24 10-second excerpts annotated by 33 subjects; excerpts were taken primarily from Western popular music.
- 60 30-second excerpts annotated by 24 subjects; excerpts were taken from Western popular, classical and world ethnic music.
- 50 30-second excerpts annotated by 40 subjects; excerpts were taken from a broad range of musical styles.
The 10-second excerpts from our first set may be too short. I think we might want to stick with longer excerpts (15 seconds or longer). In addition, we could conduct further listening/tapping sessions in order to supplement the current set of annotations. Source material could come from different sources:
- Miguel Alonso (ENST) has a database with several hundred excerpts
- (?) Fabien Gouyon (?) Last year's database (?)
- Other labs working on tempo extraction
- Public music databases
We will also provide some measures of statistical significance to the results, most likely through bootstrapping the test data.
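One way the bootstrapping could look is a percentile confidence interval on an algorithm's accuracy, obtained by resampling excerpts with replacement. The sketch below assumes per-excerpt pass/fail outcomes (1 or 0) for a single task; the function name, the number of resamples, and the choice of a percentile interval are illustrative assumptions, not a settled part of the evaluation.

```python
import random


def bootstrap_accuracy_interval(outcomes, n_resamples=2000,
                                confidence=0.95, seed=0):
    """Percentile bootstrap interval for mean accuracy over excerpts.

    outcomes: list of 0/1 results, one per test excerpt.
    Excerpts are resampled with replacement n_resamples times, and the
    interval is read off the sorted resampled accuracies.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower_index = int((1 - confidence) / 2 * n_resamples)
    upper_index = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]
```

If the intervals of two algorithms on the same task do not overlap, that is informal evidence of a real difference; a paired resampling of the per-excerpt differences would be a sharper test.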
Concerning copyright issues: I'm not sure if there will be any issues here if all music is simply collected in one place and then the contest algorithms are run there. In addition, I've heard that it is legal to use/distribute short excerpts of recorded audio without violation. Can anyone confirm/deny or provide more info on copyright issues for short excerpts?
I think that your proposal is clearly written and definitely appropriate for ISMIR. I agree with your justification for the analysis of perceptual tempo, especially for applications related to human interaction. However, in order to build upon last year's contest I would support the inclusion of 'phase locking' and 'tempo following' as areas to investigate under this proposal, in addition to some further consideration of the evaluation procedures.
I think your list of participants is realistic - equally I believe many of these people have published work on beat tracking as well as tempo analysis, which suggests there should be support for an expanded proposal.
In terms of the data to be analyzed, I agree that longer excerpts are necessary, especially if the proposal is to be expanded to incorporate tempo following and phase information. I wonder if it might be interesting to classify input signals not only by genre (as you suggest), but also by the presence or absence of percussion. This would be another way to demonstrate the generality of the entered algorithms, but might also provide some further insight into those signals for which the perceptual tempo is most open to subjective interpretation - put simply: is there more agreement (computationally and in annotations) when drums are present? I would also like to see some consideration given to examples that aren't in 4/4 time, as well as those which are heavily syncopated (if not already present in the proposed databases).
I have a couple of concerns regarding the evaluation criteria, particularly related to the second most salient level:
- Should it be mandatory for participants to look for more than one appropriate tempo for a given input signal?
- I can see from your description of the data collection that extracting two levels is not too hard; however, is it generally intuitive whether this secondary level is faster or slower than the primary level? I wonder if it might be more valuable to find something more explicit, like the tatum (fastest metrical level), or time-signature. I'm not sure if these would count as perceptual tempi if no one actually chooses to tap that quickly or slowly.
- In cases (if any) where there is complete agreement on the perceptual level, how would the second most salient level be defined?
I think you're right to suggest a tempo dependent threshold, but I'm interested as to where this value of 3% comes from. Might it be a little too strict? Was this the value suggested for last year's contest?
Given that your annotated data for perceptual tempo is derived from subjects 'tapping' along to music, it seems worthwhile expanding the scope of this proposal to include phase information and tempo following, perhaps optionally making this a tempo and beat tracking contest. Again I'm aware of the potential problems in deriving a globally acceptable strategy for the evaluation of beat locations (current examples include: Goto-97, the longest continuous correctly tracked segment, or Scheirer-98, RMS deviation between algorithm and annotated beats), but I think this is a factor which should be addressed.
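To make the two kinds of beat-location measure mentioned above concrete, here is a minimal sketch in their spirit. It does not reproduce the published definitions: matching each annotated beat to its nearest estimated beat, the 70 ms correctness window, and the function names are all assumptions for illustration.

```python
import math


def nearest_deviations(estimated_beats, annotated_beats):
    """Signed offset (s) from each annotated beat to its closest estimated beat."""
    return [
        min((e - a for e in estimated_beats), key=abs)
        for a in annotated_beats
    ]


def rms_deviation(estimated_beats, annotated_beats):
    """Root-mean-square offset between annotated beats and nearest estimates."""
    deviations = nearest_deviations(estimated_beats, annotated_beats)
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))


def longest_tracked_run(estimated_beats, annotated_beats, window=0.07):
    """Longest run of consecutive annotated beats matched within +/- window (s).

    window is a hypothetical tolerance chosen for this sketch.
    """
    longest = current = 0
    for d in nearest_deviations(estimated_beats, annotated_beats):
        current = current + 1 if abs(d) <= window else 0
        longest = max(longest, current)
    return longest
```

Note that this simple matching does not penalize extra estimated beats between annotated ones, so a real evaluation would need an additional rule for false positives.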
The problem is a relevant MIR task which is clearly defined. The proposed participants seem likely to participate indeed.
I do not have much to say, since this proposal is already very solid.
I appreciate the fact that the potential participants already own a large amount of annotated data, so that the work to annotate new data will be limited. However, it seems that a large number of listeners is needed for annotation, because several perceptual tempos are taken into account for evaluation. Would it be possible to propose evaluation measures that are relevant whatever the number of annotators (probably fewer than five annotators will be available for new annotations)? Or to evaluate performance differently on each file depending on the number of annotators?