2008:Multiple Fundamental Frequency Estimation & Tracking



That a complex music signal can be represented by the F0 contours of its constituent sources is a very useful concept for most music information retrieval systems. There have been many attempts at multiple (aka polyphonic) F0 estimation and melody extraction, a related area. The goal of multiple F0 estimation and tracking is to identify the active F0s in each time frame and to track notes and timbres continuously in a complex music signal. In this task, we would like to evaluate state-of-the-art multiple-F0 estimation and tracking algorithms. Since F0 tracking of all sources in a complex audio mixture can be very hard, we are restricting the problem to 3 cases:

1. Estimate active fundamental frequencies on a frame-by-frame basis.

2. Track note contours on a continuous time basis (as in audio-to-MIDI). This task will also include a piano transcription subtask.

3. Track timbre on a continuous time basis.

The deadline for this task is August 22nd.


A woodwind quintet transcription of the fifth variation from L. van Beethoven's Variations for String Quartet Op.18 No. 5. Each part (flute, oboe, clarinet, horn, or bassoon) was recorded separately while the performer listened to the other parts (recorded previously) through headphones. Later the parts were mixed to a monaural 44.1 kHz / 16-bit file.

Synthesized pieces using RWC MIDI and RWC samples. Includes pieces from Classical and Jazz collections. Polyphony changes from 1 to 4 sources.

Polyphonic piano recordings generated using a disklavier playback piano.

So, there are 6 30-sec clips for each polyphony (2-3-4-5) for a total of 30 examples, plus 10 30-sec polyphonic piano clips. Please email me your estimated running time (as a multiple of real time); if we believe everybody's algorithm is fast enough, we can increase the number of test samples. (There were 90 x real-time algorithms for melody extraction tasks in the past.)

All files are in 44.1 kHz / 16-bit WAV format. The development set can be found at Development Set for MIREX 2007 MultiF0 Estimation Tracking Task.

Send an email to mertbay@uiuc.edu for the username and password.


This year, we would like to discuss different evaluation methods. Last year's results show that, on note tracking, algorithms performed poorly when evaluated using note offsets. Below are the evaluation methods we used last year:

For Task 1 (frame level evaluation), systems will report the active pitches every 10 ms. Precision (the proportion of retrieved pitches that are correct in each frame) and Recall (the ratio of correct pitches to all ground-truth pitches in each frame) will be reported. A returned pitch is assumed to be correct if it is within a half semitone (±3%) of a ground-truth pitch for that frame. Only one ground-truth pitch can be associated with each returned pitch. Also, as suggested, an error score as described in Poliner and Ellis, p. 5, will be calculated. The frame-level ground truth will be calculated by YIN and hand-corrected.
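The frame-level matching rule above can be sketched as follows. This is only an illustration, not the actual MIREX evaluation script: the function names are hypothetical, the tolerance is expressed as 0.5 semitone rather than 3%, the one-to-one association is done greedily (closest unused ground-truth pitch first), and the counts are pooled over all frames before dividing, which is an assumption.

```python
import math

def count_correct(returned, reference, tol_semitones=0.5):
    """Count returned pitches matching a distinct reference pitch (greedy, one-to-one)."""
    unused = [f for f in reference if f > 0]  # zeros are treated as "no pitch"
    correct = 0
    for f in returned:
        if f <= 0:
            continue
        best, best_dist = None, tol_semitones
        for i, ref in enumerate(unused):
            dist = abs(12 * math.log2(f / ref))  # distance in semitones
            if dist <= best_dist:
                best, best_dist = i, dist
        if best is not None:
            unused.pop(best)  # each reference pitch can be used only once
            correct += 1
    return correct

def frame_precision_recall(returned_frames, reference_frames):
    """Pool counts over all frames (an assumption) and return (precision, recall)."""
    tp = n_ret = n_ref = 0
    for ret, ref in zip(returned_frames, reference_frames):
        tp += count_correct(ret, ref)
        n_ret += sum(1 for f in ret if f > 0)
        n_ref += sum(1 for f in ref if f > 0)
    precision = tp / n_ret if n_ret else 0.0
    recall = tp / n_ref if n_ref else 0.0
    return precision, recall
```

A greedy association is the simplest choice; an optimal assignment could differ in rare cases where two returned pitches compete for the same ground-truth pitch.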

For Task 2 (note tracking), again Precision (the ratio of correctly transcribed ground-truth notes to the number of transcribed notes) and Recall (the ratio of correctly transcribed ground-truth notes to the number of ground-truth notes for that input clip) will be reported. A ground-truth note is assumed to be correctly transcribed if the system returns a note that is within a half semitone (±3%) of that note AND the returned note's onset is within a 50 ms range (±25 ms) of the onset of the ground-truth note, and its offset is within a 20% range of the ground-truth note's offset. Again, one ground-truth note can only be associated with one transcribed note.
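The note-level criteria above could be sketched like this. It is a hedged illustration with hypothetical names, not the official scorer. Three points are assumptions: the onset tolerance defaults to ±25 ms, the "20% range" for the offset is read as 20% of the ground-truth note's duration, and precision and recall follow the usual definitions (correct notes over returned notes, and correct notes over ground-truth notes). Notes are `(onset, offset, f0)` tuples in seconds and Hz.

```python
import math

def note_matches(ref, est, onset_tol=0.025, offset_ratio=0.2, pitch_tol=0.5):
    """True if an estimated note satisfies the pitch, onset and offset criteria."""
    ref_on, ref_off, ref_f0 = ref
    est_on, est_off, est_f0 = est
    pitch_ok = abs(12 * math.log2(est_f0 / ref_f0)) <= pitch_tol
    onset_ok = abs(est_on - ref_on) <= onset_tol
    # Assumption: offset tolerance is a fraction of the reference note duration.
    offset_ok = abs(est_off - ref_off) <= offset_ratio * (ref_off - ref_on)
    return pitch_ok and onset_ok and offset_ok

def note_precision_recall(reference, estimated, **criteria):
    """Greedy one-to-one matching of ground-truth notes to estimated notes."""
    unused = list(estimated)
    correct = 0
    for ref in reference:
        for i, est in enumerate(unused):
            if note_matches(ref, est, **criteria):
                unused.pop(i)  # an estimated note can be matched only once
                correct += 1
                break
    precision = correct / len(estimated) if estimated else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall
```

Passing a very large `offset_ratio` effectively gives an onset-only evaluation, and `onset_tol=0.05` gives the wider onset window.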

The ground truth for this task will be annotated by hand. An amplitude threshold relative to the file/instrument will be determined. The note onset is going to be set to the time where its amplitude rises above the threshold, and the offset is going to be set to the time where the note's amplitude decays below the threshold. The ground-truth pitch is going to be set as the average F0 between the onset and the offset of the note. In the case of legato, the onset/offset is going to be set to the time where the F0 deviates by more than 3% from the average F0 throughout the note up to that point. There is not going to be any vibrato larger than a half semitone in the test data.
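As a rough illustration of the threshold rule described above (not the actual annotation procedure, which is done by hand), the sketch below takes a per-frame amplitude envelope and F0 track for a single isolated note. The function name and the input representation are assumptions, and the legato rule is not covered.

```python
def annotate_note(times, amplitudes, f0s, threshold):
    """Return (onset, offset, mean_f0) for one note, or None if it never exceeds the threshold."""
    onset_idx = offset_idx = None
    for i, a in enumerate(amplitudes):
        if onset_idx is None:
            if a > threshold:
                onset_idx = i      # amplitude first rises above the threshold
        elif a < threshold:
            offset_idx = i         # amplitude first decays below the threshold
            break
    if onset_idx is None:
        return None
    # If the note never decays below the threshold, use the end of the data.
    end = offset_idx if offset_idx is not None else len(amplitudes)
    segment = f0s[onset_idx:end]
    mean_f0 = sum(segment) / len(segment)
    offset_time = times[end] if end < len(times) else times[-1]
    return times[onset_idx], offset_time, mean_f0
```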

Different statistics can also be reported if agreed by the participants.

Submission Format

Submissions have to conform to the specified format below:

doMultiF0 "path/to/file.wav"  "path/to/output/file.F0" 

path/to/file.wav: Path to the input audio file.

path/to/output/file.F0: The output file.

Programs can use their working directory if they need to keep temporary cache files or internal debugging info. Stdout and stderr will be logged.

For each task, the format of the output file is going to be different: For the first task, F0-estimation on frame basis, the output will be a file where each row has a time stamp and a number of active F0s in that frame, separated by a tab for every 10ms increments.

Example :

time	F01	F02	F03	
time	F01	F02	F03	F04
time	...	...	...	...

which might look like:

0.78	146.83	220.00	349.23
0.79	349.23	146.83	369.99	220.00	
0.80	...	...	...	...
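For reference, a file in this frame-level format could be read back along the following lines. The helper name is hypothetical. Rows are split on any whitespace so that trailing tabs are tolerated, and zero entries are dropped on the assumption that a 0 denotes a silent source.

```python
def parse_frame_rows(text):
    """Parse 'time<TAB>F0<TAB>F0...' rows into a list of (time, [f0, ...]) tuples."""
    frames = []
    for line in text.splitlines():
        fields = line.split()          # tolerant of trailing tabs and spaces
        if not fields:
            continue                   # skip blank lines
        time = float(fields[0])
        values = [float(x) for x in fields[1:]]
        f0s = [f for f in values if f > 0]  # assumption: 0 means "no pitch"
        frames.append((time, f0s))
    return frames
```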

For the second task, for each row, the file should contain the onset, offset and the F0 of each note event separated by a tab, ordered in terms of onset times:

onset	offset	F01
onset	offset	F02
...	...	...

which might look like:

0.68	1.20	349.23
0.72	1.02	220.00
...	...	...
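A minimal sketch of producing rows in this note-level format, sorted by onset as required. The function name and the two-decimal formatting are assumptions; the format itself only asks for tab-separated onset, offset and F0.

```python
def format_note_rows(notes):
    """Format (onset, offset, f0) tuples as tab-separated rows ordered by onset time."""
    rows = sorted(notes, key=lambda note: note[0])
    return "".join(f"{on:.2f}\t{off:.2f}\t{f0:.2f}\n" for on, off, f0 in rows)
```

The returned string can be written directly to the output path given as the second command-line argument.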

The DEADLINE is Friday August 31.

Antonio's comments 24/07/08

First of all, thanks to the Mirex team for their effort to make it possible again. Just some comments about the evaluation this year.

As the first multiple F0 contest took place last year, it was very welcome and many researchers submitted their algorithms, so the results in Mirex07 provided a very valuable resource for comparing different approaches. However, participation will probably not be as large this year, so if the database used for evaluation is not the same as last year's, it would be nice to report (if possible) both the results obtained using the Mirex07 database and the results with the Mirex08 database, to directly compare the new approaches with the algorithms presented last year.

Emmanuel's comments 25/07/08

As last year, we will participate in tasks 1 and 2 only.

For task 2, I understand that the proposed annotation method will take reverberation into account, so that for instance the offset of one note will happen after the onset of the following note in a legato context. Is that true? Computing the amplitudes of the notes is not trivial in the presence of overlapping partials, so I wonder if Mert could tell us a bit more.

Results could also be evaluated with the onset-only metric used last year.

Mert's comments

Thanks for the comments. Antonio, the new dataset will be the previous year's plus some more, so last year's labs can compare their new methods. Emmanuel, the current ground truth is annotated in a non-overlapping way, so within a source, the offset of the previous note cannot happen after the onset of the current one. The offset range in the evaluation criteria should be enough not to cause a false negative because of the reverberation. However, we can come up with a better criterion for evaluating with the offsets.

Antonio's comments 30/07/08

We will participate in tasks 1 and 2 too. Mert, any news about the deadline to send the algorithms?

Jean-Louis's comments 30/07/08

We will probably participate in task 1. One question about the development set: I know the ground truth is annotated at frame level, with a hop size of 10 ms between the windows. However, could we know the size of the windows and the weighting window that were used?

I would also like to know whether the first window starts at 0 s or is centered around 0 s. The ISMIR 2004 database had time stamps giving the center of each window. I wanted to check whether the annotation protocol was the same or not.

Mert's comments 05/08/08

The deadline will be Friday, August 22nd. It will be announced on the lists soon. The frame-level ground truth uses a 10 ms skip rate with a 46 ms Hanning window. The first window is centered at 23 ms, the second at 33 ms, and so on. If this is a problem for the community, let me know; I can readjust (or reinterpolate) the frame centers to match 10 ms, 20 ms, ...
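The frame timing described in this comment can be written down as a small sketch. The function names are hypothetical; the 23 ms first center and 10 ms hop come from the comment, and the nearest-frame lookup is just one assumed way of aligning an arbitrary time stamp to this grid.

```python
def frame_centers(n_frames, first_center=0.023, hop=0.010):
    """Centers (in seconds) of the ground-truth analysis windows: 23 ms, 33 ms, ..."""
    return [round(first_center + k * hop, 6) for k in range(n_frames)]

def nearest_frame(t, first_center=0.023, hop=0.010):
    """Index of the ground-truth frame whose center is closest to time t (seconds)."""
    return max(0, round((t - first_center) / hop))
```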

Matti's comments 06/08/08

Hi all, I'm glad to see a re-run of this task and also potential new teams for this year. Should we fix the submission format so that people could start to prepare their submissions (I guess that the format is the same as last year)?

Mert's Comments

Let's use the same I/O formats as last year. The deadline for this task will be August 22nd.

Gustavo's Comments

Greetings! I would like to know what is the submission format for task 3. Thanks in advance.

Jean-Louis's Comments

Hi everyone,
My concerns are for the evaluation of task 1: I just noticed that the definition of the accuracy on the wiki page (2007:Multiple_Fundamental_Frequency_Estimation_&_Tracking_Results. I guess the latter is right (seems to make more sense too), i.e. Acc = TP/(TP + FP + FN).
I'd also like to warn people about a mistake I have been making when computing the Precision and Recall measures: I have been calculating them following the usual TP/(TP+FP) and TP/(TP+FN), but I think for this task it's a little bit more tricky... Especially for the recall, for which this way of computing does not take into account some of the FP that actually are substitutions (and not additional positives, where there should have been a 0, say)... Seeing the way it was computed for the audio melody extraction task a few years ago, I don't think my mistake was made during the 2007 evaluation. I would say the aforementioned accuracy does not suffer from this "mistake", such that the formula can be used as stated (even if the FP in it is somewhat ambiguous). Am I right?
I was wondering: would it be possible to know more precisely how the criteria are computed? Especially, what do you count as FP? I know in audio melody extraction (for which the groundtruth is "monophonic", such that the problems are not exactly the same either), the distinction was made between incorrect pitch (say "IP"): the frame was pitched but the estimated pitch was incorrect, and false positive (the FP): the frame is unpitched, but a pitch was given (instead of 0, for instance). All in all, I mean that distinguishing between substitutions and additions seems to be necessary to obtain relevant measures. I guess everyone will agree about that, since the metrics by Ellis and Poliner were already taking into account this fact... Well, anyway, I am interested to know how you guys at MIREX are doing this! That would help me to "tune" my stuff the right way ! :D
Concerning the output format, is it possible to put 0s instead of nothing for each frame? The system we are working on, based on source separation, outputs a fixed number of pitches (sources) per frame, giving 0 if a source is considered silent. Will it be taken into account in the evaluation or are all the 0s in our output going to count as "FPs"? :)

Mert's Comments

Hi Jean-Louis, Accuracy was calculated as Acc=TP/(TP+FP+FN). There was a typo in the page. It is fixed now. The evaluation is something we should discuss more this year. We can evaluate with many different criteria. As I look at my scripts, the recall is calculated as TP/Nref, where Nref is the number of nonzero elements in the ground truth vector. FP was calculated for each frame as the difference between the number of nonzero elements in the detected F0 vector and the number of nonzero elements in the intersection of the detected F0 vector with the ground truth F0 vector. These were then summed across all frames.

BTW, you can put 0s, it is no problem.
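The bookkeeping described in this reply might look like the sketch below. It is not Mert's script. It assumes that, for each frame, three counts are already available: nonzero detected F0s, nonzero ground-truth F0s, and detected F0s in the intersection with the ground truth. Taking FN as Nref minus TP is also an assumption, since the reply does not define FN.

```python
def frame_scores(per_frame_counts):
    """per_frame_counts: iterable of (n_detected, n_reference, n_intersection) per frame."""
    counts = list(per_frame_counts)
    tp = sum(inter for _, _, inter in counts)
    fp = sum(det - inter for det, _, inter in counts)  # detected but not in the intersection
    n_ref = sum(ref for _, ref, _ in counts)           # Nref: nonzero ground-truth entries
    fn = n_ref - tp                                    # assumption: FN = Nref - TP
    denom = tp + fp + fn
    return {
        "recall": tp / n_ref if n_ref else 0.0,        # TP / Nref
        "accuracy": tp / denom if denom else 0.0,      # TP / (TP + FP + FN)
    }
```

For example, two frames with counts `(3, 3, 3)` and `(4, 3, 3)` give TP = 6, FP = 1 and FN = 0, so the recall is 1 and the accuracy is 6/7.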

Gustavo's Comments

Hi everyone,

I just noticed that on the wiki page (2007:Multiple_Fundamental_Frequency_Estimation_&_Tracking_Results) as well as on this wiki page it is stated that: "returned note's onset is within a 50ms range (±25ms) of the onset of the ground truth note".

Which one is correct: +-25ms or +-50ms?

Potential Participants

If you might consider participating, please add your name and email address here and also please sign up for the Multi-F0 mail list: Multi-F0 Estimation Tracking email list

1. Gustavo Reis (Polytechnic Institute of Leiria, Portugal) and Francisco Fernandez (University of Extremadura, Spain) and Anibal Ferreira (University of Porto, Portugal) (gustavo.reis (at) estg.ipleiria.pt, fcofdez (at) unex.es, ajf (at) fe.up.pt)
2. Antonio Pertusa and José M. Iñesta (University of Alicante, Spain) (pertusa@ua.es, inesta@dlsi.ua.es)
3. Pablo Cancela (pcancela@gmail.com)
4. Emmanuel Vincent (emmanuel.vincent (at) irisa_fr) and Nancy Bertin (nancy.bertin (at) enst_fr)
5. Jean-Louis Durrieu (durrieu AT enst DOT fr) (task 1)
6. Matti Ryynänen and Anssi Klapuri (Tampere University of Technology) (matti.ryynanen (at) tut.fi, anssi.klapuri (at) tut.fi)
7. Koji Egashira (University of Tokyo, Japan) (egashira (at) hil.t.u-tokyo.ac.jp)
8. Ruohua Zhou and Josh Reiss (Queen Mary University of London) ( zhou.ruohua, Josh.Reiss@elec.qmul.ac.uk)
9. Chunghsin Yeh, Axel Roebel (IRCAM) (cyeh, roebel (at) ircam dot fr) and Wei-Chen Chang (wcchang (at) gmail dot com)
10. Valentin Emiya (TELECOM ParisTech - ENST) (valentin.emiya (at) enst_fr)
11. Chuan Cao and Ming Li (ThinkIT Lab., IOA), ccao <at> hccl.ioa.ac.cn, mli <at> hccl.ioa.ac.cn
12. Michael Groble (mg2467@columbia.edu)