Difference between revisions of "2009:Audio Melody Extraction"

From MIREX Wiki
Latest revision as of 13:33, 13 May 2010

Description

The aim of the MIREX audio melody extraction evaluation is to identify the melody pitch contour from polyphonic musical audio.

The task consists of two parts:

  • Voicing detection (deciding whether a particular time frame contains a "melody pitch" or not),
  • Pitch detection (deciding the most likely melody pitch for each time frame).

We structure the submission to allow these parts to be done independently, i.e. it is possible (via a negative pitch value) to guess a pitch even for frames that are judged unvoiced. Algorithms which do not discriminate between melodic and non-melodic parts are also welcome!
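
To make the voicing convention above concrete, here is a minimal sketch in Python of writing an output file under that convention; the function name and the example values are hypothetical and not part of any actual submission.

<pre>
# Minimal sketch of the output convention described above (hypothetical values).
# Voiced frames report a positive frequency in Hz; frames judged unvoiced may
# still report a pitch guess as a negative value, or 0 Hz if no guess is made.

def write_melody_output(path, times, f0_guesses, voicing_flags):
    """times: frame centres in seconds; f0_guesses: pitch guesses in Hz
    (0.0 when there is no guess); voicing_flags: True if the frame is voiced."""
    with open(path, "w") as f:
        for t, f0, voiced in zip(times, f0_guesses, voicing_flags):
            value = f0 if voiced else (-f0 if f0 > 0 else 0.0)
            f.write("%.3f\t%.3f\n" % (t, value))

# Example: three 10 ms frames, the middle one judged unvoiced but with a guess.
write_melody_output("example_output.txt",
                    [0.00, 0.01, 0.02], [220.0, 225.1, 230.4],
                    [True, False, True])
</pre>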

Discussions for 2009

Your comments here.

New evaluations for 2009?

We would like to know if there would be potential participants for this year's evaluation on Audio Melody Extraction.

There was also interest last year in evaluating the results at the note level (and not frame by frame), following the multipitch evaluation. However, it has not been done, probably because of a lack of both participants and data. Would there be more people this year?

cheers, Jean-Louis, 9th July 2009

Chao-Ling's Comments 14/07/2009

Hi everyone. I would like to suggest that we have a separate evaluation on the songs where the main melody is carried by the human singing voice as opposed to other musical instruments (as in Vishu's comment in MIREX 2008). We proposed a pitch extraction approach for singing voices, and it is not likely to perform well for other instruments.

In addition, we have prepared a dataset called MIR-1K and would like to add it as part of the training/evaluation dataset. It contains 1000 song clips recorded at 16 kHz sample rate with 16-bit resolution. The duration of each clip ranges from 4 to 13 seconds, and the total length of the dataset is 133 minutes. These clips were extracted from 110 karaoke songs which contain a mixed track and a music accompaniment track.

Vishu's Comments 20/07/2009

Hi Everyone.

Chao-Ling, your dataset sounds exciting. I think the community would benefit greatly from the addition of such a large database. Last year we too had contributed some data (Indian classical music). Here are some points with respect to your current dataset to make it conform to previous data formats.

  • We need to have mixed (voice + accompaniment) mono tracks, so if you would like to mix it yourself (based on some SNR considerations), or if you think just giving equal weight to the left and right channels of your files is acceptable, let us know which.
  • I see that your ground-truth (.pv) files are in semitones. Since all the previous reference pitch values are in Hz, either you could convert them to Hz and pass them on, or pass on the exact conversion formula (Hz to semitones) you have used to the evaluators (IMIRSEL).
  • Finally, the previous ground-truth pitch files were in two-column format (TimeStamp_sec PitchValue_Hz) available every 10 ms. Your files are single-column. Please let us know exactly at which time instant your first window is centered, and the conversion can be done accordingly.

Thanks again for your efforts.

Chao-Ling's Comments 21/07/2009

Hi Vishu and everybody! Thank you for your suggestions. My responses are as follows:

  • The left and right channels were adjusted to have equal weight. I prefer to provide both channels because they are good for evaluating algorithms at different SNR settings.
  • I will provide pitch ground-truth in Hz.
  • The first window is centered at 20 ms (the window size is 40 ms and the overlap is 20 ms). Note that the last window is discarded if its duration is less than 40 ms, so some .pv files might have one less point.

I will provide new .pv files that have both the ground truth in Hz and a time-stamp column. I will also make the dataset smaller by discarding the parts that are unrelated to this task.
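
For concreteness, here is a minimal sketch of the conversion discussed above. It assumes the .pv values are MIDI-style semitone numbers (A4 = 69 = 440 Hz) with 0 for unvoiced frames, and that frame centres start at 20 ms with a 20 ms hop as described; the semitone numbering and the file names are assumptions, so the exact formula Chao-Ling provides should be used instead.

<pre>
# Sketch: turn a single-column .pv file (semitone values, 0 = unvoiced) into a
# two-column file with time stamps (s) and frequencies (Hz).
# Assumptions: MIDI-style semitone numbers (A4 = 69 = 440 Hz), first frame
# centred at 20 ms, 20 ms hop, as described above.

def semitone_to_hz(m):
    return 0.0 if m <= 0 else 440.0 * 2.0 ** ((m - 69.0) / 12.0)

def convert_pv(pv_path, out_path, first_centre=0.02, hop=0.02):
    with open(pv_path) as fin, open(out_path, "w") as fout:
        for i, line in enumerate(fin):
            t = first_centre + i * hop
            fout.write("%.3f\t%.3f\n" % (t, semitone_to_hz(float(line))))

# convert_pv("some_clip.pv", "some_clip_hz.txt")   # hypothetical file names
</pre>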

Chao-Ling's Comments 22/07/2009

Okay guys, here are the new files of the dataset: MIR-1K for MIREX (http://140.114.88.80/dataset/Dataset_for_MIREX.rar). Please feel free to let me know if there are problems.

About MIR-1K - Jean-Louis, 26/07/2009

Hi all!

Such a big database is indeed good news for the relevance of the evaluation. However, in the spirit of MIREX (if any), it might have been good to keep some part of it "hidden" from the participants, so as to perform the evaluation on a test database on which no one could have tuned their algorithms.

Do you, by any chance, have another 1000 songs that we could use for that purpose? :p Well, otherwise, that still makes it a good database for evaluation and comparison.

By the way, I have trouble checking the above mentioned rar archive file: on my ubuntu, it says the archive type is not supported. Any idea?

One last thing: is anyone interested in evaluating note-wise transcription (as in the multi-f0 evaluation task)? If so, is there any annotation of that type for MIR-1K?

Chao-Ling's Comments 27/07/2009

Hi Jean-Louis and everyone!

Unfortunately, I don't have another dataset. Even if I did, it would not be "hidden" from me :S.

The rar file can be extracted on Ubuntu with this program: WinRAR for Linux (http://www.rarlab.com/download.htm). However, MIR-1K does not contain annotations for evaluating note-wise transcription.

Andreas Ehmann's Comments 12/08/2009

Hi guys! We are quietly ramping up for this year's MIREX. The dataset is quite exciting. Although it's not 'hidden', I think it's more than usable; ADC04 isn't quite withheld either. I guess my main concern from a logistics point of view is that it is pretty big! Some of the melody algorithms are on the slow side, so crunching through that many minutes of audio might tie up our machines quite a bit. So I guess our options are to try to make sure the algorithms are fast enough, or to choose a subset of the 1000 clips to evaluate against.

Cheers! -Andreas

Jean-Louis' Comments 17/08/2009

Hi everyone. It feels like Andreas' comment on the speed of some algorithms was sort of referring to Pablo's and my programs from last year. I can't, however, guarantee that this year's algorithms will be any faster...

I guess working on a subset of the database is a good option. Maybe a few hundred snippets from it. Chao-Ling: is the database homogeneous, such that one can randomly grab any excerpt and get a representative dataset, or is there a smart way of choosing these excerpts?

Andreas Ehmann's Comments 18/08/2009

Hey gang,

I think I am going to sample-rate convert the MIR-1K dataset to 44.1 kHz (from the 16 kHz it is now). Naturally we will have dead space in the spectrum, but I am already envisioning systems having hard-coded (in samples) frame lengths and hops. Sound reasonable? That, or everyone has to ensure they are robust to multiple sample rates. It's easy to do on my end though, and that way everything will be 44.1 kHz.
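
For reference, here is a minimal sketch of one way to do such a conversion, assuming the scipy and soundfile Python packages are available; this is not the official IMIRSEL procedure, just an illustration. Since 44100/16000 = 441/160, polyphase resampling with these integer factors is exact.

<pre>
# Sketch: upsample a 16 kHz WAV to 44.1 kHz (44100/16000 = 441/160).
import soundfile as sf
from scipy.signal import resample_poly

def to_44k1(in_wav, out_wav):
    x, sr = sf.read(in_wav)                  # works for mono or stereo (time on axis 0)
    assert sr == 16000, "expected 16 kHz input"
    y = resample_poly(x, 441, 160, axis=0)   # low-pass filtered; no content appears above 8 kHz
    sf.write(out_wav, y, 44100, subtype="PCM_16")

# to_44k1("clip_16k.wav", "clip_44k1.wav")   # hypothetical file names
</pre>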

Vishu's comments 19/08/2009

Hi all.

Andreas, with respect to our specific entry/entries, they will be sample-rate independent, so 16 kHz or 44.1 kHz doesn't really matter. In the interest of data homogeneity, however, maybe 44.1 kHz is preferable.

I have a question regarding deadlines. Are we following the Sept. 8 deadline, as posted on the MIREX 2009 homepage, or do you think we could push it back a bit? I ask this because we intend to submit two algorithms this year and the second one may not be ready by Sept 8.

Chao-Ling's Comments 02/09/2009

Hi Jean-Louis and all,

Sorry for my late reply. I worked very hard to build a "hidden" dataset for this task. The length of the dataset is around 167 minutes (374 clips of 20-40 seconds each). The dataset was built in the same way as MIR-1K, with the same format (16 kHz, 16 bits). I would like to know how we will evaluate our algorithms with this dataset, and how I should provide it to the committee.

Vishu's comments 03/09/2009

Chao-Ling: Last year we had contributed an Indian classical music dataset. I corresponded with Mert Bay (mertbay@gmail.com) at that time.

Chao-Ling's Comments 03/09/2009

Thanks Vishu. Andreas Ehmann will set up a dropbox account for me to upload the dataset. Also, I would like to know what SNR value should be used to mix the singing voice and accompaniment for the evaluation. Any suggestions?

Vishu's comments 03/09/2009

Chao-Ling: From my point of view, it would be useful to divide the dataset into two halves: one with an audibly acceptable SNR (between 5 and 10 dB) and the other with a lower, and therefore tougher, SNR (0 dB). Of course, a lot also depends on the nature of the accompaniment, e.g. a lower SNR with simple (solo) accompaniment, like a single flute, may not provide as much of a challenge as a relatively higher SNR with more complex accompaniment, like rock music. Note that I use the terms 'simple' and 'complex' purely from the point of view of signal complexity. Since you are most familiar with the data, you could divide it as you see fit.
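
To make the SNR discussion concrete, here is a minimal sketch of mixing a voice track and an accompaniment track at a target SNR, treating the accompaniment as the "noise" term; the function is hypothetical and is not how the MIR-1K mixes were actually produced.

<pre>
# Sketch: mix voice and accompaniment at a target SNR (in dB), with the
# accompaniment playing the role of the "noise".
import numpy as np

def mix_at_snr(voice, accomp, snr_db):
    """voice, accomp: 1-D float arrays of equal length."""
    p_voice = np.mean(voice ** 2)
    p_accomp = np.mean(accomp ** 2)
    # Scale the accompaniment so that 10*log10(p_voice / p_scaled) == snr_db.
    gain = np.sqrt(p_voice / (p_accomp * 10.0 ** (snr_db / 10.0)))
    mix = voice + gain * accomp
    return mix / max(1.0, np.max(np.abs(mix)))   # avoid clipping in 16-bit output

# mix_0db = mix_at_snr(voice, accomp, 0.0)   # the "tougher" condition
# mix_5db = mix_at_snr(voice, accomp, 5.0)   # an "audibly acceptable" condition
</pre>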

Morten's comments 05/09/2009

Chao-Ling and Vishu: I think it would be preferable if a part of the dataset were mixed both at an "audibly acceptable" SNR and at a "tougher" SNR. It would be difficult to conclude anything if the two mixing levels were used on two different datasets - would the results then depend on the mixing level or on the dataset?

Jean-Louis' comments 06/09/2009

Hi everyone,

I agree with Morten. If we want to compare the results with varying mixing level, we should evaluate the algorithms on the same songs, with different "SNR"s (if one can indeed consider the background music as "noise" :-) ).

By the way, any possible deadline extension? It is going to be a bit tight for us, I'm afraid. But we try our best, because we might also provide the slowest programs. We should probably not delay that too much :)

Oh, I was also thinking: in the article by Poliner et al., "Melody Transcription From Music Audio: Approaches and Evaluation" (IEEE Transactions on Audio, Speech & Language Processing 15(4): 1247-1256, 2007), they managed to gather information concerning the error distribution of the algorithms, i.e. the histogram of differences (in notes on the Western musical scale) between the estimated pitches and the ground truth. Would you think, at IMIRSEL, that it would be possible to provide that sort of statistics after the evaluation?
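
As an illustration of the statistics JL mentions, here is a minimal sketch of computing such a histogram, assuming estimates and references are already in Hz on a common time grid; this is not IMIRSEL's evaluation code.

<pre>
# Sketch: histogram of estimate-vs-reference differences in semitones,
# over frames where both the reference and the estimate are voiced.
import numpy as np

def semitone_error_histogram(est_hz, ref_hz):
    est = np.asarray(est_hz, dtype=float)
    ref = np.asarray(ref_hz, dtype=float)
    both_voiced = (ref > 0) & (est > 0)
    diff = 12.0 * np.log2(est[both_voiced] / ref[both_voiced])
    bins = np.arange(-24.5, 25.5, 1.0)       # one-semitone bins from -24 to +24
    counts, edges = np.histogram(diff, bins=bins)
    return counts, edges

# counts, edges = semitone_error_histogram(my_estimates_hz, reference_hz)
# Octave errors show up as peaks around +/-12 semitones.
</pre>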

Also, I'd like some practical information about the databases: for the former databases (ADC04 and MIREX05), the fundamental frequency range is around 60 Hz to 1000 Hz. Note that there were some errors in my version of the annotations for MIREX05 (the minimum value for f0 was something like 8 Hz). What about MIREX08 and MIR-1K? I believe MIR-1K goes from 80 to 700 Hz. Should we expect something else from the newly contributed data?

Lastly, we'd like to thank Chao-Ling for his efforts in building this hidden dataset! Can you give some details about it? Are they the same singers? The same songs?

regards,

jl

Chao-Ling's Comments 07/09/2009

Hi all,

I agree with Morten and Jean-Louis about the evaluation at different SNRs. By the way, showing the histogram of differences is a good idea; it would let us know more about the performance of each algorithm. About the new dataset, I am thinking about releasing it after the evaluation. That would be good for the researchers in this field, but might be disadvantageous for MIREX in the future unless someone can provide a new dataset. Any suggestions?

Here are some details about the new dataset: Singers: 15 people (6 females, 9 males); the amount of material from female and male singers is more or less the same. Only one of the female singers is included in MIR-1K, and she contributed about 2% of the new dataset. (If you guys prefer a completely singer-independent dataset, we can remove this part.)

Songs: 167 minutes in length, comprising 374 clips from 102 songs, with 20 to 40 seconds in each clip. 15 of the songs also appear in MIR-1K.

Fundamental frequency range: 46 Hz to 784 Hz; 99.95% of the values are in the range of 80 Hz to 640 Hz.


Vishu's Comments 07/09/2009

Hi all,

Different SNRs on the same dataset definitely would be the way to go - Thanks Morten and JL.

Details of the MIREX08 dataset: The data consists of four 1-min long clips from different segments of 2 north Indian classical vocal performances (1 male and 1 female performer), so that makes a total of eight minutes of audio. Typically a single performance lasts for 20-25 min., with singing and instrumental playing style variations over the course of the performance. Musical sound sources present in the recordings:

  • Singing voice
  • Perpetually present drone. This is an Indian instrument called the tanpura.
  • Secondary melodic instrument: an instrument called the harmonium, similar to the accordion.
  • Tonal percussion: a pair of Indian drums called tablas, capable of producing pitched sounds.

So there may be a maximum of four pitches and a minimum of one pitch present at any given time. F0 range is from 100 to 600 Hz.

Finally, taking a cue from JL and encouraged by Stephen Downie's most recent email, I propose we extend the deadline to 15th September. Is that acceptable?

Jean-Louis' Comments 08/09/2009

hi everyone!

It would definitely be hard for us to provide our systems today... As soon as we have something, we will upload it. Could we push September 15th as a new deadline? Which date would be a hard deadline after which you could not (at IMIRSEL) process the programs anymore?

Concerning the task itself, and especially the new dataset: after listening to some of the excerpts and after some tests, it feels like this data is suited for "singing voice f0 estimation", especially with the provided "SNR" (at 0 dB). For some reason, it is difficult, even for a listener, to focus on the singer. Maybe the general balance of the singing voice does not depend only on the "SNR" itself? Of course, if you design your algorithm to specifically estimate the singing voice (with a classification scheme at some stage), then you might perform better on this database. There are some excerpts in MIR-1K where, for instance, the singing voice is backed by a flute one or two octaves higher. The balance between these two contributions is quite even, and one could question whether the voice is really intended to be the "lead melody instrument". I guess one could say that there is an indeterminacy there... which is resolved only if you consider that the task is to transcribe the _vocal_ part.

Chao-Ling, could you confirm whether our understanding of this new dataset is correct? What were the purpose and the assumptions that led to the development of your database? I think it is important to understand these well, especially when analyzing the results of our algorithms on such a database.

By the way, I guess for the previous datasets we also had a few examples for which the concept of "main melody" was not quite well defined... Historically, it seems that ADC04 allowed a rather wide definition of "leading melody" (including other instruments, such as the saxophone, as the leading instrument), while the following evaluations focused more on popular songs, expecting the algorithms to extract the singing voice instead (without explicit notice, though).

I guess for forthcoming evaluations at MIREX, "Audio Melody Extraction" should be rephrased, or at least the assumptions on the melody to be extracted should be made clearer. That may need some brainstorming, so I hope we can talk about it in Kobe, together :-D

For the time being, let us have the evaluation done the old way. I guess there is not enough time left to change everything right now!

Cheers,

Jean-Louis

P.S.: I added descriptions of the new dataset and MIREX08 dataset in the "dataset" and "relevant development set" sections. I renamed the latter, from "test" to "development", since this term is better suited for the content of the section... Please feel free to correct or extend this section!

Chao-Ling's Comments 08/09/2009

Hi JL and everyone,

We developed the dataset according to the article by Poliner et al., "Melody Transcription From Music Audio: Approaches and Evaluation" (IEEE Transactions on Audio, Speech & Language Processing 15(4): 1247-1256, 2007), which defines the melody as what the lead vocal carries. However, I agree with JL that the definition is not so clear in some cases. Maybe "singing voice f0 estimation" is more appropriate. This is also the reason why I proposed a subtask evaluation for vocal songs earlier. Just as JL said, we should talk about it here and in Kobe together :).


Sihyun-Joo's Comments 08/09/2009

Hi, everyone.

It's already the due date. Have you all finished your work? I just wanted to leave a message to ask about the due date. Actually, I would like to postpone the due date by a week or a couple of days, just like JL. I sent an e-mail to request a new deadline, but I haven't got a reply yet. However, as I mentioned before, it's already the 8th of September. So, if anybody knows whether it is possible to change the deadline, please let me know by a reply comment or e-mail (redj4620@kaist.ac.kr).

Vishu's Comments 11/09/2009

Hi all

Chao-Ling and Andreas: I have one (maybe redundant) concern about the MIR-1K dataset. I noticed that in the ground-truth (.pv) files, the first time stamp is at 0.01 sec. For the ground-truth files of the previous datasets, the first time stamp was always at 0.0 sec. I hope this does not affect the evaluation code, since even a one-frame offset in the comparison can lead to significantly different accuracies. Or maybe a dummy pitch value could be inserted at 0.0 sec in the MIR-1K ground-truth files.
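
One way to sidestep such offsets is to resample both the reference and the estimate onto a common 10 ms grid by nearest-neighbour lookup before scoring. The sketch below only illustrates that idea and is not the actual evaluation code.

<pre>
# Sketch: put a (time, frequency) series onto a common 10 ms grid by
# nearest-neighbour lookup, so that a reference starting at 0.01 s and an
# estimate starting at 0.0 s can still be compared frame by frame.
import numpy as np

def to_grid(times, freqs, hop=0.01, duration=None):
    times = np.asarray(times, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    if duration is None:
        duration = times[-1]
    grid = np.arange(0.0, duration + 1e-9, hop)
    nxt = np.clip(np.searchsorted(times, grid), 0, len(times) - 1)
    prv = np.clip(nxt - 1, 0, len(times) - 1)
    nearest = np.where(np.abs(grid - times[prv]) <= np.abs(times[nxt] - grid), prv, nxt)
    return grid, freqs[nearest]

# grid, ref_on_grid = to_grid(ref_times, ref_hz)
# _,    est_on_grid = to_grid(est_times, est_hz, duration=grid[-1])
</pre>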

Dataset

  • MIR-1K database (http://unvoicedsoundseparation.googlepages.com/mir-1k): dataset for MIREX (http://140.114.88.80/dataset/Dataset_for_MIREX.rar), karaoke recordings of Chinese songs. Instruments: singing voice (male, female), synthetic accompaniment.
  • MIREX08 database : 4 excerpts of 1 min. from "north Indian classical vocal performances", instruments: singing voice (male, female), tanpura (Indian instrument, perpetual background drone), harmonium (secondary melodic instrument) and tablas (pitched percussions).
  • MIREX05 database : 25 phrase excerpts of 10-40 sec from the following genres: Rock, R&B, Pop, Jazz, Solo classical piano.
  • ISMIR04 database : 20 excerpts of about 20s each.
  • CD-quality (PCM, 16-bit, 44100 Hz)
  • single channel (mono)
  • manually annotated reference data (10 ms time grid)

Output Format

  • In order to allow for generalization among potential approaches (e.g. frame size, hop size, etc.), submitted algorithms should output pitch estimates, in Hz, at discrete instants in time.
  • The output file thus contains, per line, the time stamp [space or tab] the corresponding frequency value [new line].
  • The time grid of the reference file is 10 ms, yet the submission may use a different time grid as output (for example 5.8 ms).
  • Instants which are identified as unvoiced (there is no dominant melody) can be reported either as 0 Hz or as a negative pitch value. If negative pitch values are given, the statistics for Raw Pitch Accuracy and Raw Chroma Accuracy may be improved; see the sketch after this list.
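
To illustrate how the conventions above feed into the scores, here is a minimal sketch of frame-wise Raw Pitch Accuracy and Raw Chroma Accuracy with a half-semitone tolerance, assuming the estimate and the reference are already aligned on the same time grid; this is an illustration only, not IMIRSEL's evaluation code, and the tolerance value is an assumption.

<pre>
# Sketch: frame-wise melody metrics under the conventions above.
# ref_hz: reference pitch per frame (0 = unvoiced).
# est_hz: estimated pitch per frame (0 = unvoiced with no guess;
#         negative = judged unvoiced, but abs(value) is still a pitch guess).
import numpy as np

def raw_pitch_and_chroma_accuracy(est_hz, ref_hz, tol_semitones=0.5):
    est = np.asarray(est_hz, dtype=float)
    ref = np.asarray(ref_hz, dtype=float)
    ref_voiced = ref > 0
    guess = np.abs(est)                      # negative values still count as guesses
    valid = ref_voiced & (guess > 0)
    diff = np.zeros_like(ref)
    diff[valid] = 12.0 * np.abs(np.log2(guess[valid] / ref[valid]))
    correct_pitch = valid & (diff <= tol_semitones)
    # Chroma: fold the difference to the nearest octave before applying the tolerance.
    chroma_diff = np.abs(diff - 12.0 * np.round(diff / 12.0))
    correct_chroma = valid & (chroma_diff <= tol_semitones)
    n_ref_voiced = max(1, int(ref_voiced.sum()))
    return correct_pitch.sum() / n_ref_voiced, correct_chroma.sum() / n_ref_voiced
</pre>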

Relevant Development Collections

  • MIR-1K (http://unvoicedsoundseparation.googlepages.com/mir-1k): MIREX 09 dataset (http://140.114.88.80/dataset/Dataset_for_MIREX.rar).
  • For the ISMIR 2004 Audio Description Contest, the Music Technology Group of Pompeu Fabra University assembled a diverse set of audio segments and corresponding melody transcriptions, including audio excerpts from such genres as Rock, R&B, Pop, Jazz, Opera, and MIDI (full test set with the reference transcriptions, 28.6 MB).
  • Graham's collection: you can find the test set and further explanations on the pages http://www.ee.columbia.edu/~graham/mirex_melody/ and http://labrosa.ee.columbia.edu/projects/melody/

Potential Participants

  • Vishweshwara Rao & Preeti Rao (Indian Institute of Technology Bombay, India)
  • Jean-Louis Durrieu, Gaël Richard and Bertrand David (Institut Télécom, Télécom ParisTech, CNRS LTCI, Paris, France): 2 systems submitted (2009/09/19)
  • Chao-Ling Leon Hsu, Jyh-Shing Roger Jang, and Liang-Yu Davidson Chen (Department of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan)
  • Morten Wendelboe (Institute of Computer Science, Copenhagen University, Denmark)
  • Sihyun Joo & Seokhwan Jo & Chang D. Yoo (Korea Advanced Institute of Science and Technology, Daejeon, Korea)
  • Pablo Cancela (pcancela@gmail.com) Montevideo, Uruguay
  • Hideyuki Tachibana, Takuma Ono, Nobutaka Ono, and Shigeki Sagayama (The University of Tokyo, Japan)

Results

2009:Audio_Melody_Extraction_Results