Difference between revisions of "2005:Symbolic Melodic"

From MIREX Wiki
(Training data and ground truth available)
Line 21: Line 21:
  
 
- Rainer
 
- Rainer
 
==Proposer==
 
 
Rainer Typke (Universiteit Utrecht) rtypke@cs.uu.nl
 
 
 
==Title==
 
 
Retrieving melodically similar incipits from 500,000 notated real world compositions
 
 
==Description==
 
 
Retrieve the most similar incipits from the RISM A/II collection, given one of the incipits as a query, and rank them by similarity.
 
 
Expected results: For 11 queries, a ground truth has been established (see Typke et al. 2005, http://teuge.labs.cs.uu.nl/Ruu/orpheus/groundtruth/TR_groundtruth.pdf ). This ground truth makes it possible to measure the search results in an absolute way, taking into consideration not only relevance, but also the correct ranking.
 
This ground truth has been obtained by combining ranked lists that were created by 35 music experts. The resulting ground truth has the form of ranked groups of incipits; the groups contain incipits whose differences in rankings were not statistically significant, but the ranking of the groups is statistically significant. MIR systems that perform well should return incipits such that the members of the groups are retrieved without violating the order of the groups.
 
By using this ground truth, an absolute measure of quality is attainable. However, there is a danger of overfitting algorithms to just the 11 queries for which a ground truth is known. Therefore, there should also be an evaluation based on another set of queries that are chosen only after the algorithms have been submitted. This second part of the evaluation would only allow a comparison, no absolute measurement, but any dependance of the algorithms on the queries could be avoided.
 
 
==Participants==
 
* Alexandra Uitdenbogerd and Iman Suyoto (RMIT), sandrau@rmit.edu.au, s3009028@student.rmit.edu.au
 
* Kjell Lemstr├╢m (University of Helsinki), klemstro@cs.helsinki.fi
 
* Maarten Grachten, Josep Lluis Arcos and Ramón López de Mántaras (IIIA-CSIC), maarten@iiia.csic.es
 
* Nicola Orio (University of Padova), orio@dei.unipd.it
 
* Klaus Frieler and Daniel M├╝llensiefen (University of Hamburg), kgf@omniversum.de
 
* Rainer Typke, Frans Wiering and Remco C. Veltkamp (University of Utrecht), rainer.typke@cs.uu.nl, frans.wiering@cs.uu.nl, remcov@cs.uu.nl
 
 
==Other Potential Participants==
 
 
Names of people who have expressed interest are underlined. Names of people who have been contacted, but haven't expressed an opinion yet, are marked with a copyright symbol (©).
 
 
<u>Tuomas Eerola</u> (ptee@cc.jyu.fi)/Petri Toiviainen  (Jyv├ñskyl├ñ, Finland) -- high
 
 
<u>J├╝rgen Kilian</u> (kilian@noteserver.org)/Holger Hoos      (Darmstadt/British Columbia, Canada) -- high
 
 
Shyamala Doraisamy/Stefan Rueger &copy;    (Malaysia/London) -- unknown
 
 
Anna Pienimäki (Helsinki, Finland) -- not going to participate
 
 
Craig Sapp/Yi Wen Liu/<u>Eleanor Selfridge-Field</u>  (Stanford, US) -- medium
 
 
Anna Lubiw &copy;/Luke Tanur &copy;  (Waterloo, Canada) -- unknown
 
 
Michael Clausen &copy; (Bonn) -- unknown
 
 
<u>Ludger Hofmann-Engl</u> (hofmann-engl@chameleongroup.org.uk) -- high
 
 
<u>Jeremy Pickens</u> (jeremy@dcs.kcl.ac.uk), Tim Crawford (t.crawford@gold.ac.uk) -- high
 
 
The other groups have developed algorithms that would be interesting to compare, so their likelihood of entering is not too low.
 
 
==Evaluation Procedures==
 
 
===1) Absolute measurement using the ground truth===
 
For each of the 11 queries from the ground truth, we have a certain number of groups of incipits, where we know the correct ranking of the groups. Let us assume that the total number of incipits in these groups is N. For each ranking returned by a method to be evaluated, take the top N incipits. At every border between groups, calculate the precision. To get one single measure, one can now integrate over this precision curve and divide the result by N. This measure will be a number between 0 and 1, where 0 means total failure and 1 means complete agreement with the ground truth. This measure will reflect not only the ability of algorithms to find as many relevant incipits as possible, but also to order them correctly.
 
 
For example, let us assume that the ground truth has delivered the following three groups, with N=6:
 
 
(a,b),(c,d,e),(f).
 
 
Method X returns:
 
 
b,a,  g,c,d,  e.
 
 
Since there are 3 groups, we determine precision and recall at three points:
 
 
after position 2:  precision 1 (because both relevant documents a and b have been found. The correct order of a and b is not known since they are in the same group, therefore no penalty should be applied for them being in the opposite order).
 
 
after position 5: precision 4/5=0.8 (because only g should not be among the first 5 incipits)
 
 
after position 6: precision 5/6=0.83 (because only g should not be among the first 6)
 
 
For an illustration, see the last 3 slides of http://teuge.labs.cs.uu.nl/Ruu/orpheus/groundtruth/dir05pres.pdf
 
 
For a more formal and detailed description, see http://teuge.labs.cs.uu.nl/Ruu/orpheus/groundtruth/measure.pdf
 
 
===2) Comparison of algorithms without using the ground truth===
 
To avoid the overfitting of the algorithms to the ground truth, select another group of queries after all algorithms have been submitted. Put all incipits returned by the algorithms into one pool. Remove doubles from the pool, divide the remaining incipits into portions of equal sizes, and distribute them back to the participants for relevance judgements (on a binary or ternary scale). Once the relevance is known for all returned incipits, use standard TREC measures for comparing the algorithms. This second comparison has some disadvantages: the complete set of relevant incipits won't be known, and neither will be the correct ranking. But the main disadvantage of the first comparison, the fact that the correct answer is known in advance, is avoided.
 
 
Concrete suggestion: use ternary scale (not relevant/relevant/highly relevant). Determine precision-recall curves for positions 1 to N, where N is the total number of known relevant answers for the given query, separately for only the highly relevant documents and for relevant and highly relevant.
 
As comparable measure, use recall at N.
 
 
==Relevant Test Collections==
 
 
RISM A/II collection (large collection of real-world compositions for which a ground truth has already been established; the incipits are symbolically encoded music. On the RISM CD, they are stored in the plaine&easie format. The incipits typically are monophonic and contain between 10 and 40 notes, or about 3 to 6 measures, with considerable variation).
 
Possible copyright issue: Permission from RISM for distributing the test collection or parts of it to participants and using it for EvalFest is probably necessary.
 
Obtaining the data is not difficult: the RISM CD is available in good libraries, and the data can be exported from the CD in plaine&easie format. Alternatively, the existing result of converting the data into MusicXML could be distributed.
 
If RISM does not agree to the data being distributed, maybe it is possible to convince them to at least agree to a solution similar to the Naxos audio collection, where the data are quarantained, and only the software travels.
 
 
If no permission to use the RISM data can be obtained, one could consider to:
 
* run the algorithms only on the incipits whose ranking is known from the ground truth (about 20-50 incipits per query instead of half a million) and apply the same measure.
 
* use other data (such as data from M├╝llensiefen and Frieler), establish a ground truth for those, and apply the same measure.
 
 
<b>RISM UK has signaled that permission to use a five-digit number of their incipits can be obtained.</b>
 
 
==Review 1==
 
 
Melodic similarity would have to be precisely defined, i.e. how was the ground truth established.
 
 
The mentioned possible participants are really likely to participate (to my knowledge at least 10 out of the 14 listed).
 
 
Part 1 - Absolute measurement using ground truth
 
Can the number of queries be increased?  11 queries for a collection of 50000 incipits seem rather small (This comment can be ignored if there is difficulty in establishing the ground truth from such a large collection).
 
Part 2 - Comparison of algorithms without using the ground truth
 
Any ideas of how queries would be obtained ? Real-world queries ? Queries pooled from each participant ?
 
 
The relevance assumptions have to be explicitly stated for the judging process by the various participants using this pooling approach.
 
If ternary scale is used, details of this scale would be needed.
 
What evaluation measure would be adopted from TREC?
 
 
== R. Typke about Review 1==
 
<i>Part 1 - Absolute measurement using ground truth Can the number of queries be increased? 11 queries for a collection of 50000 incipits seem rather small (This comment can be ignored if there is difficulty in establishing the ground truth from such a large collection). </i>
 
 
We are planning to collect more data if circumstances allow. We are offering this task to students as part of their required work in Utrecht. But it is hard to do this quickly, on demand, since we need to actually find someone who is willing to do this work.
 
 
<i>Part 2 - Comparison of algorithms without using the ground truth Any ideas of how queries would be obtained ? Real-world queries ? Queries pooled from each participant ?</i>
 
 
I had in mind something like just randomly selecting some incipits from the collection and using them as queries. This would have the advantage that the query is somewhat similar to the data, so we would not compare apples with pears. After all, we want to concentrate on the various methods' abilities to come close to human ideas of similarity without adding extra complications such as having to search a database of notated music for a hummed query. Especially, we don't want to introduce any bias towards one method by choosing queries it can handle particularly well.
 
 
<i>The relevance assumptions have to be explicitly stated for the judging process by the various participants using this pooling approach. If ternary scale is used, details of this scale would be needed. What evaluation measure would be adopted from TREC? </i>
 
 
(still open)
 
 
==Review 2==
 
 
A very well- thought out proposal. First off, he has a database in mind, a large list of participants and thought carefully about evaluation. However, the number of likely participants is probably not as high as he believes and there are still issues with evaluation.
 
 
I'm not totally clear about some aspects. These 500000 notated real world compositions- are they audio files or midi? How long are they and are they mono or polyphonic? How is the notation done? Judging from most of the names of participants, I'd guess he's talking about midi.
 
 
His set of 11 queries with groundtruth is actually quite small, and there is still considerable work involved in getting groundtruth for other queries, especially since it requires feedback from many music experts.
 
 
His method of evaluation through use of a groundtruth established by music experts should be compared with Jeremy's suggestion (check previous MIREX postings or just communicate with him directly) of using variations to identify melodic similarity.
 
 
His section on evaluation is quite good, and precision/recall is the preferred method in the Info Retrieval community (just ask Stephen). Both algorithms seem feasible assuming the other issues have been worked out. If possible, both his suggested evaluation procedures should be implemented. Would be interesting to see if they give similar results, and if not, why?
 
If the RISM A/II collection is really available from many libraries, then most of the work is done. Stephen's group can keep it quarantined and just run the algorithms that are provided. I think we don't even need to ask RISM for permission.
 
 
I think we should provisionally accept this, under the conditions that we do have a suitable number of participants, that Stephen agrees with the evaluation procedures, that we are able to extend the 11 queries and verify what the 'music experts' say, and that
 
we have no problems with RISM.
 
 
== R. Typke about Review 2==
 
 
<i>I'm not totally clear about some aspects. These 500000 notated real world compositions- are they audio files or midi? How long are they and are they mono or polyphonic? How is the notation done? Judging from most of the names of participants, I'd guess he's talking about midi.</i>
 
 
* The incipits are symbolically encoded music. On the RISM CD, they are stored in the plaine&easie format. I wrote a script for converting that to MusicXML. Conversion to MIDI or other formats is straightforward from there, but involves some loss of data (grace notes, articulation marks, and similar things cannot always be properly represented in MIDI).
 
* The incipits typically are monophonic and contain between 10 and 40 notes, or about 3 to 6 measures, with considerable variation.
 
 
<i>His set of 11 queries with groundtruth is actually quite small, and there is still considerable work involved in getting groundtruth for other queries, especially since it requires feedback from many music experts.</i>
 
 
Yes. It is very expensive to establish ground truths for more incipits. We used about 70 hours of experts' time for the 11 queries, (35 experts, 2 hours per expert), and these 2 hours were not even enough to get through all 11 queries for most experts.
 
 
 
==Downie's Comments==
 
 
1. A classic retrieval task! I am game. Some hurdles to jump but I always thought RISM was a good place to start.
 
 
2. Do think Jeremy Pickens would have some good ideas to help this along.
 
 
3. I wonder if the Meldex folks would like to play?
 
 
==M├╝llensiefen&Frieler's Comments==
 
 
We appreciate the well-thought proposal of the Utrecht group and we are
 
highly willing to participate. But, of course, we have some comments.
 
 
1. The whole evaluation is more about database query algorithms for short
 
phrases and not about the whole field of melodic similarity. E.g. in
 
gathering the ground truth data (GTD), the experts were asked the judge
 
"containedness" as "identity". Furthermore, the RISM database contains only
 
incipits and no longer, structured melodies with several phrases. This
 
certainly has effects on human similarity judgments. This evaluation may not
 
to give the 'whole picture', but is definitely a relevant task from a
 
musicological and especially MIR perspective.
 
 
2. We are wondering if there is a need for giving the ground truth data to
 
the participants for optimizing their algorithms, which in turn makes a
 
second evaluation as proposed by Utrecht group preferable. Any good
 
algorithm should and is likely to be optimized or trained on some data and
 
should be ready to use for unseen data. So, if the software travels to
 
Utrecht (or another appropiate place), we would avoid copyright problems
 
with the RISM publishers, and every algorithm has the same chances without
 
prior optimisation on the test data set. Then, a second evaluation without
 
GTD would not be necessary in our opinion, though interesting in it's own
 
right.
 
 
3. Furthermore, we wonder if there might be a bias towards the Utrecht group,
 
if the RISM collection, which they were working on for years, will be used.
 
E.g in gathering the ground truth the earth mover distance was used
 
for preselection. This possible bias should be kept in mind. For this reason, the Utrecht Group may be seen in the role of a referee and
 
organising team in this evaluation.
 
 
4. Their ranking evaluation procedure seems to be the most straightforward
 
and simplest one in light of the structure of the GTD.  But some remarks can
 
be made. First, what if the participating algorithms (PA) find incipits, not
 
contained in the rank list of the GTD at all, but which could be (at least)
 
theoretically be judged similar by human experts. How can be accounted for
 
that? Second, one could imagine of getting a rank list for the
 
melodies in the GTD by the PA by direct comparison, and comparing this RL
 
(in a to be determined way) to the GT rank list. E.g. if the rank list for a
 
query q contains melodies (a,b),(c,d,e),(f), then a PA could give a rank list
 
like  b, f, d, e, a, c, which can be compared to the GTDRL. Does this make any
 
sense or give further insight?
 
 
5. NB. The integrated precision results should better be divided by the number
 
of groups, not by the number of  incipits N.
 
 
 
==Rainer Typke's comments on M├╝llensiefen and Frieler's Comments==
 
Thanks for your comments!
 
 
<b>ad 2.:</b> establishing a ground truth is a lot of work, therefore it can't be done quickly, ad hoc, for some randomly selected queries. So, if we use the ground truth, we have to live with the fact that both the queries and the correct answers are known in advance, but on the other hand, we get closer to an absolute truth than by just pooling the participants' answers and hoping that all relevant answers are among them.
 
 
To get the best of both worlds, I think we have to do both evaluations. Luckily, using the already existing ground truth is very little work.
 
 
<b>ad 3.:</b> There is indeed a higher probability that incipits which can be found with the C-Brahms methods P2 and P3 from Helsinki, with the Utrecht method, and, to some degree, with the Bonn method (which is similar to C-Brahms) are included in the ground truth because after filtering out dissimilar incipits, we used these 3 methods for putting back incipits that were filtered out, but should not have been filtered out.
 
 
However, this translates into a bias towards these methods only if additional similar incipits that are found by some other method and should have been in the ground truth, but were accidentally filtered out by us, are not added to the ground truth once they are found.
 
 
This leads to the open question (related to <b>4.</b>):
 
 
<b>What should we do in cases where additional similar incipits are found that were erroneously excluded from the ground truth?</b> They should probably be added, but we would then need some panel of music experts who decide what is really similar and what isn't. Open questions: who recruits and interviews experts to judge new candidates? which and how many experts - maybe just the participants? should they be shown only the new candidates or the old lists of candidate incipits plus the new candidates?
 
 
<b>ad 5.:</b> what would be advantages of dividing by the number of groups?
 
I see only disadvantages:
 
 
The number of groups is probably not always meaningful. The group boundaries depend on:
 
* number of expert opinions. More opinions &rarr; higher statistical significance &rarr; more groups can be recognized.
 
* the threshold for statistical significance. Higher threshold &rarr; more groups.
 
* admittedly, there is also some dependence on the properties of the melodies.
 
 
So, if we divide by the number of groups, we have these disadvantages:
 
* dependence on some arbitrary or meaningless numbers such as the number of experts and the taste of whoever picked the threshold for statistical significance
 
* the measure is no longer a number between 0 and 1, therefore the results for different queries can no longer be directly compared. If we divide by the number of groups, the measure has no upper bound anymore (consider the extreme case of just one group, but lots of ranked incipits).
 
 
 
== Open points ==
 
* Do we have a problem with RISM? If yes, what can we do about it? In particular, what would be the best way of convincing them to grant us the right to extract the A/II collection from the CD and distribute it to the participating researchers?
 
* Evaluation with ground truth: How to treat the case of a method finding good matches that are not included in the ground truth because they were accidentally not shown to the experts who established the ground truth. Of course, there is no problem with incipits that were rejected by the experts. But for incipits that had been filtered out incorrectly, we need to establish a proper procedure for amending the ground truth.
 
* Evaluation without ground truth: We need to agree on a suitable relevance scale and suitable TREC-like measures.
 
* One little detail for the measure to be used with the ground truth: what exactly should the curve look like over which we integrate? Let's assume we have <math>n</math> groups, with borders after <math>x_i</math> documents (<math>i \in 1..n</math>) and precisions <math>x_i</math>.
 
Should we
 
** let the curve start at (0,1), then continue to (x_1,p_1)</math>, <math>(x_2,p_2)</math>...<math>(x_n,p_n)</math>
 
** instead of a continuous curve, draw horizontal line segments covering each group (kind of like a bar diagram)
 
** do something else?
 
If we don't do the bar diagram thing, the measure might not really be in the interval [0,1].
 
 
== Comment on the collection (from ALU) ==
 
* With TREC I believe people pay for the CDs of data.  What does RISM cost?  At worst, maybe we pay a licence fee to RISM? (Not that my school has any budget ATM... ):
 
* Regarding IP of collections, at least we're dealing with a single copyright holder for this, unlike the case of MIDI files. The only other symbolic collections of promise are the Digital Tradition/Schaffrath folk songs, or the classical collections held by MIT - but these currently have only been used with known item searches, not relevance judgements AFAICR.
 
== ==
 
The CD costs about 650 EUR. But once you have the CD, you still need to export the data and convert the plaine&easie data into a format you can actually use. If every participant has to do this, instead of being able to receive a data set in MusicXML together with the C++ source code for a parser (we could provide that), participating is more work than necessary. (and, of course, it might also be debatable if using the CD in this way conforms to the copyright rules).
 
 
== Comments from Grachten, Arcos, and Mantaras ==
 
Sorry for the much delayed reaction.
 
 
In general, I think the evaluation procedure is well thought out. However, I share M├╝llensiefen and Frieler's opinion that the filtering method applied to reduce the number of candidates imposes a (risk of) bias. Even when the filtered-out incipits returned by the algorithms are included, it may be hard to garantuee that all the incipits (initially included and included afterwards) receive an equal amount of attention and judgment (i.e. different number of musical experts, different number of candidates to choose from, etc).
 
 
A more general comment (extending M├╝llensiefen and Frieler's first point) is that judging a melodic similarity measure by its capability to identify the most similar candidates to a particular query (the database querying problem) is just one aspect. Alternatively, it might be valuable e.g. to be able to meaningfully estimate distances among melodies that are not supposed to be very similar. A measure that is capable of the first task isn't necessarily good at the second, and vice versa. (This a comment, not a criticism of the currently proposed approach)
 

Revision as of 16:06, 31 August 2005

Training data and ground truth available

Here are links to 11 queries and their expected results, along with a script for calculating ADR.

http://teuge.labs.cs.uu.nl/Ruu/mirex/gt_queries_1/ - 11 queries

http://www.iiia.csic.es/~maarten/mirex/gt_queries_1_midi.tar.gz - MIDI versions of above queries (with grace notes removed)


http://teuge.labs.cs.uu.nl/Ruu/mirex/gt_cand_1/ - candidates for matches (this is the collection to search)

http://www.iiia.csic.es/~maarten/mirex/gt_cand_1_midi.tar.gz - MIDI versions of above candidates (with grace notes removed)


http://teuge.labs.cs.uu.nl/Ruu/mirex/training_ground_truth/ - expected results for the 11 training queries (one result would be, for example, http://teuge.labs.cs.uu.nl/Ruu/mirex/training_ground_truth/train/330044.htm )


http://teuge.labs.cs.uu.nl/Ruu/mirex/training_ground_truth/adr_pl.txt - script for calculating ADR, given one ground truth html file such as 330044.htm and a list of IDs of matches.


Still preparing the ground truths that will be used for the actual evaluation.

- Rainer