2005:Symbolic Melodic

From MIREX Wiki
Revision as of 09:40, 3 February 2005 by 131.211.81.134 (talk | contribs) (Rainer Typke's comments on M├╝llensiefen and Frieler's Comments)

Proposer

Rainer Typke (Universiteit Utrecht) rtypke@cs.uu.nl


Title

Retrieving melodically similar incipits from 500,000 notated real world compositions

Description

Retrieving the most similar incipits from the RISM A/II collection, given one of the incipits as a query.

Expected results: For 11 queries, a ground truth has been established (see Typke et al. 2005, http://teuge.labs.cs.uu.nl/Ruu/orpheus/groundtruth/TR_groundtruth.pdf ). This ground truth makes it possible to measure the search results in an absolute way, taking into consideration not only relevance, but also the correct ranking. This ground truth has been obtained by combining ranked lists that were created by 35 music experts. The resulting ground truth has the form of ranked groups of incipits; the groups contain incipits whose differences in rankings were not statistically significant, but the ranking of the groups is statistically significant. MIR systems that perform well should return incipits such that the members of the groups are retrieved without violating the order of the groups. By using this ground truth, an absolute measure of quality is attainable. However, there is a danger of overfitting algorithms to just the 11 queries for which a ground truth is known. Therefore, there should also be an evaluation based on another set of queries that are chosen only after the algorithms have been submitted. This second part of the evaluation would only allow a comparison, no absolute measurement, but any dependance of the algorithms on the queries could be avoided.


Potential participants and their likelihood of entering

Names of people who have expressed interest are underlined. Names of people who have been contacted, but haven't expressed an opinion yet, are marked with a copyright symbol (©).

Tuomas Eerola (ptee@cc.jyu.fi)/Petri Toiviainen (Jyväskylä, Finland) -- high

J├╝rgen Kilian (kilian@noteserver.org)/Holger Hoos (Darmstadt/British Columbia, Canada) -- high

Shyamala Doraisamy/Stefan Rueger © (Malaysia/London) -- unknown

Maarten Grachten (maarten@iiia.csic.es)/Josep Lluis Arcos/Ramón López de Mántaras (Bellaterra, Spain) -- high

Giovanna Neve/Nicola Orio Pado © (hopefully I found the right e-mail address, I tried orio@dei.unipd.it) (Padova, Italia ) -- unknown

Anna Pienim├ñki ©/Kjell Lemstr├╢m © (Helsinki, Finland) -- unknown

Craig Sapp/Yi Wen Liu/Eleanor Selfridge-Field (Stanford, US) -- medium

Daniel M├╝llensiefen/Klaus Frieler (Hamburg) -- high

Anna Lubiw ©/Luke Tanur © (Waterloo, Canada) -- unknown

Michael Clausen © (Bonn) -- unknown

Ludger Hofmann-Engl (hofmann-engl@chameleongroup.org.uk) -- high

Alexandra Uitdenbogerd (sandrau@rmit.edu.au, RMIT) -- high

Rainer Typke, Frans Wiering, Remco C. Veltkamp (Universiteit Utrecht, {rainer.typke|frans.wiering|remcov}@cs.uu.nl) -- high

Jeremy Pickens (jeremy@dcs.kcl.ac.uk), Tim Crawford (t.crawford@gold.ac.uk) -- high

The other groups have developed algorithms that would be interesting to compare, so their likelihood of entering is not too low.

Evaluation Procedures

1) Absolute measurement using the ground truth

For each of the 11 queries from the ground truth, we have a certain number of groups of incipits, where we know the correct ranking of the groups. Let us assume that the total number of incipits in these groups is N. For each ranking returned by a method to be evaluated, take the top N incipits. At every border between groups, calculate the precision. To get one single measure, one can now integrate over this precision curve and divide the result by N. This measure will be a number between 0 and 1, where 0 means total failure and 1 means complete agreement with the ground truth. This measure will reflect not only the ability of algorithms to find as many relevant incipits as possible, but also to order them correctly.

For example, let us assume that the ground truth has delivered the following three groups, with N=6:

(a,b),(c,d,e),(f).

Method X returns:

b,a, g,c,d, e.

Since there are 3 groups, we determine precision and recall at three points:

after position 2: precision 1 (because both relevant documents a and b have been found. The correct order of a and b is not known since they are in the same group, therefore no penalty should be applied for them being in the opposite order).

after position 5: precision 4/5=0.8 (because only g should not be among the first 5 incipits)

after position 6: precision 5/6=0.83 (because only g should not be among the first 6)

For an illustration, see the last 3 slides of http://teuge.labs.cs.uu.nl/Ruu/orpheus/groundtruth/dir05pres.pdf

2) Comparison of algorithms without using the ground truth

To avoid the overfitting of the algorithms to the ground truth, select another group of queries after all algorithms have been submitted. Put all incipits returned by the algorithms into one pool. Remove doubles from the pool, divide the remaining incipits into portions of equal sizes, and distribute them back to the participants for relevance judgements (on a binary or ternary scale). Once the relevance is known for all returned incipits, use standard TREC measures for comparing the algorithms. This second comparison has some disadvantages: the complete set of relevant incipits won't be known, and neither will be the correct ranking. But the main disadvantage of the first comparison, the fact that the correct answer is known in advance, is avoided.

Relevant Test Collections

RISM A/II collection (large collection of real-world compositions for which a ground truth has already been established). Possible copyright issue: Permission from RISM for distributing the test collection or parts of it to participants and using it for EvalFest is probably necessary. Obtaining the data is not difficult: the RISM CD is available in good libraries, and the data can be exported from the CD in plaine&easie format. Alternatively, the existing result of converting the data into MusicXML could be distributed. If RISM does not agree to the data being distributed, maybe it is possible to convince them to at least agree to a solution similar to the Naxos audio collection, where the data are quarantained, and only the software travels.

Review 1

Melodic similarity would have to be precisely defined, i.e. how was the ground truth established.

The mentioned possible participants are really likely to participate (to my knowledge at least 10 out of the 14 listed).

Part 1 - Absolute measurement using ground truth Can the number of queries be increased? 11 queries for a collection of 50000 incipits seem rather small (This comment can be ignored if there is difficulty in establishing the ground truth from such a large collection). Part 2 - Comparison of algorithms without using the ground truth Any ideas of how queries would be obtained ? Real-world queries ? Queries pooled from each participant ?

The relevance assumptions have to be explicitly stated for the judging process by the various participants using this pooling approach. If ternary scale is used, details of this scale would be needed. What evaluation measure would be adopted from TREC?

Review 2

A very well- thought out proposal. First off, he has a database in mind, a large list of participants and thought carefully about evaluation. However, the number of likely participants is probably not as high as he believes and there are still issues with evaluation.

I'm not totally clear about some aspects. These 500000 notated real world compositions- are they audio files or midi? How long are they and are they mono or polyphonic? How is the notation done? Judging from most of the names of participants, I'd guess he's talking about midi.

His set of 11 queries with groundtruth is actually quite small, and there is still considerable work involved in getting groundtruth for other queries, especially since it requires feedback from many music experts.

His method of evaluation through use of a groundtruth established by music experts should be compared with Jeremy's suggestion (check previous MIREX postings or just communicate with him directly) of using variations to identify melodic similarity.

His section on evaluation is quite good, and precision/recall is the preferred method in the Info Retrieval community (just ask Stephen). Both algorithms seem feasible assuming the other issues have been worked out. If possible, both his suggested evaluation procedures should be implemented. Would be interesting to see if they give similar results, and if not, why? If the RISM A/II collection is really available from many libraries, then most of the work is done. Stephen's group can keep it quarantined and just run the algorithms that are provided. I think we don't even need to ask RISM for permission.

I think we should provisionally accept this, under the conditions that we do have a suitable number of participants, that Stephen agrees with the evaluation procedures, that we are able to extend the 11 queries and verify what the 'music experts' say, and that we have no problems with RISM.

Downie's Comments

1. A classic retrieval task! I am game. Some hurdles to jump but I always thought RISM was a good place to start.

2. Do think Jeremy Pickens would have some good ideas to help this along.

3. I wonder if the Meldex folks would like to play?

M├╝llensiefen&Frieler's Comments

We appreciate the well-thought proposal of the Utrecht group and we are highly willing to participate. But, of course, we have some comments.

1. The whole evaluation is more about database query algorithms for short phrases and not about the whole field of melodic similarity. E.g. in gathering the ground truth data (GTD), the experts were asked the judge "containedness" as "identity". Furthermore, the RISM database contains only incipits and no longer, structured melodies with several phrases. This certainly has effects on human similarity judgments. This evaluation may not to give the 'whole picture', but is definitely a relevant task from a musicological and especially MIR perspective.

2. We are wondering if there is a need for giving the ground truth data to the participants for optimizing their algorithms, which in turn makes a second evaluation as proposed by Utrecht group preferable. Any good algorithm should and is likely to be optimized or trained on some data and should be ready to use for unseen data. So, if the software travels to Utrecht (or another appropiate place), we would avoid copyright problems with the RISM publishers, and every algorithm has the same chances without prior optimisation on the test data set. Then, a second evaluation without GTD would not be necessary in our opinion, though interesting in it's own right.

3. Furthermore, we wonder if there might be a bias towards the Utrecht group, if the RISM collection, which they were working on for years, will be used. E.g in gathering the ground truth the earth mover distance was used for preselection. This possible bias should be kept in mind. For this reason, the Utrecht Group may be seen in the role of a referee and organising team in this evaluation.

4. Their ranking evaluation procedure seems to be the most straightforward and simplest one in light of the structure of the GTD. But some remarks can be made. First, what if the participating algorithms (PA) find incipits, not contained in the rank list of the GTD at all, but which could be (at least) theoretically be judged similar by human experts. How can be accounted for that? Second, one could imagine of getting a rank list for the melodies in the GTD by the PA by direct comparison, and comparing this RL (in a to be determined way) to the GT rank list. E.g. if the rank list for a query q contains melodies (a,b),(c,d,e),(f), then a PA could give a rank list like b, f, d, e, a, c, which can be compared to the GTDRL. Does this make any sense or give further insight?

5. NB. The integrated precision results should better be divided by the number of groups, not by the number of incipits N.


Rainer Typke's comments on M├╝llensiefen and Frieler's Comments

Thanks for your comments! We'll try to respond soon.

For now just one response:

ad 5.: what would be advantages of dividing by the number of groups? I see only disadvantages:

The number of groups is probably not always meaningful. The group boundaries depend on:

  • number of expert opinions. More opinions → higher statistical significance → more groups can be recognized.
  • the threshold for statistical significance. Higher threshold → more groups.
  • admittedly, there is also some dependence on the properties of the melodies.

So, if we divide by the number of groups, we have these disadvantages:

  • dependence on some arbitrary or meaningless numbers such as the number of experts and the taste of whoever picked the threshold for statistical significance
  • the measure is no longer a number between 0 and 1, therefore the results for different queries can no longer be directly compared.