2005:Symbolic Melodic

From MIREX Wiki
Revision as of 15:45, 31 January 2005 by Admin (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Rainer Typke (Universiteit Utrecht) rtypke@cs.uu.nl


Retrieving melodically similar incipits from 500000 notated real world compositions


Retrieving the most similar incipits from the RISM A/II collection, given one of the incipits as a query.

Expected results: For 11 queries, a ground truth has been established (see Typke et al. 2005, http://teuge.labs.cs.uu.nl/Ruu/orpheus/groundtruth/TR_groundtruth.pdf ). This ground truth makes it possible to measure the search results in an absolute way, taking into consideration not only relevance, but also the correct ranking. This ground truth has been obtained by combining ranked lists that were created by 35 music experts. The resulting ground truth has the form of ranked groups of incipits; the groups contain incipits whose differences in rankings were not statistically significant, but the ranking of the groups is statistically significant. MIR systems that perform well should return incipits such that the members of the groups are retrieved without violating the order of the groups. By using this ground truth, an absolute measure of quality is attainable. However, there is a danger of overfitting algorithms to just the 11 queries for which a ground truth is known. Therefore, there should also be an evaluation based on another set of queries that are chosen only after the algorithms have been submitted. This second part of the evaluation would only allow a comparison, no absolute measurement, but any dependance of the algorithms on the queries could be avoided.

Potential Participants

Tuomas Eerola/Petri Toiviainen Jyväskylä, Finland Jürgen Kilian/Holger Hoos Darmstadt/British Columbia, Canada Shyamala Doraisamy/Stefan Rueger Malaysia/London Maarten Grachten/Josep Lluis Arcos/Ramón López de Mántaras Bellaterra, Spain Giovanna Neve/Nicola Orio Pado Padova, Italia Anna Pienimäki/Kjell Lemström Helsinki, Finland Craig Sapp/Yi Wen Liu/Eleanor Selfridge-Field Stanford, US Daniel Müllensiefen/Klaus Frieler Hamburg Anna Lubiw/Luke Tanur Waterloo, Canada Michael Clausen Bonn Alexandra Uitdenbogerd, RMIT Rainer Typke, Frans Wiering, Remco C. Veltkamp, Universiteit Utrecht, {rainer.typke|frans.wiering|remcov}@cs.uu.nl Jeremy Pickens - jeremy@dcs.kcl.ac.uk Tim Crawford - t.crawford@gold.ac.uk

The likelihood of entering is unknown for all groups except Utrecht and Goldsmiths/King's; for Utrecht and Goldsmiths/King's, it is high. The other groups have developed algorithms that would be interesting to compare, so their likelihood of entering is not too low.

Evaluation Procedures

1) Absolute measurement using the ground truth For each of the 11 queries from the ground truth, we have a certain number of groups of incipits, where we know the correct ranking of the groups. Let us assume that the total number of incipits in these groups is N. For each ranking returned by a method to be evaluated, take the top N incipits. At every border between groups, calculate the precision. To get one single measure, one can now integrate over this precision curve and divide the result by N. This measure will be a number between 0 and 1, where 0 means total failure and 1 means complete agreement with the ground truth. This measure will reflect not only the ability of algorithms to find as many relevant incipits as possible, but also to order them correctly.

For example, let us assume that the ground truth has delivered the following three groups, with N=6: (a,b),(c,d,e),(f).

Method X returns: b,a, g,c,d, e.

Since there are 3 groups, we determine precision and recall at three points:

after position 2: precision 1 (because both relevant documents a and b have been found. The correct order of a and b is not known since they are in the same group, therefore no penalty should be applied for them being in the opposite order).

after position 5: precision 4/5=0.8 (because only g should not be among the first 5 incipits)

after position 6: precision 5/6=0.83 (because only g should not be among the first 6)

For an illustration, see the last 3 slides of http://teuge.labs.cs.uu.nl/Ruu/orpheus/groundtruth/dir05pres.pdf

2) Comparison of algorithms without using the ground truth To avoid the overfitting of the algorithms to the ground truth, select another group of queries after all algorithms have been submitted. Put all incipits returned by the algorithms into one pool. Remove doubles from the pool, divide the remaining incipits into portions of equal sizes, and distribute them back to the participants for relevance judgements (on a binary or ternary scale). Once the relevance is known for all returned incipits, use standard TREC measures for comparing the algorithms. This second comparison has some disadvantages: the complete set of relevant incipits won't be known, and neither will be the correct ranking. But the main disadvantage of the first comparison, the fact that the correct answer is known in advance, is avoided.

Relevant Test Collections

RISM A/II collection (large collection of real-world compositions for which a ground truth has already been established). Possible copyright issue: Permission from RISM for distributing the test collection or parts of it to participants and using it for EvalFest is probably necessary. Obtaining the data is not difficult: the RISM CD is available in good libraries, and the data can be exported from the CD in plaine&easie format. Alternatively, the existing result of converting the data into MusicXML could be distributed. If RISM does not agree to the data being distributed, maybe it is possible to convince them to at least agree to a solution similar to the Naxos audio collection, where the data are quarantained, and only the software travels.

Review 2

A very well- thought out proposal. First off, he has a database in mind, a large list of participants and thought carefully about evaluation. However, the number of likely participants is probably not as high as he believes and there are still issues with evaluation.

I'm not totally clear about some aspects. These 500000 notated real world compositions- are they audio files or midi? How long are they and are they mono or polyphonic? How is the notation done? Judging from most of the names of participants, I'd guess he's talking about midi.

His set of 11 queries with groundtruth is actually quite small, and there is still considerable work involved in getting groundtruth for other queries, especially since it requires feedback from many music experts.

His method of evaluation through use of a groundtruth established by music experts should be compared with Jeremy's suggestion (check previous MIREX postings or just communicate with him directly) of using variations to identify melodic similarity.

His section on evaluation is quite good, and precision/recall is the preferred method in the Info Retrieval community (just ask Stephen). Both algorithms seem feasible assuming the other issues have been worked out. If possible, both his suggested evaluation procedures should be implemented. Would be interesting to see if they give similar results, and if not, why? If the RISM A/II collection is really available from many libraries, then most of the work is done. Stephen's group can keep it quarantined and just run the algorithms that are provided. I think we don't even need to ask RISM for permission.

I think we should provisionally accept this, under the conditions that we do have a suitable number of participants, that Stephen agrees with the evaluation procedures, that we are able to extend the 11 queries and verify what the 'music experts' say, and that we have no problems with RISM.