2006:Important threads for Audio Similarity
This page is devoted to discussions of the evaluation of Audio Music Similarity algorithms at MIREX 2006. Discussions have already begun on the MIREX 06 "AudioSim06" contest planning list and will be briefly digested here. A full digest of the discussions is available to subscribers from the MIREX 06 "AudioSim06" contest planning list archives.
- 1 Important threads on the discussion list
- 1.1 Context related issues in music similarity evaluation
- 1.1.1 Viewpoints
- 126.96.36.199 George Tzanetakis
- 188.8.131.52 Anders Meng
- 184.108.40.206 Elias Pampalk
- 220.127.116.11 Dan Ellis
- 18.104.22.168 Kris West
- 22.214.171.124 Fabio Vignoli
- 126.96.36.199 Fabian M├╢rchen
- 188.8.131.52 Kris West
- 1.1.1 Viewpoints
- 1.2 Types of evaluation
- 1.3 Factors to evaluate
- 1.4 Objective statistics based upon album, artist and genre labels proposal
- 1.5 Subjective evaluation
- 1.1 Context related issues in music similarity evaluation
Important threads on the discussion list
- Kris West
- Paul Lamere
- Elias Pampalk
- Fabian M├╢rchen
- George Tzanetakis
- Dan Ellis
- Stephen Green
- Rebecca Fiebrink
- Mark Levy
- Hamish Allan
- Anders Meng
- Adam Lindsay
1) Is the notion of music similarity consistent between different humans/cultures/music education etc ? One thing I know for sure is that to all of you most pieces of folk music from the island of Crete would sound very similar whereas to people from Crete (including me) they sound completely unique.
2) Does it even make sense to speak of similarity as a one dimensional quantity ? For example is the dance version of Carmina Burana more similar to a classical recording of Carmina Burana or to another dance piece using the same drum loop.
3) Can similarity be context-independent ? Similarity only makes sense relative to a particular context. Billie Holiday is very different from Ella Fitzgerald in a context of female jazz singer however might be perceived as very similar in a general context of female singers including Britney Spears and Anni DiFranco.
Each of us consider similarity in a very different manner. Consider the scenario of human ranking of playlists according to similarity. The ranking we would get out of this "non-guided" (flat prior) similarity evaluation would be a kind of average ranking, since each user ranks after his preferences: E.g. user 1 might rank after vocal similarity, while user 2 ranks after instrument similarity, and so on. Perhaps it would be beneficial for the end user of some fancy music retrieval system in the future either to find music based on a kind of "average" similarity (which I guess a lot of people would be happy with) or perhaps be able to be select his/her dimension of similarity, say : "Vocal similarity, like bono....". Perhaps a multidimensional similarity evaluation will be possible next year, but would almost certainly have to involve the generation of ground-truth through subjecctive similarity judgements.
perceived similarity is:
- context dependant
I don't think anyone disagrees :-)
yet, I have no doubts, that by comparing the performances of algorithms which compute:
- context independent,
similarity ratings, we could benefit a lot.
such algorithms are something very practical. e.g. MTG recently announced that they were making money with such a technology as part of their "music surfer" product.
I'd really like to argue for keeping things as simple as possible. It's the first year we are trying this. I doubt that we could possibly keep it too simple, but if we do we could easily improve things next time.
The issue of 'context-dependent' similarity is not so hard to deal with. If I give you one track and ask for the most similar ones, there's no context. But if I give you 20, the spread within the set defines the context, and the algorithms can try to infer it. So if those tracks all have Paul McCartney singing, or if they all use cellos, or if they all use a simple tonic-subdominant-dominant chord progression, it seems a well-formed problem to ask an algorithm to infer the correct aspect on which to perform matching.
Predicting human-generated playlists is one possible way to frame this. By trawling the web, or scraping people's iTunes databases, you can get a lot of sets of songs that people have lumped together for one reason or another, representative of the real spread of 'contexts' relevant to users. Having algorithms attempt to complete the rest of a playlist given the 1st half is something you can measure, even if performance is bound to be low in absolute terms. I'm not frightened of low absolute performance provided there is still measurable difference between different systems.
Another observation worth making is that a relatively small dataset (e.g. ~5000 tracks), where both data and metadata come from a common source (e.g. a single record label), defines its own limited context as you can reasonably expect that the genre classifications (used to organise the collection) have been applied in a consistent manner without outliers. The likelihood of this being the case decreases as the collection size rises, as it would require more editors and more judgements to organise.
Collection specific learning
It appears that music similarity estimators can be roughly divided into three groups, based the types of data that they leverage: purely content-based, augmented with behavioural data (such as skipping behaviour, playlist co-occurence etc.), and trained (content-based with Collection specific parameters estimated from a labelled subset of the database or an independent database). Purely content-based estimators appear to be the default mode for this evaluation. However, evaluation of trained submissions should be possible. Assuming an evaluation database size of 4000+ examples, 1000 - 1500 examples could be held back for training. Examples should be selected from the database in the same proportions that they occur. It might also be interesting to evaluate two copies of trained algorithms, one trained on the subset from the database and another trained on a separate dataset (greater differences in performance based on the two training sets may indicate overfitting to the collection, while smaller differences may indicate better generalisation).
The focus on context is in the previous paragraph referred to the other songs available in the collections. There is however an other type of context that should be taken into account when comparing music similarity measures and it has been mixed up: the application context. What is the purpose of a music similarity algorithm? I think one of the following:
- Browsing a music collection that is known to the user
- Exploring a music collection that is unknown to the user
- Creating playlists
The 3rd point has a completely different flavour from the other two. --Vignoli 08:28, 20 December 2005 (CST)
I consider collection specific learning very important. The more recent discussion on this list on using several ground truths also seems to support this. I doubt that someone can find a description of musical audio content that solves all the described problems on all types of music. A good system for audio similarity should be able to learn a ground truth given genre or artist or user preference or something else as ground truth and be able to approximately reproduce it on similar data, as long as the ground truth is music related. ;)
A better definition of context?
Fabio makes an excellent point above, context is poorly defined for the discussion of Music query-by-example systems. It seems to me that a much clearer division can be drawn between a specific query and the context within which it is performed. The context of a query may be composed of one or more of the following:
- A User (the user's musical knowledge, tastes or behaviour)
- A specific collection (How is music organised/categorised/clustered within the collection? Catalog owner's musical knowledge/behaviour, closely related to above)
- Culture or Genre (Musical knowledge/conventions within the culture or genre, e.g. Western music, Western Classical music, Rock music, Greek traditional music. No sense making a Pop music query within the context of Greek traditional music unless trying to emulate tastes of user almost exclusively interested in Greek traditional music)
and will often be fixed for a given retrieval system. If the end user's cultural context can be assummed (e.g. western music), you might call these "objective" similarity estimators.
I believe any other context-like restrictions on the query are part of the specific query and may be explicitly defined or implied. Explicitly defined queries (which appear to be confused with contexts more often than implicitly defined queries) include:
- Find songs that are rhythmically similar to ...
- Find songs that have female vocals in them
- Find songs that are acoustically similar to (sound like) ...
whereas implicitly defined contexts iclude:
- Find songs similar to A (our basic query)
- Find songs similar to A || B || C ... (basic OR query)
- Find songs similar to A && B && C... (AND query, what is similar about A, B and C, what other other songs are similar in this way?)
- Could also be formulated as: Find songs that are to A as B is to C (B & C define a relationship, i.e they may be rhythmically similar or have the same vocalist or instrumentation, this relationship is then used to find songs related to A in this way)
- Find songs dis-similar to A, !A (NOT query, filtering?)
- Find songs similar to A and dis-similar to B, A && !B
What can we evaluate?
Evaluation of a system that works within the context of an individual user's tastes is problematic as any judgements about performance must be subjective. Evaluation of systems working within the context of a collection is can be objectively evaluated on their ability to reproduce the organisation of the collection (this approach could also be used to evaluate systems working with the context of user, by asking the user to orgaise the collection and attempting to reproduce that organisation). This implies that collection specific learning (and therefore a training set) should be allowed. Systems intended to work within the context of a music culture (such as western music) can also be evaluated on their ability to reproduce the organisation of a collection or set of collections of music. In this case a training set would not be used, or a training set which is completely independent of the test set is used (i.e. from a different source).
Assuming the training set is ignored by systems intended to work within the context of western musical culture, we can directly compare results from collection specific systems and so-called "objective" systems on the same collection (althought the distinction should be indicated in the results.
Types of evaluation
There have been a number of papers describing similarity evaluation, including those by Whitman, Berenzweig, Ellis and Logan. The methods used generally fall into the following buckets:
- Subjective precision via user tests
- Expert opinion (similar artist lists from music editors like All Music Guide)
- Playlist Co-occurrence
- User Collection Co-occurrence
- objective statistics based upon album, artist and genre labels. (TopN, average distance)
For a standard, annual evaluation like MIREX, the first four types of evaluations seem problematic.
1 - subjective precision - is very expensive to collect this data for alarge music collection, and would likely be unreliable unless many users were evaluated.
2 - Expert opinion - expert opinion will usually rate similarity of artists but not songs. Also, not transitive, coldplay may sound like the Beatles, but no one ever says the beatles sound like coldplay. This data generally only exists for popular artists (i.e. not for artists typically found in 'free' collections of music).
3,4 - Playlist Co-occurrence, User Collection co-occurrence - works for popular music, but usually not enough coverage for less popular music, generally not suitable for our test collections (such as magnatune or epitonic), since this music is not listened to by enough people.
Factors to evaluate
- feature extraction time
- distance computation time
- memory consumption durring feature extraction
- memeory consumption durring distance computation
objective statistics based upon genre (with artist filter), artist & album labels:
- closest 1 (ratio of pieces in same genre/artist/album as query)
- closest 5 (-"-)
- closest 10 (-"-)
- closest 20 (-"-)
- clustering performance (e.g. ratio of the average intra-genre distance to the average inter-genre distance, see below)
3 and 4 are difficult to evaluate: (KW) These may be slightly problematic due to the very varied formats in which algortihms are submitted. I'll see what I can come up with. Comments from smart Linux/Unix/Mac bods on how to do this sort of measurement (perhaps even with manual entry of PID) would be welcomed. I haven't the faintest idea how to do it on Windows other than manually with the task manager.
Objective statistics based upon album, artist and genre labels proposal
Use methods described in Logan & Salomon (A Music Similarity Function Based on Signal Analysis, ICME2001), where the most similar 5, 10 or 20 songs are retrieved and the average number of songs in the same genre, from the same artist and from the same album are calculated.
An additional evalaution of the clustering performance of an algorithm could be calculated as the ratio of the average intra-genre distance (within-class scatter or cohesion) to the average inter-genre distance (between class scatter or separation).
Justification for using artist and album labels
At Sun Labs we've been auditioning a number of different similarity models. We've had some models that behaved in ways that are apparently similar to your 'spectrum histogram', in that they yield good objective scores (a high percentage of songs in the top 20 are of the proper genre, the average intra-genre distance is low compared to the overall average distance), but when using the models for actual playlist generation or visualization we'd experience the similar 'space-time distortions'. The 75% of the songs might be of the proper genre, but the other 25% would be way off. We call them 'clunkers', songs that no human would say belongs in the playlist of similar songs. Also, we'd see other similar problems, where the songs in the 75% of the proper genre were not really very similar. A folk rock song would be 'near' a punk rock song, or a choral piece would be near a harpsichord piece. One way of reducing this problem would be to take into account the artist and album metadata in the evaluation. Presumably there are three enclosing clusters: at the coarsest level is a genre cluster, within this cluster we could expect to find multiple artist clusters, and within an artist cluster we may find multiple album clusters. Now, if it turns out that the similarity classifier being evaluated really makes no distinction of nearness within the genre (e.g choral and harpsichord music can be 'near'), then the ratio of the average artist distance to average genre distance will approach one (the artist cluster is as large as the genre cluster), but if this ratio is much less than one then we have small artist clusters within the genre cluster. The harpsichord music has clustered together and separated it from the choral music performed by a different artist. The album clustering can give another, finer-grained level (but I'm not convinced that this extra level is necessary).
To sum up then, the average artist distance compared to the average genre distance gives a notion of how well a similarity metric is working within a single genre.
Volatility of objective statistics and clustering metric
It has been suggested that the use of the mean average in the calculation of the above statistics is too volatile as it is influenced by outliers and we should therefore use the median average. Outliers easily sneak into any ground truth or can be caused by choosing a bad segment of the song. However it should be noted that if a participant selects a poor segment from a song that is indicative of the performance of the assumption that you can randomly select a representative sample from a piece of music (an intelligent segmentation or thumbnailing technique might have a significant advantage). However, if a poor query segment is selected for the evaluation, it will effect all submissions equally. If a particular submission handles a poor query better it should achieve a better evaluation score, rather than have that additional performance averaged out. I.e. if an algorithm doesn't produce as many outliers, that fact should represented in the evaluation score.
The use of the median average will most likely improve all calculated ratios, but will reduce the difference between algorithms. A more useful alternative may be the trimmed mean (remove 1 - 2% of results from both ends of each distribution then calculate mean). It has also been commented that, in generated playlists, outliers can ruin the perception of the performance of the rest of the generated playlist and therefore the use of a metric which deliberately ignores outliers is highly inappropriate (this applies to both the median and trimmed mean).
Another interesting statistic maybe the difference between the mean and median statistics as a lower value should indicate that the algorithm handles outliers well (is less volatile), while a higher value will indicate more outliers in the distribution.
Here's another possibility for evaluation. I notice that they are now around 60 people subscribed to this email list. I imagine that this will grow to at least 100 or so by the time the mirex 2006 evaluation begins. With this many people, perhaps we can do some real human evalutions of submitted systems. If we can get each person on this list to agree to evaluate 10 playlists and rate them on a scale of 1 to 5 in terms of how similar the songs are, then we can get 1000 playlists evaluated. If 50 systems are submitted, that means we can have 20 different people look at each playlist. Here's what we could do:
- Each system would be given a seed song and asked to generate a 'playlist' of the 10 most similar songs (all systems would receive the same seed song)
- The lists would be anonymized (song and artists removed from titles to prevent bias from the metadata)
- The playlists would be randomly distributed to the set of evaluators to be scored (a web interface would suffice).
- Evaluators assign a 1 to 5 rating to each of their ten assigned playlists
- The average score for each playlist would recorded as part of the evaluation
- The resulting playlists can be made available for browsing after the evaluation results are published.
Since music similarity is such a human quantity I think that it is important to have some user evaluation of similarty as part of MIREX. Thoughts?
Alternatives to rating on a scale
A pairwise comparison approach ("Is playlist A better or worse than playlist B?") might be more appropriate than rating on a scale (where, for example, individuals might prefer or avoid the middle of the scale by nature, or where all ten of someone's playlists could be in reality very bad compared to the set of all generated playlists, but he or she would have no basis to judge them as such). A one-bit measurement (relevant or irrelevant) has really paid off in text retrieval. I expect it would pay off in music retrieval as well.
A one bit measure of overall similarity may be tricky to come by. But there are plenty of one bit musical features that could be measured quickly and pretty consistently, and tracks with a sufficient proportion of these in common can reasonably be called similar. This is presumably how the folks at pandora.com manage to process a new track in 20 seconds. Could this approach work?