2006:Score Following

Overview

This page is devoted to discussions of the evaluation of Score Following algorithms at MIREX 2006. Discussions have already begun on the MIREX 06 "ScoreFollowing06" contest planning list (https://mail.lis.uiuc.edu/mailman/listinfo/mrx-com01) and will be briefly digested here. A full digest of the discussions is available to subscribers from the MIREX 06 "ScoreFollowing06" contest planning archive (https://mail.lis.uiuc.edu/mailman/listinfo/mrx-com01).

As consensus is achieved on the planning list, a full proposal (2006:Score Following Proposal) will be produced for the format of the evaluation, including pseudocode for the evaluation metric and suggested formats for submitted algorithms. A skeleton of the proposal is already available on the 2006:Score Following Proposal page.

Moderators

Arshia Cont (University of California in San Diego (UCSD) and Ircam - Realtime Applications Team, France) - cont@ircam.fr

Introduction

Score Following is the real-time alignment of an incoming music signal to the music score. The music signal can be symbolic (MIDI Score Following) or audio. Score following has more than 20 years of history, but there has been neither a collective attempt at, nor a comprehensive framework for, the evaluation and testing of different score followers. Moreover, different systems are geared towards different applications (page turning, accompaniment, synchronization, etc.) and run on different input (audio or symbolic MIDI) with different reference types (MIDI score or reference audio).

This page serves as a summary of the discussions held on the ScoreFollowing06 mailing list and will eventually hold a final evaluation proposal for MIREX 2006.

Important threads on the discussion list

Contributors:

Christopher Raphael [CR]
Arshia Cont [AC]
Diemo Schwarz [DS]
George Litterst [GL]
Kris West [KW]
Arie Van Schutterhoef
Michael Good
Stephen Downie

Acceptable Systems

We need to distinguish between realtime and offline alignment. Offline audio to score alignment can itself be a separate subject for evaluation. For this evaluation session, we are focusing on REALTIME audio to score alignment, or score following. The main issues here are causality and, of course, computation time. Offline audio to score alignment systems are mostly non-causal and use future data for alignment. It would not be fair to compare online and offline systems in one session [AC].

Christopher Raphael

I didn't understand that we are limiting attention to online score following. While it is okay with me to do this, I think online score following is not really as well defined as offline score following. The difficulty is that an online score follower delivers its note onset estimates with some (possibly variable) latency. As this latency increases the problem gradually morphs into offline following. In some cases (say a musical accompaniment system that predicts future events using current observations) small latency does very little damage. In others, such as when a system waits to "hear" a live player before responding, latency is basically just as bad as detection error. So, in the context of MIREX evaluations, I think it is hard to weigh latency and error, since the tradeoff depends so heavily on context.

Diemo Schwarz

Regarding real time vs. non real time systems, the real distinction is, as Arshia pointed out, causality. I suggest allowing systems that don't (yet) run in real time, as long as they're causal. (That excludes DTW-based alignment systems, for instance.)

Arie Van Schutterhoef

Also there is the issue of what one uses as score. In the ComParser system, the score is a reference audio.

Christopher Raphael

I think I agree that the causality of the score follower is the important issue, not real time. However, it is not as obvious what is meant by this as in the case of a filter. It seems to me that any reasonable definition of a score follower must allow the algorithm to determine that a note onset occurs at frame t-s while (and not until) examining frame t. So dynamic time warping really satisfies this criterion since it reports all of its estimates at the final frame, though the latency (s) is clearly quite bad.

Kris West

Are you going to assume honesty and allow the submission to decode/play the audio or MIDI itself (in which case you might want to have start and end markers to keep an eye on the timing/playback speed)? Or will you use an external playback device and require submissions to sample the input from the line-in or MIDI port themselves?

Evaluation Metrics

Arshia Cont

Let's imagine the following scenario for evaluation: We have some scores + some audio performance (or Midi performance) + (pre)ALIGNED (midi) scores to the audio. Simulating a score follower would give us some kind of online segmentation of the audio using the score as prior information. We can use these results and compare with the reference (aligned score).

How to compare is, of course, something we need to figure out. We should measure delays. We should also have "human errors" in the audio (pitch, rhythm, etc.) and study the tolerance of the system. There are other criteria as well that we should identify.

Christopher Raphael

Just to make sure we are all on the same page, by "latency" I mean the difference in time between when a note onset is estimated and when it is reported. So if a score follower estimates that note 27 begins at time t and this decision is made at time s, then s-t is the latency.

I think that the damage done by latency varies from nearly none to essentially the same as estimation error, depending on musical context. But most often, I think latency, at least in a musical accompaniment system, is less serious than estimation error. How about if we simply measure both average (absolute) estimation error and average latency and make the overall measure a weighted sum of the two, with higher weight given to error?
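The weighted measure suggested here can be sketched as follows. This is only an illustration: the function name `combined_score` and the default weights (0.75 for error, 0.25 for latency) are assumptions and were not agreed on the list. All times are assumed to be in milliseconds.

```python
from statistics import mean


def combined_score(reference_onsets, estimated_onsets, detection_times,
                   error_weight=0.75, latency_weight=0.25):
    """Weighted sum of mean absolute estimation error and mean latency (ms).

    The weights are illustrative assumptions; error gets the higher weight.
    """
    if not (len(reference_onsets) == len(estimated_onsets) == len(detection_times)):
        raise ValueError("all three lists must have one entry per note")
    # Estimation error: estimated onset versus ground-truth onset.
    errors = [abs(est - ref)
              for est, ref in zip(estimated_onsets, reference_onsets)]
    # Latency: time the decision was made (s) minus the estimated onset (t).
    latencies = [det - est
                 for det, est in zip(detection_times, estimated_onsets)]
    return error_weight * mean(errors) + latency_weight * mean(latencies)
```

A lower value is better; changing the two weights shifts the balance between accuracy and responsiveness.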

Monophonic music is plenty challenging, especially with the online problem. Also this includes everyone, so I would vote for that.

Diemo Schwarz

Regarding what to measure, I suggest taking a look at the discussion in our NIME 2003 article [1], section 4 on evaluation, expanded in my thesis [2], section 5.6 "Evaluation of Alignment", page 44ff.

I think Christopher's definition of latency is appropriate for the application of automatic accompaniment (with tempo estimation). The other main application is real-time score recognition, where latency is the most important quality indicator. So we define it as the time difference between the reference alignment mark and the detected event onset (note or rest).

The obvious task is latency comparison, excluding errors (errors are when the latency is outside of some threshold or missed events).
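A minimal sketch of this comparison, assuming each note is identified by a key shared between the reference and the detected events, and assuming a threshold of 300 ms. The threshold value, the function name `summarize_alignment` and the dictionary layout are all hypothetical; none of them was fixed in the discussion.

```python
def summarize_alignment(reference, detected, threshold_ms=300.0):
    """Separate errors from latency statistics.

    reference: dict mapping note id -> reference alignment mark (ms)
    detected:  dict mapping note id -> detected event onset (ms)
    threshold_ms is an assumed value, not one agreed on the list.
    """
    # Missed events: notes in the reference that were never reported.
    missed = [note for note in reference if note not in detected]
    # Latency in the sense used here: detected onset minus reference mark.
    offsets = {note: detected[note] - reference[note]
               for note in reference if note in detected}
    # Events whose latency falls outside the threshold count as errors.
    outliers = [note for note, d in offsets.items() if abs(d) > threshold_ms]
    kept = [d for d in offsets.values() if abs(d) <= threshold_ms]
    mean_latency = sum(kept) / len(kept) if kept else None
    return {"missed": missed, "outliers": outliers, "mean_latency": mean_latency}
```

Latency is then averaged only over the events that are neither missed nor outside the threshold, so the two kinds of failure are reported separately.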

Arshia Cont

In a recent paper, Dan Ellis and his student [4] discuss evaluation in the context of music transcription. The important remark in their evaluation is that they measure MISSED NOTES, FALSE ALARMS, TOTAL ERROR and SUBSTITUTION ERROR separately. Reading the paper, I think we can adopt this approach also for score following evaluation. Read specifically pages 8 and 9.

Christopher Raphael

I think that as a general rule accuracy is more important than latency, though we agree now that it doesn't make sense to completely disregard latency.

In score following, where you know the events you are looking for, I don't see what a false alarm would be. If it means detecting a score note early, I am not sure if that is any worse than detecting it late, so this would be covered by estimation error. Substitutions also don't make any sense in score following, at least to me, since you are only going to estimate times for the notes of the score. I suppose that a system might decide it doesn't want to estimate a time for a certain note because it isn't sure where it is. This could be a *missed* note. So, I don't really believe substitutions or false alarms make sense in this problem.

George Litterst

In the case of evaluating score-following programs, the big question in my mind is: How well does a particular program handle a real-world, musical task that is important to the user? Accordingly, I think that the evaluation should be done under real world circumstances. Here are two examples:

The user wants a system that outputs a coordinated accompaniment based on the solo performance of the user. How musical is the output of the score-follower before training? How musical is the output of the score-follower after training?

The user wants to see a musical score display (with no audible accompaniment) that is coordinated with his/her playing (thus avoiding the need for manual page-turning). How well coordinated is the score display?

In the first case, there is an obvious need for tight coordination of the accompaniment with the soloist or else the score-follower is useless. From a user point of view, a modest amount of training may not be an issue. If the training takes a week of practice before the results are acceptable, the program is probably not going to be useful to anyone.

In the second case, if the user can always see many measures of music at one time (including the current measure), he/she may not care that the score-follower has a small amount of latency because that latency does not interfere with the end result.

Groundtruth Database

Arshia Cont

For this, of course, we need to construct an aligned audio to score database with participants' contributions.

Christopher Raphael

My preference is for using audio that doesn't come from MIDI for evaluation. Of course, this requires some way of getting ground truth. One not-too-bad approach is to play the music and tap along on a key. Then one can record the tap times. It isn't really possible to get all the notes this way, but maybe beats are enough.

Diemo Schwarz

To obtain ground truth reference alignments, we can work with the offline polyphonic alignment system described in [3].

Input/Output types

We should come up with a common way of outputting data. An easy way would be to send ASCII markers for the score event + timing (in milliseconds) [AC]. The evaluation framework would have to execute your binary/script, collect the ASCII outputs and log the time the signal was received [KW]. It was suggested that we might forget about the actual real-time component of the problem and just examine "online" (= no lookahead) systems. In this case I think the system needs only to output two times (samples) for each note:

the time the system believes the note begins
the time at which the above estimate was arrived at [CR]

Based on this, it was proposed by [AC] and [DS] to format the reference as well as the output of systems as follows:

The reference files constitute a ground truth alignment between a MIDI score and a recording of it. They have one line per score note, with the columns:

note onset time in reference audio file (ms)
note start time in score (ms)
MIDI note number in score (int)

The result files represent the alignment found by a score following system between a MIDI score and a recording of a performance of it. They have one line per detected note with the columns:

estimated note onset time in performance audio file (ms)
detection time relative to performance audio file (ms)
note start time in score (ms)
MIDI note number in score (int)
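A hedged sketch of reading a result file in the four-column layout listed above. The column separator is not specified in the discussion, so whitespace-separated fields are an assumption, as are the names `ResultLine` and `parse_result_lines`.

```python
from typing import List, NamedTuple


class ResultLine(NamedTuple):
    estimated_onset_ms: float   # estimated note onset in the performance audio
    detection_time_ms: float    # time at which the estimate was reported
    score_time_ms: float        # note start time in the score
    midi_note: int              # MIDI note number in the score


def parse_result_lines(text: str) -> List[ResultLine]:
    """Parse one detected note per line; blank lines are skipped."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # assumption: whitespace-separated columns
        if not fields:
            continue
        if len(fields) != 4:
            raise ValueError(f"line {line_number}: expected 4 columns, got {len(fields)}")
        estimated, detected, score_time = (float(value) for value in fields[:3])
        rows.append(ResultLine(estimated, detected, score_time, int(fields[3])))
    return rows
```

The per-note latency in Christopher Raphael's sense is then `detection_time_ms - estimated_onset_ms` for each parsed line.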

For this year's MIREX, we think it would be easier to use MIDI score for Solo parts.

Issue of Training

Arshia Cont

Since some systems need offline training to obtain parameters, we need to be careful in organizing the references and assignments for training and tests. Ideally we should train on a rehearsal recording of each particular piece, but in practice training is instrument-specific and is affected when recording conditions change heavily during performance.

Christopher Raphael

I use both instrument-specific and piece-specific training, but my system can work without either of them. Of the two, the piece-specific training is more useful to me. If you are using piece-specific training too, it would be nice to have a version of the tests where this training is used. Another reason to do this is that it includes the folks who do matching from audio since this could be considered piece-specific training.

Paris Meeting

Score Following Evaluation meeting in Paris

Location: Ircam-Centre Pompidou

Date: 10/08/2006

Present: Stephen Downie (UIUC/MIREX), Diemo Schwarz (Ircam), Arshia Cont (Ircam/UCSD)

Structure of the Database

The database for Score Following MIREX will contain different musical pieces (audio, score and training data if needed). It is suggested to follow Chris Raphael's convention, i.e. to put all files specific to a music piece in a separate folder. Each folder must contain (at least) one score and the corresponding audio to be aligned, and also a REFERENCE (for evaluation).

For systems that need training (or sessions with training), it was suggested to have an additional subfolder (for each piece) with different performances of the same audio plus alignment.

Naming/File conventions

A standard naming convention should be derived for the database since we are calling INPUT FOLDERS (See (2)). One issue is coherent file types. For audio it is suggested to use either AIFF or WAV files, and for the score, MIDI. The output, as discussed on the list, should be standardized ASCII text.

ISSUE: How are we going to define classes such as TRILLS (which should usually be considered as a separate object -- not defined in MIDI)?

System call format

During evaluation, each system will be called from the COMMAND LINE with the following format:

<system-execution-file> <input-folder> <output-filename>

The input folder contains the score and the audio performance of the score. Your submitted binaries should be able to BROWSE this folder, use the appropriate score and audio file, undertake the score following task, and write the results to the output file as given.

It is important to be able to create the output ASCII file in a "different" path than the default.

In order to consider the issue of training, an alternative call format would be:

<system-execution-file> <input-folder> <output-filename> <training-folder>

where the training folder contains appropriate files for training. Obviously, if this third argument is not given, it is assumed that there is no learning/training phase.
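As an illustration of the two call formats, a submission's entry point might handle its arguments roughly as sketched here. The helper names `parse_args` and `find_inputs` are hypothetical, and picking files by extension (.mid/.midi for the score, .wav/.aif/.aiff for audio) is an assumption, since the naming convention for the database has not been settled.

```python
from pathlib import Path

AUDIO_EXTENSIONS = {".wav", ".aif", ".aiff"}   # assumed audio file extensions
SCORE_EXTENSIONS = {".mid", ".midi"}           # assumed score file extensions


def parse_args(argv):
    """Return (input_folder, output_filename, training_folder or None)."""
    if len(argv) not in (3, 4):
        raise SystemExit("usage: <system-execution-file> <input-folder> "
                         "<output-filename> [<training-folder>]")
    training_folder = argv[3] if len(argv) == 4 else None
    return argv[1], argv[2], training_folder


def find_inputs(input_folder):
    """Browse the input folder and return (score_path, audio_path)."""
    files = sorted(p for p in Path(input_folder).iterdir() if p.is_file())
    scores = [p for p in files if p.suffix.lower() in SCORE_EXTENSIONS]
    audio = [p for p in files if p.suffix.lower() in AUDIO_EXTENSIONS]
    if not scores or not audio:
        raise FileNotFoundError("input folder needs a score and an audio file")
    # If several candidates exist, the first in sorted order is used.
    return scores[0], audio[0]
```

A real submission would call `parse_args(sys.argv)`, run its follower on the files returned by `find_inputs`, and write the result lines to the given output filename, wherever that path points.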

System wrapper

Participating systems are on different platforms and might need a wrapper to prepare audio/score for the evaluation task of their systems. At this stage, it was *suggested* by Arshia Cont to simulate streaming to ensure the causality of the submitted system. In this sense, the wrapper sends audio frames into the system (and not the whole audio). Of course, we have to study the feasibility of this given the time constraints of this year's task.
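One way such a streaming wrapper could look is sketched below. Everything here is an assumption made for illustration: the callback `follower_step` stands in for a submitted system and is taken to return the onset estimates (in samples) it wants to report after seeing each frame, and the frame size of 512 samples is arbitrary.

```python
def stream_frames(samples, frame_size):
    """Yield (start_index, frame) pairs so the system never sees future audio."""
    for start in range(0, len(samples), frame_size):
        yield start, samples[start:start + frame_size]


def run_causally(follower_step, samples, frame_size=512):
    """Feed audio frame by frame and timestamp each report in the wrapper.

    follower_step(frame) is a hypothetical callback returning a list of
    estimated onset positions (in samples) decided after this frame.
    """
    reports = []
    for start, frame in stream_frames(samples, frame_size):
        frame_end = start + len(frame)  # audio delivered so far, in samples
        for estimated_onset in follower_step(frame):
            # Detection time is assigned by the wrapper, not by the system.
            reports.append((estimated_onset, frame_end))
    return reports
```

Because the wrapper, rather than the system, records the detection time, the reported latency cannot be smaller than what the delivered audio allows.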
