2010:Structural Segmentation

From MIREX Wiki
Revision as of 15:06, 23 March 2010 by MertBay (talk | contribs) (Discussion from 2009)

Description

This task was first run in 2009. The text of this section is copied from the 2009 page. Please add your comments and discussions for 2010.

The segment structure (or form) is one of the most important musical parameters. It is furthermore special because musical structure -- especially in popular music genres -- is accessible to everybody: it needs no particular musical knowledge.

Data format

Input

Single channel, CD-quality audio (PCM, 16-bit, 44100 Hz).

Output

Three column text file of the format

<onset_time> <offset_time> <label>
<onset_time> <offset_time> <label>
...

This is the same as Chris Harte's chord labelling files (.lab), and so is the same format as the ground truth as well. Onset and offset times are given in seconds, and the labels are simply letters: 'A', 'B', ... with segments referring to the same structural element having the same label.

Ground Truth

Ground truth data on audio is available for more than 200 songs, so given a quality measure everyone agrees on, evaluation wouldn't be harder than on other MIREX tasks.

Jouni Paulus's structure analysis page links to a corpus of 177 Beatles songs (zip file). The Beatles annotations are not a part of the TUTstructure07 dataset. That dataset contains 557 songs, a list of which is available here.

Ewald Peiszer's thesis page links to a portion of the corpus he used: 43 non-Beatles pop songs (including 10 J-pop songs) (zip file).

Those public corpora give a combined 220 songs.

Evaluation Measures

At the last ISMIR conference Lukashevich proposed a measure for segmentation evaluation. Because of the complexity of the structural segmentation task definition, several different evaluation measures will be employed to address different aspects. It should be noted that none of the evaluation measures cares about the true labels of the sections: they only denote the clustering. This means that it does not matter if the systems produce true labels such as "chorus" and "verse", or arbitrary labels such as "A" and "B".

Boundary retrieval

Hit rate Found segment boundaries are accepted to be correct if they are within 0.5s (Turnbull et al. ISMIR2007) or 3s (Levy & Sandler TASLP2008) from a border in the ground truth. Based on the matched hits, boundary retrieval recall rate, boundary retrieval precision rate, and boundary retrieval F-measure are be calculated.

Median deviation Two median deviation measure between boundaries in the result and ground truth are calculated: median true-to-guess is the median time from boundaries in ground truth to the closest boundaries in the result, and median guess-to-true is similarly the median time from boundaries in the result to boundaries in ground truth. (Turnbull et al. ISMIR2007)

Frame clustering

Both the result and the ground truth are handled in short frames (e.g., beat or fixed 100ms). All frame pairs in a structure description are handled. The pairs in which both frames are assigned to the same cluster (i.e., have the same label) form the sets P_E (for the system result) and P_A (for the ground truth). The pairwise precision rate can be calculated by P = \frac{|P_E \cap P_A|}{|P_E|}, pairwise recall rate by R = \frac{|P_E \cap P_A|}{|P_A|}, and pairwise F-measure by F=\frac{2 P R}{P + R}. (Levy & Sandler TASLP2008)

Normalised conditional entropies

Over- and under segmentation based evaluation measures proposed in Lukashevich ISMIR2008. Structure descriptions are represented as frame sequences with the associated cluster information (similar to the Frame clustering measure). Confusion matrix between the labels in ground truth and the result is calculated. The matrix C is of size |L_A| * |L_E|, i.e., number of unique labels in the ground truth times number of unique labels in the result. From the confusion matrix, the joint distribution is calculated by normalising the values with the total number of frames F:

p_{i,j} = C_{i,j} / F

Similarly, the two marginals are calculated:

p_i^a = \sum_{j=1}^{|L_E|} C{i,j}/F , and

p_j^e = \sum_{i=1}^{|L_A|} C{i,j}/F

Conditional distributions:

p_{i,j}^{a|e} = C_{i,j} / \sum_{i=1}^{|L_A|} C{i,j} , and

p_{i,j}^{e|a} = C_{i,j} / \sum_{j=1}^{|L_E|} C{i,j}

The conditional entropies will then be

H(E|A) = - \sum_{i=1}^{|L_A|} p_i^a \sum_{j=1}^{|L_E|} p_{i,j}^{e|a} \log_2(p_{i,j}^{e|a}), and

H(A|E) = - \sum_{j=1}^{|L_E|} p_j^e \sum_{i=1}^{|L_A|} p_{i,j}^{a|e} \log_2(p_{i,j}^{a|e})

The final evaluation measures will then be the oversegmentation score

S_O = 1 - \frac{H(E|A)}{\log_2(|L_E|)} , and the undersegmentation score

S_U = 1 - \frac{H(A|E)}{\log_2(|L_A|)}

Potential Participants

Discussion for 2010

Discussion from 2009

https://www.music-ir.org/mirex/2009/index.php/Structural_Segmentation#Issues_and_Discussion