2020:Singing Transcription from Polyphonic Music

From MIREX Wiki
Revision as of 15:56, 19 August 2020 by Yun Hao


The goal of this task is to transcribe the vocal part of polyphonic music into a sequence of notes, where each note has an onset, an offset, and a score pitch. The input is a music recording, mostly pop music, that contains a vocal part and accompaniment. The vocal part is monophonic, but the accompaniment is not. Algorithms may therefore separate the vocal part from the audio as a preprocessing step, but this is not required: algorithms that perform singing transcription directly on the mixed audio are also welcome. This task differs from “audio melody extraction”: melody extraction estimates a pitch for each frame, whereas singing transcription determines the notes of the vocal part.


Two datasets can be used to construct and evaluate a model for singing transcription:

RWC Music Database [1]: Popular Music (RWC-MDB-P)

We can use the “Popular Music Database” part of the RWC database for this task. RWC-MDB-P consists of 100 songs with annotations in MIDI format (AIST annotation). The database is available at https://staff.aist.go.jp/m.goto/RWC-MDB/rwc-mdb-p.html, and the annotations are available at https://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation. After excluding the 6 songs (No. 3, 5, 8, 10, 23, and 66) that have multiple singers (so the melody part is not monophonic), the remaining 94 songs can be used for this task.
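As a small illustration of the exclusion above, the usable song numbers can be listed as follows. This sketch assumes the songs are simply numbered 1 to 100; the variable names are illustrative only.

```python
# Song numbers excluded in the task description (multiple singers).
EXCLUDED = {3, 5, 8, 10, 23, 66}

# Assumption: RWC-MDB-P songs are numbered 1..100.
usable_songs = [n for n in range(1, 101) if n not in EXCLUDED]

print(len(usable_songs))
```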

Cmedia-200 dataset

This dataset consists of 200 YouTube links to Chinese pop songs, together with their groundtruth vocal transcription files. We will release 100 of the songs as the open set for training/validation, and use the other 100 as the hidden test set.


We will use the Python package “mir_eval” [2] to evaluate the accuracy of a transcription by computing the COnPOff, COnP and COn metrics described in [3]. These metrics compute the maximum number of groundtruth notes that are correctly transcribed. Each groundtruth note can be matched with only one transcribed note, and vice versa. Three rules determine whether two notes match:

1. The onset difference is less than the threshold (100ms in this competition).
2. The pitch difference is less than the threshold (50 cents in this competition).
3. The offset difference is less than the threshold, defined in this competition as max(50ms, 0.2 * duration of the groundtruth note).

Two notes must satisfy all three conditions to be considered “correctly transcribed” in the COnPOff metric. COnP requires only (1) and (2), and COn requires only (1). We will compute the F1-score of COnPOff, COnP and COn on each song. The final results reported are the average F1-scores of the three metrics.

The COnPOff metric is the same as the evaluation metric of the “note tracking” subtask in the MIREX “Multiple Fundamental Frequency Estimation & Tracking” task. The only difference is that the onset threshold is set to 100ms instead of 50ms, because the onsets of a singing voice are difficult to label (and possibly to transcribe).

We will also report another metric of our own, the “shifted COn” metric. To compute it, a sub-optimal algorithm automatically shifts the whole prediction of a song in time. Once a suitable time shift has been found, we compute the COn metric between the shifted prediction file and the groundtruth. The result is named “shifted COn”.
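The three matching rules and the one-to-one matching can be sketched in plain Python as below. This is an illustrative re-implementation under the thresholds stated above, not the mir_eval code used for the official evaluation; the names `Note`, `notes_match`, `count_matches` and `f1_score` are hypothetical, and pitches are assumed to be MIDI note numbers, so one semitone equals 100 cents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    onset: float   # seconds
    offset: float  # seconds
    pitch: float   # MIDI note number (assumption: 1 semitone = 100 cents)


def notes_match(ref, est, mode="COnPOff", onset_tol=0.1,
                pitch_tol_cents=50.0, offset_ratio=0.2, offset_min_tol=0.05):
    """Return True if `est` matches `ref` under the rules required by `mode`."""
    if abs(ref.onset - est.onset) >= onset_tol:              # rule 1
        return False
    if mode == "COn":
        return True
    if 100.0 * abs(ref.pitch - est.pitch) >= pitch_tol_cents:  # rule 2
        return False
    if mode == "COnP":
        return True
    # Rule 3: tolerance depends on the duration of the groundtruth note.
    offset_tol = max(offset_min_tol, offset_ratio * (ref.offset - ref.onset))
    return abs(ref.offset - est.offset) < offset_tol


def count_matches(refs, ests, mode="COnPOff"):
    """Size of a maximum one-to-one matching (augmenting-path search)."""
    match_of_est = [None] * len(ests)  # index of the ref matched to each est

    def try_assign(i, seen):
        for j, est in enumerate(ests):
            if j in seen or not notes_match(refs[i], est, mode):
                continue
            seen.add(j)
            if match_of_est[j] is None or try_assign(match_of_est[j], seen):
                match_of_est[j] = i
                return True
        return False

    return sum(try_assign(i, set()) for i in range(len(refs)))


def f1_score(refs, ests, mode="COnPOff"):
    """F1-score of one song for the given metric."""
    if not refs or not ests:
        return 0.0
    matched = count_matches(refs, ests, mode)
    precision = matched / len(ests)
    recall = matched / len(refs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: groundtruth and a prediction with typical errors.
ref = [Note(0.131, 0.355, 64), Note(0.355, 0.896, 64),
       Note(0.896, 1.141, 62), Note(1.888, 2.333, 62)]
est = [Note(0.150, 0.360, 64),   # correct onset, pitch and offset
       Note(0.400, 0.700, 64),   # offset too early
       Note(0.900, 1.150, 60),   # wrong pitch
       Note(2.100, 2.400, 62)]   # onset too late

for mode in ("COn", "COnP", "COnPOff"):
    print(f"{mode}: {f1_score(ref, est, mode):.2f}")
# prints COn: 0.75, COnP: 0.50, COnPOff: 0.25
```

In this example three predicted notes have an acceptable onset, two of those also have the right pitch, and only one also has an acceptable offset, which is why the scores decrease from COn to COnPOff.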

Submission Format

Input Format

Sample rate: 44.1 kHz
Sample size: 16 bit
Number of channels: 2 (stereo)
Encoding: WAV
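A submission might want to verify that an input file has the stated format before processing it. The following is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files; the function name `check_wav_format` is hypothetical.

```python
import wave


def check_wav_format(path):
    """Return a list of mismatches against the stated input format."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 44100:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 44100")
        if wav.getsampwidth() != 2:  # sample width in bytes; 2 bytes = 16 bit
            problems.append(f"sample size is {8 * wav.getsampwidth()} bit, expected 16")
        if wav.getnchannels() != 2:
            problems.append(f"{wav.getnchannels()} channel(s), expected 2 (stereo)")
    return problems
```

An empty list means the file matches the format above.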

Output Format

The algorithm should output a plain text file. Each line represents a note and contains three numbers: onset, offset and score pitch. Onset and offset are in seconds, while score pitch should be an integer MIDI number in semitones. An example output file is shown next:

0.131 0.355 64
0.355 0.896 64
0.896 1.141 62
1.888 2.333 62
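The output format can be read and written with a few lines of Python. This is a sketch that assumes whitespace-separated fields and three decimal places for times, as in the example; the function names `parse_notes` and `format_notes` are hypothetical.

```python
def parse_notes(text):
    """Parse 'onset offset pitch' lines into (float, float, int) tuples."""
    notes = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}")
        notes.append((float(fields[0]), float(fields[1]), int(fields[2])))
    return notes


def format_notes(notes):
    """Format notes as one 'onset offset pitch' line per note."""
    return "".join(f"{onset:.3f} {offset:.3f} {pitch:d}\n"
                   for onset, offset, pitch in notes)


example = "0.131 0.355 64\n0.355 0.896 64\n0.896 1.141 62\n1.888 2.333 62\n"
notes = parse_notes(example)
print(notes[0])  # (0.131, 0.355, 64)
```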

Since the vocal part is monophonic, in the groundtruth the offset of a note is never greater than the onset of the next note, i.e., notes do not overlap. The duration of a note is also always positive: the offset time of a note is always larger than its onset time. We strongly recommend that submitted algorithms follow these rules.
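Both rules are easy to check before writing the output file. The sketch below assumes notes are given in file order as (onset, offset, pitch) tuples; `find_rule_violations` is a hypothetical helper name.

```python
def find_rule_violations(notes):
    """Check positive durations and no overlap between consecutive notes."""
    violations = []
    for i, (onset, offset, _pitch) in enumerate(notes):
        if offset <= onset:
            violations.append(f"note {i}: duration is not positive")
        # An offset equal to the next onset is allowed; a greater one is not.
        if i > 0 and notes[i - 1][1] > onset:
            violations.append(f"note {i}: overlaps the previous note")
    return violations


valid = [(0.131, 0.355, 64), (0.355, 0.896, 64), (0.896, 1.141, 62)]
invalid = [(0.131, 0.400, 64), (0.355, 0.355, 64)]
print(find_rule_violations(valid))    # []
print(len(find_rule_violations(invalid)))  # 2
```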

Command line calling format

The submitted algorithm must take two arguments: a SINGLE .wav file to transcribe, and the full path and filename of the output file. The ability to specify the output path and file name is essential. Denoting the input .wav file path and name as %input and the output file path and name as %output, a program called “main” could be called from the command line as follows:

./main %input %output
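For a Python submission, an entry point with this calling convention might look like the sketch below. The `transcribe` function is a hypothetical placeholder for the actual model, and the three-decimal time formatting is an assumption based on the example output.

```python
import sys


def transcribe(wav_path):
    """Hypothetical placeholder: a real system would return its notes here."""
    return []  # list of (onset_sec, offset_sec, midi_pitch) tuples


def run(input_path, output_path):
    """Transcribe one .wav file and write one note per line."""
    notes = transcribe(input_path)
    with open(output_path, "w", encoding="utf-8") as out:
        for onset, offset, pitch in notes:
            out.write(f"{onset:.3f} {offset:.3f} {int(pitch)}\n")


if __name__ == "__main__":
    if len(sys.argv) == 3:
        run(sys.argv[1], sys.argv[2])
    else:
        print("usage: main <input.wav> <output.txt>", file=sys.stderr)
```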

Time limits

The time limit is 24 hours. Within that time, the algorithm should transcribe all 294 songs in the RWC and Cmedia datasets, whose total duration is about 20 hours.
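As a rough back-of-the-envelope check, the limit implies the following processing budget. The figures come from the text above (the 20-hour total is approximate), and the variable names are illustrative.

```python
TIME_LIMIT_HOURS = 24
TOTAL_AUDIO_HOURS = 20   # approximate, from the task description
NUM_SONGS = 94 + 200     # RWC songs used + Cmedia songs

# How many times the audio duration the processing may take.
max_realtime_factor = TIME_LIMIT_HOURS / TOTAL_AUDIO_HOURS

# Average wall-clock budget per song, in seconds.
avg_seconds_per_song = TIME_LIMIT_HOURS * 3600 / NUM_SONGS

print(f"up to {max_realtime_factor:.1f}x the audio duration")
print(f"about {avg_seconds_per_song:.0f} s per song on average")
```

In other words, an algorithm needs to run at roughly real-time speed or faster to finish within the limit.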


[1] Masataka Goto, Hiroki Hashiguchi, Takuichi Nishimura, and Ryuichi Oka: “RWC Music Database: Popular, Classical, and Jazz Music Databases,” Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR 2002), pp.287-288, October 2002.

[2] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis: “mir_eval: A Transparent Implementation of Common MIR Metrics,” in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), 2014.

[3] E. Molina, A. M. Barbancho-Perez, L. J. Tardón, I. Barbancho-Perez: “Evaluation Framework for Automatic Singing Transcription,” in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), pp.567-572, October 2014.