2020:Singing Transcription from Polyphonic Music


The goal of this task is to transcribe the vocal part of polyphonic music into a series of notes. Each note is denoted by three numbers: onset, offset, and score pitch. The input is a music recording (mostly pop music) that contains a vocal part and some accompaniment, and the output is a series of notes. The vocal part is monophonic, but the accompaniment is not.

To deal with this task, algorithms that separate the vocal part from the audio may therefore be considered as a preprocessing step. However, this is not a necessary part of a singing transcription algorithm. Algorithms that directly perform singing transcription on mixed audio are also welcome.

This task is different from “audio melody extraction”: the target of melody extraction is to determine the pitch of each frame, while singing transcription aims to determine the notes of the vocal part.

It is also worth noting that the definition of “singing transcription” is not very specific. Some studies [1][2] regard this term as the task of “transcribing polyphonic music into notes”, but other studies [3][4] seem to regard it as the task of “transcribing monophonic signals without accompaniment into notes”, since both [3] and [4] created monophonic datasets (without accompaniment) for “automatic singing transcription”.

Therefore, to make the name more specific, we call the task of “transcribing polyphonic music that contains only monophonic vocal into notes” singing transcription from polyphonic music.


Two datasets can be used to construct and evaluate a model for singing transcription:

RWC Music Database : Popular Music (RWC-MDB-P)

We can use the “Popular Music Database” part of the RWC database [5] for this task. RWC-MDB-P consists of 100 songs and annotations in MIDI format (AIST annotation). The database is available at https://staff.aist.go.jp/m.goto/RWC-MDB/rwc-mdb-p.html, and the annotations are available at https://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation. After excluding the 6 songs (No. 3, 5, 8, 10, 23, and 66) that have multiple singers (so the melody part is not monophonic), the remaining 94 songs can be used for this task.

Cmedia dataset

This dataset consists of 200 YouTube links to pop songs (most of them Chinese songs), together with their groundtruth vocal transcription files. We will release 100 of them as the open set for training/validation (training set), and use the other 100 as the hidden set for testing.

The training set can be downloaded here. We strongly suggest that participants use this dataset as the training set (if your algorithm is data-driven), since the properties of the Cmedia training set are close to those of the Cmedia hidden set.


We will use the Python package “mir_eval” [6] to evaluate the accuracy of a transcription by computing the COnPOff, COnP and COn metrics described in [4]. These metrics compute the maximum number of groundtruth notes that are correctly transcribed. Each groundtruth note can be matched with only one transcribed note, and vice versa. Three rules determine whether two notes match:

1. The onset difference is less than the threshold (100ms in this competition).

2. The pitch difference is less than the threshold (50 cents in this competition).

3. The offset difference is less than the threshold, defined in this competition as max(50ms, 0.2 * duration of the groundtruth note).

Two notes must satisfy all three conditions above to be considered “correctly transcribed” under the COnPOff metric. COnP only requires the two notes to satisfy (1) and (2), while COn only requires them to satisfy (1). We will compute the F1-score of COnPOff, COnP and COn on each song. The final result reported for each of the three metrics is its F1-score averaged over the songs.
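The matching rules above can be sketched in plain Python as follows. This is an illustrative re-implementation under stated assumptions, not the mir_eval code used for the official evaluation: all function names are hypothetical, notes are (onset, offset, MIDI pitch) tuples, the comparisons are strict ("less than") as worded above, and a song with no groundtruth or no transcribed notes is scored 0.

```python
ONSET_TOL = 0.1          # rule 1: 100 ms
PITCH_TOL_CENTS = 50.0   # rule 2: 50 cents
OFFSET_MIN_TOL = 0.05    # rule 3: 50 ms ...
OFFSET_RATIO = 0.2       # ... or 0.2 * groundtruth duration, whichever is larger


def notes_match(ref, est, mode="COnPOff"):
    """Check one groundtruth note against one transcribed note.

    mode is "COn" (rule 1), "COnP" (rules 1-2) or "COnPOff" (rules 1-3).
    """
    ref_on, ref_off, ref_pitch = ref
    est_on, est_off, est_pitch = est
    if not abs(est_on - ref_on) < ONSET_TOL:
        return False
    if mode == "COn":
        return True
    # One semitone (one MIDI number) equals 100 cents.
    if not abs(est_pitch - ref_pitch) * 100.0 < PITCH_TOL_CENTS:
        return False
    if mode == "COnP":
        return True
    offset_tol = max(OFFSET_MIN_TOL, OFFSET_RATIO * (ref_off - ref_on))
    return abs(est_off - ref_off) < offset_tol


def count_matches(ref_notes, est_notes, mode="COnPOff"):
    """Size of a maximum one-to-one matching (augmenting-path search)."""
    candidates = [
        [j for j, est in enumerate(est_notes) if notes_match(ref, est, mode)]
        for ref in ref_notes
    ]
    ref_of_est = {}  # transcribed note index -> groundtruth note index

    def try_assign(i, visited):
        for j in candidates[i]:
            if j in visited:
                continue
            visited.add(j)
            # Take a free note, or re-route the note's current owner.
            if j not in ref_of_est or try_assign(ref_of_est[j], visited):
                ref_of_est[j] = i
                return True
        return False

    return sum(try_assign(i, set()) for i in range(len(ref_notes)))


def f1_score(ref_notes, est_notes, mode="COnPOff"):
    """F1-score of one song for the chosen metric."""
    if not ref_notes or not est_notes:
        return 0.0  # assumption: empty input scores zero
    matched = count_matches(ref_notes, est_notes, mode)
    if matched == 0:
        return 0.0
    precision = matched / len(est_notes)
    recall = matched / len(ref_notes)
    return 2 * precision * recall / (precision + recall)
```

Calling `f1_score(ref_notes, est_notes, mode)` once per song for each of the three modes and averaging over the songs gives one figure per metric. Results on notes that lie exactly on a threshold may differ from mir_eval, which should be treated as the reference.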

In fact, the COnPOff metric is the same as the evaluation metric of the “note tracking” subtask in the MIREX “Multiple Fundamental Frequency Estimation & Tracking” task. The only difference is that the onset threshold is set to 100ms instead of 50ms, due to the difficulty of labeling (and perhaps transcribing) the onsets of a singing voice.

Simple evaluation code can be downloaded from here.

Submission Format

Input Format

Sample rate: 44.1 kHz

Sample size: 16 bit

Number of channels: 2 (stereo)

Encoding: WAV
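As a minimal sketch, a submission could verify that an input file matches the format above before processing it. The example uses Python's standard `wave` module. The function name `check_wav_format` and the choice to return a list of messages are assumptions made for illustration only.

```python
import wave

EXPECTED_RATE = 44100    # 44.1 kHz
EXPECTED_WIDTH = 2       # bytes per sample, i.e. 16 bit
EXPECTED_CHANNELS = 2    # stereo


def check_wav_format(path):
    """Return a list of mismatches between a WAV header and the stated format."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_WIDTH:
            problems.append(f"sample size is {8 * wav.getsampwidth()} bit")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"number of channels is {wav.getnchannels()}")
    return problems
```

An empty list means the header agrees with the three stated properties.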

Output Format

The algorithm should output a plain text file. Each line represents a note and contains three numbers: onset, offset and score pitch. Onset and offset are floating-point numbers in seconds, while score pitch is an integer MIDI note number (in semitones). An example output file is shown below:

0.131 0.355 64
0.355 0.896 64
0.896 1.141 62
1.888 2.333 62

Since the vocal part is monophonic, the offset of a note in the groundtruth is never greater than the onset of the next note, i.e., notes do not overlap. The duration of a note is also always positive; that is, the offset time of a note is always larger than its onset time. We strongly suggest that submitted algorithms follow these rules.
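The output format and the two rules above can be checked with a short script. The following is a sketch, assuming whitespace-separated fields as in the example file; the function names `parse_notes` and `find_violations` are hypothetical.

```python
def parse_notes(text):
    """Parse output-file text into (onset, offset, midi_pitch) tuples."""
    notes = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        onset, offset, pitch = line.split()
        notes.append((float(onset), float(offset), int(pitch)))
    return notes


def find_violations(notes):
    """List notes that break the positive-duration or no-overlap rule."""
    problems = []
    for i, (onset, offset, _pitch) in enumerate(notes):
        if offset <= onset:
            problems.append(f"note {i}: offset is not larger than onset")
        if i + 1 < len(notes) and offset > notes[i + 1][0]:
            problems.append(f"note {i}: overlaps the next note")
    return problems


example = """0.131 0.355 64
0.355 0.896 64
0.896 1.141 62
1.888 2.333 62
"""
notes = parse_notes(example)
violations = find_violations(notes)  # empty for the example file above
```

Note that `int(pitch)` rejects a non-integer pitch such as `64.5`, which matches the requirement that score pitch be an integer MIDI number.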

Command line calling format

The submitted algorithm must take as arguments a SINGLE .wav file to transcribe, as well as the full output path and filename of the output file. The ability to specify the output path and filename is essential. Denoting the input .wav file path and name as %input and the output file path and name as %output, a program called “main” could be called from the command line as follows:

./main %input %output
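A program that follows this calling convention could be structured as in the sketch below. Everything here is illustrative: `transcribe` is a hypothetical placeholder for the actual model, and the three-decimal time formatting is an assumption based on the example output file, not a stated requirement.

```python
import sys


def transcribe(wav_path):
    """Hypothetical placeholder: return a list of (onset, offset, midi_pitch)."""
    return []


def format_notes(notes):
    """Render notes as one 'onset offset pitch' line per note."""
    return "".join(
        f"{onset:.3f} {offset:.3f} {int(pitch)}\n" for onset, offset, pitch in notes
    )


def main(argv):
    """Entry point taking [program, input_wav, output_txt]; returns an exit code."""
    if len(argv) != 3:
        print("usage: main <input.wav> <output.txt>", file=sys.stderr)
        return 2
    input_path, output_path = argv[1], argv[2]
    notes = transcribe(input_path)
    with open(output_path, "w") as out:
        out.write(format_notes(notes))
    return 0

# When saved as an executable script, end the file with:
#     sys.exit(main(sys.argv))
```

Keeping the argument handling in `main(argv)` separate from the model makes it possible to test the input and output handling without running the transcription itself.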

Time limits

The time limit is 24 hours. Within 24 hours, the algorithm should transcribe all 200 songs in the Cmedia dataset, whose total duration is about 14 hours.

It is still acceptable if the algorithm cannot transcribe all 200 songs in time. However, the algorithm must transcribe at least the 100 songs of the Cmedia hidden set within the time limit; otherwise, no evaluation result can be reported.

The algorithm will be executed on a computer with 64GB of memory and one NVIDIA GeForce GTX 1080 Ti GPU.


If you have any questions about this task or the datasets, please feel free to send us an email: b06902046@ntu.edu.tw (Jun-You Wang) or roger.jang@gmail.com (Jyh-Shing Roger Jang).

Since this is a new MIREX task, we eagerly welcome your help in getting everything on track.

Submission deadline

September 13th, 2020.


[1] M. Ryynänen and A. Klapuri: “Transcription of the Singing Melody in Polyphonic Music,” in Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR 2006), pp.222-227, October 2006.

[2] R. Nishikimi, E. Nakamura, S. Fukayama, M. Goto, K. Yoshii: “Automatic Singing Transcription Based on Encoder-decoder Recurrent Neural Networks with a Weakly-supervised Attention Mechanism,” in Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

[3] E. Gómez and J. Bonada: “Towards computer-assisted flamenco transcription: An experimental comparison of automatic transcription algorithms as applied to a cappella singing,” Computer Music Journal, 37(2):73–90, 2013.

[4] E. Molina, A. M. Barbancho-Perez, L. J. Tardón, I. Barbancho-Perez: “Evaluation Framework for Automatic Singing Transcription,” in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), pp.567-572, October 2014.

[5] Masataka Goto, Hiroki Hashiguchi, Takuichi Nishimura, and Ryuichi Oka: “RWC Music Database: Popular, Classical, and Jazz Music Databases,” Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR 2002), pp.287-288, October 2002.

[6] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis: “mir_eval: A Transparent Implementation of Common MIR Metrics,” in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), 2014.