2020:Singing Transcription from Polyphonic Music
The goal of this task is to transcribe the vocal part of polyphonic music into a sequence of notes, where each note consists of an onset, an offset and a score pitch. The input is a music recording, mostly pop music, that contains a vocal part and accompaniment. The vocal part is monophonic, but the accompaniment is not.
Therefore, algorithms that separate the vocal part from the audio may be used as a preprocessing step. However, separation is not a required part of a singing transcription algorithm; algorithms that transcribe directly from the mixed audio are also welcome.
This task differs from “audio melody extraction”: melody extraction determines a pitch for each frame, whereas singing transcription determines the notes of the vocal part.
Two datasets can be used to construct and evaluate a model for singing transcription:
RWC Music Database : Popular Music (RWC-MDB-P)
We can use the “Popular Music Database” part of the RWC database (Goto et al., 2002) for this task. RWC-MDB-P consists of 100 songs and annotations in MIDI format (AIST annotation). The database is available at https://staff.aist.go.jp/m.goto/RWC-MDB/rwc-mdb-p.html, and the annotations are available at https://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation. After excluding the 6 songs (No. 3, 5, 8, 10, 23, and 66) that have multiple singers, and hence a melody part that is not monophonic, the remaining 94 songs can be used for this task.
However, since the pitch annotations of the RWC database are not very accurate, mostly due to octave errors, we will not evaluate models on the RWC database.
The Cmedia dataset consists of 200 YouTube links to pop songs (most of them Chinese), together with their groundtruth vocal transcription files. We will release 100 of the songs as the open set for training/validation, and use the other 100 as the hidden set for testing. The TRAINING set can be downloaded here. We strongly suggest that participants with data-driven algorithms use this dataset as the training set, since the properties of the Cmedia training set are close to those of the Cmedia hidden set.
We will use the Python package “mir_eval” (Raffel et al., 2014) to evaluate the accuracy of a transcription by computing the COnPOff, COnP and COn metrics described in Molina et al. (2014). These metrics are based on the maximum number of groundtruth notes that are correctly transcribed. Each groundtruth note can be matched with at most one transcribed note, and vice versa. Three rules determine whether two notes match each other:
1. The onset difference is less than the threshold (100ms in this competition).
2. The pitch difference is less than the threshold (50 cents in this competition).
3. The offset difference is less than the threshold, defined in this competition as max(50ms, 0.2 * duration of the groundtruth note).
Two notes must satisfy all three conditions above to be considered “correctly transcribed” under the COnPOff metric. COnP requires only conditions (1) and (2), and COn requires only condition (1). We will compute the F1-score of COnPOff, COnP and COn on each song. The final reported result for each of the three metrics is its F1-score averaged over all songs.
In fact, the COnPOff metric is the same as the evaluation metric of the “note tracking” subtask in the MIREX “Multiple Fundamental Frequency Estimation & Tracking” task. The only difference is that the onset threshold is set to 100ms instead of 50ms, because onsets of the singing voice are difficult to label (and perhaps to transcribe).
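The matching rules above can be illustrated with a small pure-Python sketch. This is not the mir_eval implementation used for the official evaluation. The function names (`notes_match`, `count_matches`, `evaluate`) and the `(onset, offset, midi_pitch)` tuple layout are assumptions made for illustration. Pitch is assumed to be a MIDI number, so one semitone equals 100 cents. Comparisons are strict ("less than"), following the wording of the rules here, so the behavior at exact threshold values may differ from mir_eval.

```python
ONSET_TOL = 0.1          # seconds (100ms in this competition)
PITCH_TOL_CENTS = 50.0   # cents
OFFSET_RATIO = 0.2       # fraction of the groundtruth note duration
OFFSET_MIN_TOL = 0.05    # seconds (50ms)


def notes_match(ref, est, use_pitch, use_offset):
    """Check rules 1-3 for one groundtruth note and one transcribed note."""
    ref_on, ref_off, ref_pitch = ref
    est_on, est_off, est_pitch = est
    # Rule 1: onset difference below the threshold.
    if abs(ref_on - est_on) >= ONSET_TOL:
        return False
    # Rule 2: pitch difference below the threshold (1 semitone = 100 cents).
    if use_pitch and abs(ref_pitch - est_pitch) * 100.0 >= PITCH_TOL_CENTS:
        return False
    # Rule 3: offset difference below max(50ms, 0.2 * groundtruth duration).
    if use_offset:
        offset_tol = max(OFFSET_MIN_TOL, OFFSET_RATIO * (ref_off - ref_on))
        if abs(ref_off - est_off) >= offset_tol:
            return False
    return True


def count_matches(ref_notes, est_notes, use_pitch, use_offset):
    """Maximum number of one-to-one matches (augmenting-path bipartite matching)."""
    candidates = [
        [j for j, est in enumerate(est_notes)
         if notes_match(ref, est, use_pitch, use_offset)]
        for ref in ref_notes
    ]
    ref_of_est = [-1] * len(est_notes)  # which groundtruth note each estimate is matched to

    def try_assign(i, seen):
        for j in candidates[i]:
            if j in seen:
                continue
            seen.add(j)
            if ref_of_est[j] == -1 or try_assign(ref_of_est[j], seen):
                ref_of_est[j] = i
                return True
        return False

    return sum(1 for i in range(len(ref_notes)) if try_assign(i, set()))


def f1_score(n_matched, n_ref, n_est):
    """F1-score from the number of matched, groundtruth and transcribed notes."""
    if n_matched == 0:
        return 0.0
    precision = n_matched / n_est
    recall = n_matched / n_ref
    return 2.0 * precision * recall / (precision + recall)


def evaluate(ref_notes, est_notes):
    """Return the F1-score of COnPOff, COnP and COn for one song."""
    settings = {"COnPOff": (True, True), "COnP": (True, False), "COn": (False, False)}
    return {
        name: f1_score(count_matches(ref_notes, est_notes, use_pitch, use_offset),
                       len(ref_notes), len(est_notes))
        for name, (use_pitch, use_offset) in settings.items()
    }
```

For example, given the four groundtruth notes `(0.131, 0.355, 64)`, `(0.355, 0.896, 64)`, `(0.896, 1.141, 62)`, `(1.888, 2.333, 62)` and the transcription `(0.140, 0.350, 64)`, `(0.360, 0.700, 64)`, `(0.900, 1.150, 63)`, `(2.100, 2.400, 62)`, the first note matches under all three metrics, the second fails only the offset rule, the third fails the pitch rule, and the fourth fails the onset rule.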
Sample rate: 44.1 kHz
Sample size: 16 bit
Number of channels: 2 (stereo)
Encoding: WAV
The algorithm should output a plain text file. Each line represents one note and contains three numbers: onset, offset and score pitch. Onset and offset are floating-point numbers in seconds, while score pitch is an integer MIDI number in semitones. An example output file is shown below:
0.131 0.355 64
0.355 0.896 64
0.896 1.141 62
1.888 2.333 62
Since the vocal part is monophonic, in the groundtruth the offset of a note is never greater than the onset of the next note, i.e., notes never overlap. Also, the duration of a note is always positive; that is, the offset time of a note is always larger than its onset time. We strongly recommend that submitted algorithms follow these rules.
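A minimal sketch of how a submission might check these rules before writing its output is given below. The helper names `validate_notes` and `format_notes` are hypothetical. The sketch assumes notes are `(onset, offset, midi_pitch)` tuples already sorted by onset, and the three-decimal formatting simply mirrors the example output rather than any stated requirement.

```python
def validate_notes(notes):
    """Raise ValueError if the notes break the output rules.

    Assumes `notes` is a list of (onset, offset, midi_pitch) sorted by onset.
    """
    previous_offset = None
    for index, (onset, offset, pitch) in enumerate(notes):
        # The duration of every note must be positive.
        if offset <= onset:
            raise ValueError(f"note {index}: offset {offset} is not larger than onset {onset}")
        # A note may not start before the previous note has ended.
        if previous_offset is not None and onset < previous_offset:
            raise ValueError(f"note {index}: onset {onset} overlaps the previous note")
        # The score pitch must be an integer MIDI number.
        if int(pitch) != pitch:
            raise ValueError(f"note {index}: pitch {pitch} is not an integer MIDI number")
        previous_offset = offset


def format_notes(notes):
    """Render one 'onset offset pitch' line per note."""
    return "".join(f"{onset:.3f} {offset:.3f} {int(pitch)}\n"
                   for onset, offset, pitch in notes)
```

Calling `validate_notes` on the example above raises nothing, since each note starts at or after the end of the previous one. A list such as `[(0.0, 0.5, 60), (0.4, 0.8, 62)]` is rejected because the second note begins before the first ends.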
Command line calling format
The submitted algorithm must take two arguments: a SINGLE .wav file on which to perform singing transcription, and the full output path and filename of the output file. The ability to specify the output path and file name is essential. Denoting the input .wav file path and name as %input and the output file path and name as %output, a program called “main” could be called from the command line as follows:
./main %input %output
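As a rough illustration of this calling convention, a Python submission could be organized around an entry point like the one sketched below. Everything here is an assumption rather than a required design: `transcribe` is a hypothetical placeholder that returns no notes, `read_wav_format` uses the standard-library `wave` module only to compare the file against the input format listed above, and the exit codes are arbitrary.

```python
import sys
import wave


def transcribe(wav_path):
    """Hypothetical placeholder for the actual transcription algorithm.

    Should return a list of (onset_sec, offset_sec, midi_pitch) tuples.
    """
    return []


def read_wav_format(wav_path):
    """Return (sample_rate, bits_per_sample, channels) of a WAV file."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


def main(argv):
    """Handle a call of the form: main %input %output."""
    if len(argv) != 3:
        print(f"usage: {argv[0]} INPUT.wav OUTPUT.txt", file=sys.stderr)
        return 2
    input_path, output_path = argv[1], argv[2]

    # Warn, but continue, if the file is not 44.1 kHz / 16 bit / stereo.
    if read_wav_format(input_path) != (44100, 16, 2):
        print("warning: unexpected input format", file=sys.stderr)

    notes = transcribe(input_path)
    with open(output_path, "w") as out:
        for onset, offset, pitch in notes:
            out.write(f"{onset:.3f} {offset:.3f} {int(pitch)}\n")
    return 0

# An executable wrapper would end with: sys.exit(main(sys.argv))
```

Keeping the argument handling in a `main(argv)` function, separate from the process exit call, makes it possible to exercise the same code path from a test without launching a subprocess.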
The time limit is 24 hours. Within this limit, the algorithm should transcribe all 200 songs in the RWC and Cmedia datasets, whose total duration is about 14 hours.
If the algorithm cannot transcribe all 200 songs in time, that is still acceptable. However, the algorithm must transcribe at least the 100 songs of the Cmedia hidden set within the time limit; otherwise no evaluation result can be reported. The algorithm will be executed on a computer with 64GB RAM and one NVIDIA GeForce GTX 1080 Ti GPU.
 Masataka Goto, Hiroki Hashiguchi, Takuichi Nishimura, and Ryuichi Oka: “RWC Music Database: Popular, Classical, and Jazz Music Databases,” Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR 2002), pp.287-288, October 2002.
 C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis, “mir_eval: A Transparent Implementation of Common MIR Metrics,” in Proceedings of the 15th International Conference on Music Information Retrieval, 2014.
 E. Molina, A. M. Barbancho-Perez, L. J. Tardón, I. Barbancho-Perez: “Evaluation Framework for Automatic Singing Transcription,” in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), pp.567-572, October 2014.