Difference between revisions of "2020:Singing Transcription from Polyphonic Music"

From MIREX Wiki

Revision as of 23:16, 24 August 2020

Description

The goal of this task is to transcribe the vocal part of polyphonic music into a series of notes, each consisting of an onset, an offset, and a score pitch. The input is a music recording, mostly pop music, that contains a vocal part and some accompaniment. The vocal part is monophonic, but the accompaniment is not.

Algorithms may therefore separate the vocal part from the audio as a preprocessing step. However, this is not a required part of a singing transcription algorithm. Algorithms that perform singing transcription directly on the mixed audio are also welcome.

This task differs from “audio melody extraction”: melody extraction determines a pitch for each frame, whereas singing transcription determines the notes of the vocal part.

Data

Two datasets can be used to construct and evaluate a model for singing transcription:

RWC Music Database : Popular Music (RWC-MDB-P)

We can use the “Popular Music Database” part of the RWC database [1] for this task. RWC-MDB-P consists of 100 songs and annotations in MIDI format (AIST annotation). The database is available at https://staff.aist.go.jp/m.goto/RWC-MDB/rwc-mdb-p.html, and the annotations are available at https://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation. After excluding the 6 songs (No. 3, 5, 8, 10, 23, and 66) with multiple singers (where the melody part is not monophonic), the remaining 94 songs can be used for this task.

However, since the pitch annotations of the RWC database are not quite accurate, mostly due to octave errors, we will not evaluate models using the RWC database.

Cmedia dataset

This dataset consists of 200 YouTube links to pop songs (most of them Chinese songs), together with their groundtruth vocal transcription files. We will release 100 songs of the dataset as the open set for training/validation, and use the other 100 as the hidden set for testing.

The TRAINING set can be downloaded at https://drive.google.com/file/d/15b298vSP9cPP8qARQwa2X_0dbzl6_Eu7/. We strongly suggest that participants use this dataset as their training set (if the algorithm is data-driven), since the properties of the Cmedia training set are close to those of the Cmedia hidden set.

Evaluation

We will use the Python package “mir_eval” [2] to evaluate the accuracy of a transcription by computing the COnPOff, COnP and COn metrics described in [3]. These metrics compute the maximum number of groundtruth notes that are correctly transcribed. Each groundtruth note can be matched with only one transcribed note, and vice versa. Three rules determine whether two notes match:

1. The onset difference is less than the threshold (100ms in this competition).
2. The pitch difference is less than the threshold (50 cents in this competition).
3. The offset difference is less than the threshold defined by max(50ms, 0.2 * duration of groundtruth note) in this competition.

Two notes must satisfy all three conditions above to be considered “correctly transcribed” under the COnPOff metric. COnP requires only (1) and (2), while COn requires only (1).

We will compute the f1-score of COnPOff, COnP and COn on each song. The final results reported are the average f1-score of the three metrics.
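The three rules and the one-to-one matching can be illustrated with a small standalone sketch. It is not the mir_eval implementation used for the official evaluation. The function names, the strict "less than" comparison taken literally from the rules above, and the example notes are all assumptions made for illustration. Pitches are MIDI numbers, so one semitone is 100 cents.

```python
from typing import List, Tuple

Note = Tuple[float, float, float]  # (onset in s, offset in s, MIDI pitch)


def notes_match(ref: Note, est: Note, use_pitch: bool, use_offset: bool) -> bool:
    """Apply rules (1)-(3) to one groundtruth/transcribed note pair."""
    ref_on, ref_off, ref_pitch = ref
    est_on, est_off, est_pitch = est
    if not abs(ref_on - est_on) < 0.100:  # rule 1: onset within 100 ms
        return False
    if use_pitch and not 100.0 * abs(ref_pitch - est_pitch) < 50.0:  # rule 2
        return False
    if use_offset:
        # rule 3: tolerance depends on the groundtruth note duration
        tolerance = max(0.050, 0.2 * (ref_off - ref_on))
        if not abs(ref_off - est_off) < tolerance:
            return False
    return True


def count_matches(ref_notes: List[Note], est_notes: List[Note],
                  use_pitch: bool, use_offset: bool) -> int:
    """Size of a maximum one-to-one matching, found with augmenting paths."""
    candidates = [
        [j for j, est in enumerate(est_notes)
         if notes_match(ref, est, use_pitch, use_offset)]
        for ref in ref_notes
    ]
    est_owner = [-1] * len(est_notes)  # groundtruth index matched to each estimate

    def try_assign(i: int, seen: set) -> bool:
        for j in candidates[i]:
            if j in seen:
                continue
            seen.add(j)
            if est_owner[j] == -1 or try_assign(est_owner[j], seen):
                est_owner[j] = i
                return True
        return False

    return sum(try_assign(i, set()) for i in range(len(ref_notes)))


def f1_score(ref_notes: List[Note], est_notes: List[Note],
             use_pitch: bool, use_offset: bool) -> float:
    """F1 of one song; COn, COnP and COnPOff differ only in the two flags."""
    if not ref_notes or not est_notes:
        return 0.0
    matched = count_matches(ref_notes, est_notes, use_pitch, use_offset)
    if matched == 0:
        return 0.0
    precision = matched / len(est_notes)
    recall = matched / len(ref_notes)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: four groundtruth notes and an imperfect transcription.
ref = [(0.131, 0.355, 64), (0.355, 0.896, 64), (0.896, 1.141, 62), (1.888, 2.333, 62)]
est = [(0.150, 0.360, 64), (0.360, 0.700, 64), (0.900, 1.150, 63), (2.100, 2.400, 62)]
con = f1_score(ref, est, use_pitch=False, use_offset=False)
conp = f1_score(ref, est, use_pitch=True, use_offset=False)
conpoff = f1_score(ref, est, use_pitch=True, use_offset=True)
```

In this example three onsets fall within 100 ms, two of those notes also agree in pitch, and only one of them also satisfies the offset rule, so the score drops as each rule is added.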

In fact, the COnPOff metric is the same as the evaluation metric of the “note tracking” subtask in the MIREX “Multiple Fundamental Frequency Estimation & Tracking” task. The only difference is that the onset threshold is set to 100ms instead of 50ms, because the onsets of a singing voice are difficult to label (and perhaps to transcribe).

Submission Format

Input Format

Sample rate: 44.1 kHz
Sample size: 16 bit
Number of channels: 2 (stereo)
Encoding: WAV
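A submission may want to verify that a file matches these properties before processing it. The following is a minimal sketch using Python's standard `wave` module. The function name `check_input_wav` is hypothetical and not part of the task definition.

```python
import wave


def check_input_wav(path: str) -> bool:
    """Return True if the file is 44.1 kHz, 16-bit, stereo PCM WAV."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == 44100      # sample rate: 44.1 kHz
            and wav.getsampwidth() == 2      # sample size: 16 bit = 2 bytes
            and wav.getnchannels() == 2      # stereo
        )
```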

Output Format

The algorithm should output a plain text file. Each line represents a note and contains three numbers: onset, offset and score pitch. Onset and offset are floating-point numbers in seconds, while score pitch should be an integer MIDI number in semitones. An example output file is shown below:

0.131 0.355 64
0.355 0.896 64
0.896 1.141 62
1.888 2.333 62

Since the vocal part is monophonic, in the groundtruth the offset of a note is never greater than the onset of the next note, i.e., notes do not overlap. Also, the duration of a note is always positive; that is, the offset time of a note is always larger than its onset time. We strongly suggest that submitted algorithms follow these rules.
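The format and the two rules above can be checked with a short sketch like the following. The helper names are hypothetical, the notes are assumed to be sorted by onset, and writing three decimal places is an assumption based on the example file rather than a stated requirement.

```python
from typing import List, Tuple

Note = Tuple[float, float, int]  # (onset in s, offset in s, MIDI pitch)


def validate_notes(notes: List[Note]) -> List[str]:
    """Return a list of rule violations; an empty list means the notes are valid."""
    problems = []
    for i, (onset, offset, _pitch) in enumerate(notes):
        if not offset > onset:
            problems.append(f"note {i}: duration is not positive")
        if i > 0 and notes[i - 1][1] > onset:
            problems.append(f"note {i}: overlaps the previous note")
    return problems


def format_notes(notes: List[Note]) -> str:
    """Render one 'onset offset pitch' line per note."""
    return "".join(f"{on:.3f} {off:.3f} {int(pitch)}\n" for on, off, pitch in notes)


def parse_notes(text: str) -> List[Note]:
    """Read notes back from the plain text format, skipping blank lines."""
    notes = []
    for line in text.splitlines():
        if line.strip():
            onset, offset, pitch = line.split()
            notes.append((float(onset), float(offset), int(pitch)))
    return notes


example = [(0.131, 0.355, 64), (0.355, 0.896, 64), (0.896, 1.141, 62), (1.888, 2.333, 62)]
example_text = format_notes(example)
```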

Command line calling format

The submitted algorithm must take two arguments: a SINGLE .wav file on which to perform singing transcription, and the full path and filename of the output file. The ability to specify the output path and filename is essential. Denoting the input .wav file path and name as %input and the output file path and name as %output, a program called “main” could be called from the command line as follows:

./main %input %output
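For a Python submission, the calling convention above could be handled by an entry point along these lines. This is only a sketch: `transcribe` is a hypothetical placeholder for the actual model, and the exit codes are an assumption.

```python
import sys
from typing import List, Tuple


def transcribe(wav_path: str) -> List[Tuple[float, float, int]]:
    """Hypothetical placeholder: a real submission would run its model here."""
    return []


def main(argv: List[str]) -> int:
    """Expect exactly two arguments: the input .wav path and the output path."""
    if len(argv) != 2:
        print("usage: main <input.wav> <output.txt>", file=sys.stderr)
        return 2
    input_path, output_path = argv
    notes = transcribe(input_path)
    with open(output_path, "w") as out:
        for onset, offset, pitch in notes:
            out.write(f"{onset:.3f} {offset:.3f} {pitch}\n")
    return 0

# In the actual script, the entry point would end with:
#     sys.exit(main(sys.argv[1:]))
```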

Time limits

The time limit is 24 hours. Within 24 hours, the algorithm should transcribe all 200 songs in the RWC and Cmedia datasets, whose total duration is about 14 hours.

It is acceptable if the algorithm cannot transcribe all 200 songs in time. However, the algorithm must transcribe at least 100 songs (from the Cmedia hidden set) within the time limit; otherwise, no evaluation result can be reported.

The algorithm will be executed on a computer with 64 GB of RAM and one NVIDIA GeForce GTX 1080 Ti GPU.
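As a rough guide, the figures above translate into a processing budget as sketched below. The total duration is only given as "about 14hr", so the results are approximate, and the calculation ignores any fixed start-up cost such as model loading.

```python
TIME_LIMIT_HOURS = 24.0
TOTAL_AUDIO_HOURS = 14.0  # "about 14hr" in the task description, so approximate
NUM_SONGS = 200


def max_realtime_factor(limit_hours: float, audio_hours: float) -> float:
    """Seconds of processing available per second of audio."""
    return limit_hours / audio_hours


def budget_minutes_per_song(limit_hours: float, num_songs: int) -> float:
    """Average processing time available per song, in minutes."""
    return limit_hours * 60.0 / num_songs


realtime_factor = max_realtime_factor(TIME_LIMIT_HOURS, TOTAL_AUDIO_HOURS)  # about 1.71
per_song = budget_minutes_per_song(TIME_LIMIT_HOURS, NUM_SONGS)  # about 7.2 minutes
```

In other words, an algorithm that runs at up to roughly 1.7 times the audio duration on the stated hardware would finish all 200 songs within the limit.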

Reference

[1] Masataka Goto, Hiroki Hashiguchi, Takuichi Nishimura, and Ryuichi Oka: “RWC Music Database: Popular, Classical, and Jazz Music Databases,” Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR 2002), pp.287-288, October 2002.

[2] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis: “mir_eval: A Transparent Implementation of Common MIR Metrics,” in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), October 2014.

[3] E. Molina, A. M. Barbancho-Perez, L. J. Tardón, I. Barbancho-Perez: “Evaluation Framework for Automatic Singing Transcription,” in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), pp.567-572, October 2014.