2021:Lyrics Transcription (former: Automatic Lyrics-to-Audio Alignment)


Description

This year we host the MIREX 2021: Automatic Lyrics Transcription challenge. Systems are evaluated in two settings, monophonic and polyphonic singing (see below); you are free to participate in either of them or in both.

The task of Lyrics Transcription aims to identify the words from sung utterances, in the same way as in automatic speech recognition. This can be mathematically expressed as follows:

 Prediction(w) = argmax_w P(w|X)

where w denotes the word sequence and X the acoustic features.
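In a conventional ASR-style formulation this posterior is typically factorised into an acoustic model and a language model; the decomposition below is given only for orientation and is not mandated by the challenge:

 Prediction(w) = argmax_w P(w|X) = argmax_w p(X|w) P(w)

where p(X|w) is the acoustic-model likelihood and P(w) the language-model prior.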

Ideally, the lyrics transcriber should return meaningful word sequences:

 Prediction(w)  = [ <w_1>, <w_2>, ..., <w_N> ]

The algorithm receives either monophonic singing performances or a polyphonic mix (singing voice + musical accompaniment). Both cases are evaluated separately in this challenge.

Submission Format

Submissions should be packaged in a compressed file (.zip, .rar, etc.) that contains at least the following two files:

A) The main transcription script

The main transcription script to execute. This should be executable with a single command line and provided as a bash (.sh) script, a Python (.py) script, or a binary file.

I / O

The submitted algorithm must take as arguments an audio file and the full output path to save the transcriptions. The ability to specify the output path and file name is essential.

Denoting the input audio file path as ${input_audio_path} and the output file path and name as ${output}, a program called foobar will be called from the command line as follows:

foobar ${input_audio_path}  ${output}

OR with flags:

foobar -i ${input_audio_path}  -o ${output}
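For illustration, here is a minimal Python sketch of such an entry point using the flag-based convention. The script name transcribe.py and the placeholder transcribe() function are hypothetical, and treating ${output} as a directory follows the Output File Format section below; none of this is a required implementation.

 #!/usr/bin/env python3
 # transcribe.py -- hypothetical submission entry point (sketch only).
 # Called as: python transcribe.py -i ${input_audio_path} -o ${output}
 import argparse
 import os
 def transcribe(audio_path):
     """Placeholder for the actual transcription model; returns a list of words."""
     return ["no", "more", "carefree"]
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Lyrics transcription submission (sketch)")
     parser.add_argument("-i", "--input", dest="input_audio_path", required=True)
     parser.add_argument("-o", "--output", dest="output", required=True)
     args = parser.parse_args()
     words = transcribe(args.input_audio_path)
     song_id = os.path.splitext(os.path.basename(args.input_audio_path))[0]
     os.makedirs(args.output, exist_ok=True)
     with open(os.path.join(args.output, song_id + ".txt"), "w") as f:
         f.write(" ".join(words))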

Input Audio

Participating algorithms must be able to read audio in the following input format (a minimal loading sketch follows the list):

  • Audio format: WAV / MP3
  • CD quality (PCM, 16-bit, 44100 Hz)
  • single channel (mono) for a cappella recordings (Hansen) and two channels for the original mixes
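A minimal loading sketch, assuming librosa is available (decoding MP3 additionally requires an audio backend such as ffmpeg); variable names are illustrative only:

 import librosa
 # Downmix to mono and (re)sample to 44.1 kHz, as specified above.
 y, sr = librosa.load(input_audio_path, sr=44100, mono=True)  # y: 1-D float waveform, sr == 44100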

Output File Format

A text file (one per song) containing the list of recognised words separated by white space:

 <word_1> <word_2> ... <word_N>

Any non-word items (e.g. silence, music, noise or end-of-sentence tokens) should be removed from the final output.

This file should ideally be located at:

 ${output}/${input_song_id}.txt
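As an illustration, a small sketch of this clean-up and write-out step; the non-word marker set, helper name and song-id derivation are assumptions, so adapt them to your decoder's symbols.

 import os
 # Hypothetical post-processing: drop non-word tokens, then write the word list
 # to ${output}/${input_song_id}.txt as space-separated plain text.
 NON_WORDS = {"<sil>", "<music>", "<noise>", "<s>", "</s>", "<unk>"}
 def write_transcription(words, output_dir, song_id):
     clean = [w for w in words if w not in NON_WORDS]
     os.makedirs(output_dir, exist_ok=True)
     with open(os.path.join(output_dir, song_id + ".txt"), "w") as f:
         f.write(" ".join(clean))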


B) The README file

This file must contain detailed installation instructions, instructions for running the main script, and contact information.


Any submission that fails to meet the above requirements will not be considered in the evaluation!

Training Datasets

Datasets in automatic lyrics transcription research fall into two domains according to whether musical instruments accompany the singer: monophonic and polyphonic datasets.

The former contain only a solo singing voice performing the lyrics, while the latter also include musical accompaniment.

In this challenge, participants are encouraged but not obliged to use the open-source datasets below, which are also commonly used in the literature for benchmarking automatic lyrics transcription (ALT) results:

DAMP dataset

The DAMP - Sing!300x30x2 dataset (https://zenodo.org/record/2747436#.Xyge4xMzZ0s) consists of solo (monophonic) singing recordings performed by amateur singers, collected via a mobile karaoke application. The data is curated to be gender-balanced and contains performers from 30 different countries, which introduces a good amount of variability in accents and pronunciation. A list of recordings is available at https://docs.google.com/spreadsheets/d/1YwhPhXU6t-BMZfdEODS_pNW_umFIsciYL62kh-fiBWI/edit?usp=sharing; for more details see the accompanying paper.

  • The audio can be downloaded from the Smule web site (https://ccrma.stanford.edu/damp/).
  • Lyrics boundary annotations can be generated from the raw annotations using https://github.com/groadabike/Kaldi-Dsing-task.
  • Alternatively, annotations can be retrieved directly in Kaldi format from https://github.com/emirdemirel/ALTA/s5/data.

DALI Dataset

DALI (a large Dataset of synchronised Audio, LyrIcs and notes) is the benchmark dataset for building acoustic models on polyphonic recordings. It contains over 5000 songs with semi-automatically aligned lyrics annotations. The songs are full-length commercial recordings, and the lyrics are annotated at several levels of granularity, including words and notes (with the syllables underlying each note). For each song, DALI provides a link to a matched YouTube video from which the audio can be retrieved.

  • For more details, see its full description at https://github.com/gabolsgabs/DALI.

Evaluation Datasets

The following datasets are used for evaluation and so cannot be used by participants to train their models under any circumstance.

Note that the evaluation sets listed below consist of popular songs in the English language and overlap with DALI. If you use DALI for training, you MUST exclude these evaluation songs from your training data for a scientifically valid evaluation.
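A minimal sketch of this exclusion step, assuming you have a list of DALI training entries with title metadata and a hand-collected list of evaluation song titles; the helper names and the title-based matching are assumptions, and in practice matching on artist plus title is safer.

 # Hypothetical filtering of a DALI-based training list against the evaluation songs.
 def normalise(title):
     """Lowercase and strip punctuation so near-identical titles match."""
     return "".join(ch for ch in title.lower() if ch.isalnum() or ch.isspace()).strip()
 def exclude_evaluation_songs(dali_entries, eval_titles):
     banned = {normalise(t) for t in eval_titles}
     return [entry for entry in dali_entries if normalise(entry["title"]) not in banned]
 # eval_titles should cover the Hansen, Mauch and Jamendo song lists.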

Hansen's Dataset

The dataset contains 9 pop music songs released in the early 2010s.

The audio comes in two versions: the original mix with instrumental accompaniment, and an a cappella version containing only the singing voice. An example song is available at https://www.dropbox.com/sh/wm6k4dqrww0fket/AAC1o1uRFxBPg9iAeSAd1Wxta?dl=0.

You can read in detail about how the dataset was made in: Recognition of Phonemes in A-cappella Recordings using Temporal Patterns and Mel Frequency Cepstral Coefficients (http://publica.fraunhofer.de/documents/N-345612.html). The recordings have been kindly provided by Jens Kofod Hansen for public evaluation.

  • file duration up to 4:40 minutes (total time: 35:33 minutes)
  • 3590 words annotated in total

Mauch's Dataset

The dataset contains 20 pop music songs with annotations of the beginning timestamp of each word. The audio has instrumental accompaniment. An example song is available at https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0.

You can read in detail about how the dataset was first used in: Integrating Additional Chord Information Into HMM-Based Lyrics-to-Audio Alignment (https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf). The dataset has been kindly provided by Sungkyun Chang.

  • file duration up to 5:40 minutes (total time: 1h 19m)
  • 5050 words annotated in total

Jamendo Dataset

This dataset contains 20 recordings covering a variety of Western music genres, annotated with start-of-word timestamps. All songs have instrumental accompaniment.

It is available online on GitHub (https://github.com/f90/jamendolyrics). Note that we do not allow tuning model parameters on this data; it may only be used to gain insight into the general structure of the test data. For more information, refer to the accompanying paper (https://arxiv.org/abs/1902.06797).

  • file duration up to 4:43 (total time: 1h 12m)
  • 5677 words annotated in total

Evaluation

Word Error Rate (WER): the standard metric used in Automatic Speech Recognition.

 WER = (S + I + D) / (C + S + D)

where
 C : number of correctly predicted words
 S : number of substitution errors
 I : number of insertion errors
 D : number of deletion errors
and C + S + D equals the number of words in the reference transcription.


Character Error Rate (CER): the same computation can also be performed at the character level. This metric penalises partially correct or misspelled words less heavily than WER.
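For reference, a self-contained sketch of how WER and CER can be computed with a standard edit-distance alignment; any text normalisation that the official scoring may apply (e.g. lowercasing or punctuation removal) is omitted here.

 # Levenshtein (edit) distance between two token sequences; its value equals
 # S + I + D for the optimal alignment.
 def edit_distance(ref, hyp):
     d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
     for i in range(len(ref) + 1):
         d[i][0] = i
     for j in range(len(hyp) + 1):
         d[0][j] = j
     for i in range(1, len(ref) + 1):
         for j in range(1, len(hyp) + 1):
             sub = 0 if ref[i - 1] == hyp[j - 1] else 1
             d[i][j] = min(d[i - 1][j] + 1,        # deletion
                           d[i][j - 1] + 1,        # insertion
                           d[i - 1][j - 1] + sub)  # substitution or match
     return d[-1][-1]
 def wer(ref_words, hyp_words):
     return edit_distance(ref_words, hyp_words) / len(ref_words)
 def cer(ref_text, hyp_text):
     return edit_distance(list(ref_text), list(hyp_text)) / len(ref_text)
 ref = "no more carefree laughter".split()
 hyp = "no more carefee laughter oh".split()
 print(wer(ref, hyp))  # 0.5: one substitution + one insertion over 4 reference words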

Submission closing dates

Closing date: December 9, 2021

Questions?

  • send us an email - e.demirel@qmul.ac.uk (Emir Demirel) or info@voicemagix.com (Georgi Dzhambazov)

Potential Participants

Chitralekha Gupta

Emir Demirel

Gerardo Roa Dabike

Jiawen Huang

Bibliography

Stoller, D., Durand, S., & Ewert, S. (2019). End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model. ICASSP 2019.

Mauch, M., Fujihara, H., & Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.