2018:Automatic Lyrics-to-Audio Alignment

From MIREX Wiki



Automatic lyrics-to-audio alignment can be useful for karaoke lyrics display and for aligning lyrics in music videos. It is also a pre-processing step for singing voice synthesis and for joint analysis of audio and lyrics [Fujihara, H., & Goto, M. (2012)]. Most previous work uses forced-alignment techniques stemming from the field of Automatic Speech Recognition [Loscos, A. et al. (1999), Mesaros, A. and Virtanen, T. (2008), Fujihara, H. et al. (2011)]. To improve alignment accuracy, many works also use additional musical side information extracted from the musical score, such as chord information [Mauch, M. et al. (2012)], note durations [Iskandar, D. (2006)] and syllable/phoneme durations [Kruspe, A. (2015), Dzhambazov, G. and Serra, X. (2015), Gong, R. et al. (2015), Pons, J. (2017)]. However, an open-source, fully automatic alignment system that does not rely on any musical side information has yet to be realized. Possible reasons include:

  1. The lack of a large, annotated, publicly available singing voice dataset
  2. Interference from the musical accompaniment
  3. The complexity of musical structure and the lack of clear singing phrase boundaries

The goal of the MIREX automatic lyrics-to-audio alignment task is to synchronize an audio recording of singing (a cappella or mixed) with its corresponding lyrics. The onset and offset timestamps of a lyric unit can be estimated at different granularities, such as phoneme, syllable, word, or phrase. For this task, syllable- or word-level alignment is required.

This task contains two subtasks:

  1. A cappella Mandarin Chinese pop songs
  2. Mixed English pop songs

Participants can submit their algorithms for one subtask or both, according to their interest and time constraints. Participants may use any external training dataset, and may modify or augment the datasets and annotations provided below.

Subtask 1: A cappella Mandarin Chinese pop songs

  ----------------------------    ---------------------------------------------------------------------
  | A cappella singing audio |    | Lyrics in pinyin: wang le you duo jiu zai mei ting dao ni ... ... |
  ----------------------------    ---------------------------------------------------------------------
                 |                                            |
                             | Alignment system |
                             | 0.123 	0.798 	wang  |
                             | 0.798 	1.123 	le    |
                             | 1.345 	2.176 	you   |
                             | ... ...                |

The subtask 1 algorithm receives two inputs, a cappella singing audio and its corresponding lyrics in pinyin format, and outputs the onset and offset timestamps (in seconds) of each pinyin syllable. Due to time constraints, we are not able to provide word-level annotations or a lexicon in simplified or traditional Chinese characters for the training datasets. Consequently, we do not accept submissions that take lyrics input in simplified or traditional Chinese characters. If you are willing to help verify the word-level annotation and build the lexicon, please see the Ask for contribution section.

Training Datasets

MIR-1k Dataset

The original MIR-1k dataset can be downloaded here. It contains 1000 song clips in which the musical accompaniment and the clean singing voice are recorded in the left and right channels, respectively. The duration of each clip ranges from 4 to 13 seconds, and the total length of the dataset is 133 minutes. The original dataset also includes the corresponding lyrics in traditional Chinese characters. We automatically converted the lyrics into pinyin format; here is the link. The pinyin lyrics have been manually corrected.

Jingju a cappella singing Dataset

Jingju (also known as Peking or Beijing opera) is a form of Chinese opera that combines music, vocal performance, mime, dance, and acrobatics. The language used in jingju is a combination of Beijing Mandarin and the dialects of Jiangsu, Anhui, and Hubei. The jingju a cappella singing dataset has 3 parts, each with phrase-level annotations in pinyin format (annotation_txt files):

The pinyin lyrics are manually corrected.

MIREX 2018 Mandarin pop song dataset

The dataset contains 20 Mandarin Chinese pop music songs with annotations of onset and offset timestamps of each phrase. The lyrics are in pinyin format. The dataset has two parts:

The pinyin lyrics are manually corrected.

Evaluation Datasets

The evaluation dataset contains 10 Mandarin Chinese pop songs collected at the same time as the MIREX 2018 Mandarin pop song training dataset. Five songs are sung by amateur singers, and the other five are singing voices source-separated from mixed recordings.


We provide a syllable-level pinyin lexicon and a phoneme lexicon corresponding to the lyrics annotations of the training datasets.

Audio Format

All recordings used for subtask 1 are in wav format.

  • MIR-1k dataset: 16kHz sampling rate, left channel - musical accompaniment, right channel - clean singing voice
  • Jingju a cappella singing dataset: 44.1kHz sampling rate, mono channel
  • MIREX 2018 Mandarin pop song dataset: 44.1kHz sampling rate, mono channel

Subtask 2: Mixed English pop songs

  -----------------------    ---------------------------------------------------
  | Mixed singing audio |    | Lyrics at word-level: no more carefree ... ... |
  -----------------------    ---------------------------------------------------
                 |                                            |
                             | Alignment system |
                             | 0.123 	0.798  no     |
                             | 0.798 	1.123  more   |
                             | 1.345 	2.176  carefree|
                             | ... ...                |

The subtask 2 algorithm receives two inputs, mixed singing audio (singing voice + musical accompaniment) and its corresponding lyrics at the word level, and outputs the onset and offset timestamps (in seconds) of each word.

Training Dataset

The DAMP dataset contains a large number (34,000+) of a cappella recordings from a wide variety of amateur singers, collected with the Sing! Karaoke mobile app under different recording conditions, but generally with good audio quality. A carefully curated subset, DAMPB, with 20 performances of each of 300 songs, was created by Kruspe (2016). Here is the list of recordings.

Evaluation Datasets

Hansen's Dataset

The dataset contains 9 English pop songs with annotations of both the beginning and ending timestamps of each word. The ending timestamps are provided for convenience (they are copies of the next word's beginning timestamp) and are not used in the evaluation. Sentence-level annotations are also provided. The audio has two versions: the original with instrumental accompaniment, and an a cappella version with singing voice only. An example song can be seen here.

You can read in detail about how the dataset was made here: Recognition of Phonemes in A-cappella Recordings using Temporal Patterns and Mel Frequency Cepstral Coefficients. The dataset has been kindly provided by Jens Kofod Hansen.

  • file duration up to 4:40 minutes (total time: 35:33 minutes)
  • 3590 words annotated in total

Mauch's Dataset

The dataset contains 20 English pop songs with annotations of the beginning timestamp of each word. Non-vocal sections are not explicitly annotated (they remain included in the last preceding word). We prefer to leave it this way to enable comparison with previous work evaluated on this dataset. The audio has instrumental accompaniment. An example song can be seen here.

You can read in detail about how the dataset was used for the first time here: Integrating Additional Chord Information Into HMM-Based Lyrics-to-Audio Alignment. The dataset has been kindly provided by Sungkyun Chang.

  • file duration up to 5:40 minutes (total time: 1:19:12 hours)
  • 5050 words annotated in total

Gracenote Dataset

The dataset contains 15 English pop song excerpts with annotations of the beginning timestamp of each word. 8 song excerpts have instrumental accompaniment. The other 7 song excerpts have two versions: with instrumental accompaniment, and a cappella singing only.

  • file duration up to 1:11 (total time: 11:42 minutes)
  • 1181 words annotated in total


A popular choice for phonetizing the words is the CMU pronunciation dictionary. Words can be phonetized with the online tool. A list of all words from both datasets that are not covered by the CMU word list is given here.

Audio Format

The data are wav/mp3 sound files, plus the associated word boundaries (in CSV-like .txt/.tsv files):

  • CD-quality (PCM, 16-bit, 44100 Hz)
  • single channel (mono) for a cappella and two channels for original


Evaluation

The submitted algorithms for both subtasks will be evaluated at the word boundaries of the original mixed songs (a cappella singing + instrumental accompaniment). Evaluation metrics computed on the a cappella singing alone will also be reported, to give insight into the impact of instrumental accompaniment on the algorithms, but will not be considered for the ranking.

Average absolute error/deviation: Initially utilized in Mesaros and Virtanen (2008), the absolute error measures the time displacement between an actual timestamp and its estimate at the beginning and end of each lyrical unit. The error is then averaged over all individual errors. An absolute error has the drawback that an error of the same duration can be perceived differently depending on the tempo of the song. Here is a test of using this metric.
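This metric can be sketched in a few lines of Python (an illustrative helper, not part of the official eval.py; the function name is ours):

```python
def average_absolute_error(ref_times, det_times):
    # Mean absolute deviation (in seconds) between reference boundary
    # timestamps and their detected estimates, matched one-to-one.
    assert len(ref_times) == len(det_times)
    return sum(abs(r - d) for r, d in zip(ref_times, det_times)) / len(ref_times)
```

The same function can be applied separately to onset and offset timestamps and the results averaged.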

Percentage of correct segments: The perceptual dependence on tempo is mitigated by measuring the total length of correctly labeled segments as a percentage of the total duration of the song. This metric is suggested by Fujihara et al. (2011), Figure 9. Here is a test of using this metric.
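A minimal sketch of this metric, assuming reference and detected alignments are lists of (onset, offset) pairs in the same word order, and approximating the duration ratio by sampling the timeline on a fixed grid (the function names are ours, not the official eval.py interface):

```python
def segment_index_at(t, segments):
    # Index of the segment (onset, offset) containing time t, or None.
    for i, (onset, offset) in enumerate(segments):
        if onset <= t < offset:
            return i
    return None

def percent_correct_segments(ref_segments, det_segments, step=0.01):
    # Fraction of the song duration for which the detected segment index
    # matches the reference one, sampled every `step` seconds.
    total = ref_segments[-1][1]          # song length = last reference offset
    n = int(round(total / step))
    correct = 0
    for k in range(n):
        t = k * step
        ref_idx = segment_index_at(t, ref_segments)
        if ref_idx is not None and ref_idx == segment_index_at(t, det_segments):
            correct += 1
    return correct / n
```

A finer `step` trades runtime for accuracy of the duration ratio.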

Percentage of correct estimates according to a tolerance window: This metric takes into account that onset displacements from the ground truth below a certain threshold can be tolerated by human listeners. We use a tolerance window of 0.3 seconds. This metric is suggested in Integrating Additional Chord Information Into HMM-Based Lyrics-to-Audio Alignment. Here is a test of using this metric.
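A one-function sketch of the tolerance-window metric (again an illustration, not the official eval.py):

```python
def percent_correct_onsets(ref_onsets, det_onsets, tolerance=0.3):
    # Fraction of onsets whose displacement from the reference
    # falls within the tolerance window (0.3 s in this task).
    hits = sum(1 for r, d in zip(ref_onsets, det_onsets) if abs(r - d) <= tolerance)
    return hits / len(ref_onsets)
```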

For more detailed definitions of the metrics and their formulas, please check Section 2.2.1 of this thesis.

To obtain all three metrics for one detected output:

python eval.py <file path of the reference word boundaries> <file path of the detected word boundaries>

Note that evaluation scripts depend on mir_eval.

Submission Format

Submissions to this task must conform to the format detailed below. Submissions should be packaged and contain at least two files: the algorithm itself and a README containing contact information and full instructions for using the algorithm.

Input Data

Participating algorithms must accept the following input formats:

  • Audio in wav format, 44.1 kHz (subtask 1: mono; subtask 2: stereo).
  • Lyrics in a .txt file where each word is separated by a space and each lyric phrase is separated by a line break (\n).
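For illustration, a lyrics file in this format can be read as a list of phrases, each a list of words (a minimal sketch; parse_lyrics is a hypothetical helper, not a required interface):

```python
def parse_lyrics(lines):
    # One phrase per line, words separated by spaces; blank lines are skipped.
    return [line.split() for line in lines if line.strip()]
```

Typical usage would be `with open("lyrics.txt", encoding="utf-8") as f: phrases = parse_lyrics(f)`.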

Output File Format

The alignment output file format is a tab-delimited ASCII text format.

Three-column text file of the format:

<onset time (sec)>\t<offset time (sec)>\t<label>\n
<onset time (sec)>\t<offset time (sec)>\t<label>\n
...

where \t denotes a tab and \n denotes the end of a line. The < and > characters are not included. An example output file would look something like:

0.000    5.223    label1
5.223    15.101   label2
15.101   20.334   label3

where the label is a Mandarin syllable in pinyin for subtask 1 and an English word for subtask 2.

NOTE: the offset timestamp column is used only by the percentage of correct segments metric. Therefore, omitting the second column is acceptable and will degrade only that metric.
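A sketch of a parser for this output format, handling both the full three-column rows and rows with the optional offset column omitted (an illustrative helper; the function name is ours):

```python
def parse_alignment(lines):
    # Rows are onset<TAB>offset<TAB>label; when the optional offset column
    # is omitted, rows are onset<TAB>label and offset is stored as None.
    entries = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            entries.append((float(parts[0]), float(parts[1]), parts[2]))
        elif len(parts) == 2:
            entries.append((float(parts[0]), None, parts[1]))
    return entries
```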

Command line calling format

The submitted algorithm must take as arguments a .wav file, a .txt file, and the full path and filename of the output file. The ability to specify the output path and file name is essential. Denoting the input .wav file path and name as %input_audio, the lyrics .txt file as %input_txt, and the output file path and name as %output, a program called foobar could be called from the command line as follows:

foobar %input_audio %input_txt %output
foobar -i %input_audio -it %input_txt  -o %output


A README file accompanying each submission should contain explicit instructions stating which subtask(s) the submission participates in and how to run the program (as well as contact information, etc.). In particular, each command line to run should be specified, using %input for the input sound file and %output for the resulting text file.

Packaging submissions

Please provide submissions as a binary or source code.

Time and hardware limits

Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions will be imposed. A hard limit of 24 hours will be imposed on analysis times. Submissions exceeding this limit may not receive a result.

Submission opening and closing dates

Closing date: August 11th 2018

Ask for contribution

Your contribution will help make this a better task next year. Several kinds of contribution come to mind:

  • You can help us verify and correct the word segmentation in the MIR-1k and MIREX 2018 Mandarin pop song datasets.
  • You can help us add the missing words to existing open-source Mandarin lexicons, such as thchs and Aishell.
  • You can provide any training or evaluation singing voice dataset with permission for public release.

Another idea for contribution? Please open an issue in the GitHub repo or send us an email.

Question? Problem? Suggestion?

If you have any question, problem or suggestion, please

  • open an issue in the GitHub repo,
  • or send us an email - rong.gong<at>upf.edu.


References

Chang, S., & Lee, K. (2017). Lyrics-to-Audio Alignment by Unsupervised Discovery of Repetitive Patterns in Vowel Acoustics. arXiv preprint arXiv:1701.06078.

Dzhambazov, G., & Serra, X. (2015). Modeling of phoneme durations for alignment between polyphonic audio and lyrics. In 12th Sound and Music Computing Conference.

Dzhambazov, G. (2017). Knowledge-based probabilistic modeling for tracking lyrics in music audio signals, Ph.D. Thesis

Iskandar, D., Wang, Y., Kan, M.-Y., & Li, H. (2006). Syllabic level automatic synchronization of music signals and text lyrics. In Proceedings of the 14th ACM International Conference on Multimedia, Santa Barbara, CA, USA.

Kruspe, A. (2015). Keyword spotting in singing with duration-modeled HMMs. In 23rd European Signal Processing Conference (EUSIPCO), Nice, France.

Kruspe, A. (2016). Bootstrapping a System for Phoneme Recognition and Keyword Spotting in Unaccompanied Singing, ISMIR 2016

Mesaros, A., & Virtanen, T. (2008). Automatic alignment of music audio and lyrics. In Proceedings of the 11th International Conference on Digital Audio Effects (DAFx-08), Espoo, Finland.

Mesaros, A. (2013). Singing voice identification and lyrics transcription for music information retrieval invited paper. 2013 7th Conference on Speech Technology and Human-Computer Dialogue (SpeD), 1-10.

Loscos, A., Cano, P., & Bonada, J. (1999). Low-delay singing voice alignment to text. In International Computer Music Conference, Beijing, China.

Fujihara, H., Goto, M., Ogata, J., & Okuno, H. G. (2011). LyricSynchronizer: Automatic synchronization system between musical audio signals and lyrics. IEEE Journal of Selected Topics in Signal Processing.

Fujihara, H., & Goto, M. (2012). Lyrics-to-audio alignment and its application. In Dagstuhl Follow-Ups (Vol. 3). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.

Mauch, M., Fujihara, H., & Goto, M. (2012). Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 200-210.

Gong, R., Cuvillier, P., Obin, N., & Cont, A. (2015). Real-time audio-to-score alignment of singing voice based on melody and lyric information. In Proceedings of Interspeech 2015.

Pons, J., Gong, R., & Serra, X. (2017). Score-informed syllable segmentation for a cappella singing voice with convolutional neural networks. ISMIR 2017.

