2017:Automatic Lyrics-to-Audio Alignment
Revision as of 12:20, 11 August 2017

Description

The task of automatic lyrics-to-audio alignment has as an end goal the synchronization between an audio recording of singing and its corresponding written lyrics. The start and end timestamps of lyrics units can be estimated at different granularities: phonemes, words, lyrics lines, or phrases. For this task, word-level alignment is required.

Task specific mailing list

Data

Training Dataset

The DAMP dataset contains a large number (34 000+) of a cappella recordings from a wide variety of amateur singers in different recording conditions, but generally with good audio quality. The recordings cover 301 pop songs and are collected with the Sing! Karaoke mobile app.


Evaluation Datasets

Hansen's Dataset

The dataset contains 10 songs of popular music with annotations of both beginning- and ending-timestamps of each word. Non-vocal segments are assigned a special word BREATH. Sentence-level annotations are also provided. The audio comes in two versions: the original with instrumental accompaniment, and an a cappella version with singing voice only. An example song can be seen here

You can read in detail about how the dataset was made here: Recognition of Phonemes in A-cappella Recordings using Temporal Patterns and Mel Frequency Cepstral Coefficients. The dataset has been kindly provided by Jens Kofod Hansen.

  • file duration up to 4 minutes (total time: 38 minutes)
  • X boundaries in total

Mauch's Dataset

The dataset contains 20 songs of popular music with annotations of the beginning-timestamp of each word. Non-vocal sections are not explicitly annotated (they remain included in the last preceding word). We prefer to leave it this way to enable comparison with previous work evaluated on this dataset. The audio has instrumental accompaniment. An example song can be seen [https://www.dropbox.com/sh/8pp4u2xg93z36d4/AAAsCE2eYW68gxRhKiPH_VvFa?dl=0 here]

You can read in detail about how the dataset was used for the first time here: [https://pdfs.semanticscholar.org/547d/7a5d105380562ca3543bf05b4d5f7a8bee66.pdf Integrating Additional Chord Information Into HMM-Based Lyrics-to-Audio Alignment]. The dataset has been kindly provided by Sungkyun Chang.

  • file duration up to 5:40 (total time: 1:19:12 hours)
  • 5050 boundaries in total

Audio Format

The data are wav/mp3 sound files, plus the associated word boundaries (in csv-like .txt/.tsv files)

  • CD-quality (PCM, 16-bit, 44100 Hz)
  • single channel (mono) for a cappella and two channels for original

Evaluation

The submitted algorithms will be evaluated at the boundaries of words for the original multi-instrumental songs. Evaluation metrics on the a cappella versions will be reported as well, to give insight into the impact of instrumental accompaniment on the algorithms, but will not be considered for the ranking.

Average absolute error/deviation

Initially utilized in Mesaros and Virtanen (2008), the absolute error measures the time displacement between the actual timestamp and its estimate at the beginning and the end of each lyrical unit. The error is then averaged over all individual errors. An absolute error has the drawback that an error of the same duration can be perceived differently depending on the tempo of the song. To evaluate it, use this python script
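The metric can be sketched as follows, assuming the reference and estimated word boundaries are given as lists of timestamps in seconds, paired by index (the official python script, built on mir_eval, may differ in details):

```python
def average_absolute_error(ref_times, est_times):
    """Mean absolute displacement (seconds) between reference and
    estimated word-boundary timestamps, paired by index."""
    if len(ref_times) != len(est_times):
        raise ValueError("reference and estimate must have equal length")
    errors = [abs(r - e) for r, e in zip(ref_times, est_times)]
    return sum(errors) / len(errors)
```

Note that, as the paragraph above observes, the same absolute error reads as larger or smaller depending on the song's tempo; the metric below addresses this.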


Percentage of correct segments

The perceptual dependence on tempo is mitigated by measuring the ratio of the total length of correctly labeled segments to the total duration of the song, a metric suggested by Fujihara et al. (2011, Figure 9). To evaluate it, use this python script
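A minimal sketch of this metric, assuming segments are (onset, offset, word) triples and that time covered by an estimated segment counts as correct where it overlaps a reference segment with the same word (repeated identical words could double-count here; the official script may handle this differently):

```python
def fraction_correct(ref_segments, est_segments, total_duration):
    """Fraction of the song's duration where the estimated word label
    matches the overlapping reference word label."""
    correct = 0.0
    for r_on, r_off, r_word in ref_segments:
        for e_on, e_off, e_word in est_segments:
            if r_word == e_word:
                # length of the interval where both segments overlap
                correct += max(0.0, min(r_off, e_off) - max(r_on, e_on))
    return correct / total_duration
```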

Note that both scripts depend on mir_eval. We have a fix that will be pulled soon. If this is still not the case, pull and install mir_eval from the fork with the fix.

Submission Format

Submissions to this task will have to conform to a specified format detailed below. Submissions should be packaged and contain at least two files: The algorithm itself and a README containing contact information and detailing, in full, the use of the algorithm.

Input Data

Participating algorithms will have to read audio in the following format:

  • Audio for the original songs in wav (stereo)
  • Lyrics in a .txt file, where words are separated by spaces and lyrics lines are separated by newlines.
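Reading that lyrics format is straightforward; a sketch (load_lyrics is a hypothetical helper name, not part of any provided code):

```python
def load_lyrics(path):
    """Return the lyrics as a list of lines, each a list of words."""
    with open(path) as f:
        # one inner list per non-empty lyrics line; split() handles
        # any run of whitespace between words
        return [line.split() for line in f if line.strip()]
```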

Output File Format

The alignment output file format is a tab-delimited ASCII text format.

Three column text file of the format

<onset_time(sec)>\t<offset_time(sec)>\t<word>\n
<onset_time(sec)>\t<offset_time(sec)>\t<word>\n
...

where \t denotes a tab and \n denotes the end of a line. The < and > characters are not included. An example output file would look something like:

0.000    5.223    word1
5.223    15.101   word2
15.101   20.334   word3
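Producing that format is a matter of tab-joining each triple; a sketch (write_alignment is a hypothetical helper name, and the three-decimal rounding is an assumption, not a requirement stated above):

```python
def write_alignment(path, alignment):
    """Write (onset_sec, offset_sec, word) triples as the tab-delimited
    <onset>\t<offset>\t<word> lines described above."""
    with open(path, "w") as f:
        for onset, offset, word in alignment:
            f.write("%.3f\t%.3f\t%s\n" % (onset, offset, word))
```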


Command line calling format

The submitted algorithm must take as arguments a .wav file, a .txt file, and the full path and filename of the output file. The ability to specify the output path and file name is essential. Denoting the input .wav file path and name as %input_audio, the lyrics .txt file as %input_txt, and the output file path and name as %output, a program called foobar could be called from the command line as follows:

foobar %input_audio %input_txt %output
foobar -i %input_audio -it %input_txt -o %output
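A submission entry point matching the first (positional) calling convention could be sketched like this; align_words is a placeholder for the actual alignment algorithm, here producing dummy zero timestamps only so the sketch runs end to end:

```python
import sys

def align_words(audio_path, lyrics_path):
    """Placeholder for the real lyrics-to-audio alignment: it ignores
    the audio and assigns dummy zero timestamps to each lyrics word."""
    with open(lyrics_path) as f:
        words = f.read().split()
    return [(0.0, 0.0, w) for w in words]

def main(argv):
    # positional arguments: %input_audio %input_txt %output
    input_audio, input_txt, output = argv[1], argv[2], argv[3]
    alignment = align_words(input_audio, input_txt)
    with open(output, "w") as f:
        for onset, offset, word in alignment:
            f.write("%.3f\t%.3f\t%s\n" % (onset, offset, word))

if __name__ == "__main__" and len(sys.argv) >= 4:
    main(sys.argv)
```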


README File

A README file accompanying each submission should contain explicit instructions on how to run the program (as well as contact information, etc.). In particular, each command line to run should be specified, using %input for the input sound file and %output for the resulting text file.

Packaging submissions

Please provide submissions as a binary or source code.

Time and hardware limits

Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions will be imposed. A hard limit of 24 hours will be imposed on analysis times. Submissions exceeding this limit may not receive a result.

Submission opening date

21 July

Submission closing date

4 September

Potential Participants

Nikolaos Tsipas nitsipas [at] auth [dot] gr