2017:Automatic Lyrics-to-Audio Alignment
The goal of automatic lyrics-to-audio alignment is to synchronize an audio recording of singing with its corresponding written lyrics. The start and end timestamps of lyrics units can be estimated at different granularities: phonemes, words, lyrics lines, or phrases. This task requires alignment at both the word level and the sentence (lyrics line) level.
Task specific mailing list
The evaluation dataset contains 11 songs of popular music, annotated with timestamps of the words and the sentences. The audio comes in two versions: the original with instrumental accompaniment, and an a cappella version with singing voice only.
You can read in detail about how the dataset was made here: Recognition of Phonemes in A-cappella Recordings using Temporal Patterns and Mel Frequency Cepstral Coefficients. The dataset has been kindly provided by Jens Kofod Hansen.
The data are monophonic sound files, with the associated lyrics unit boundaries (in csv-like .txt files):
- CD-quality (PCM, 16-bit, 44100 Hz)
- single channel (mono)
- file duration up to 4 minutes (total time: 38 minutes)
Average absolute error/deviation: Initially used by Mesaros and Virtanen (2008), the absolute error measures the time displacement between the actual timestamp and its estimate at the beginning and the end of each lyrical unit. The individual errors are then averaged. The drawback of an error in absolute terms is that an error of the same duration can be perceived differently depending on the tempo of the song.
Percentage of correct segments: The perceptual dependence on tempo is mitigated by measuring the ratio of the total length of the correctly labeled segments to the total duration of the song, a metric suggested by Fujihara et al. (2011, Figure 9).
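As an illustration of the average absolute error described above, here is a minimal Python sketch. It is not the task's official evaluation code. The function name `average_absolute_error` and the input layout (two equally long lists of `(onset, offset)` pairs in seconds, in the same order) are assumptions made for this example.

```python
def average_absolute_error(reference, estimated):
    """Mean absolute deviation over all onsets and offsets, in seconds.

    Assumption: both arguments are lists of (onset, offset) tuples that
    describe the same lyrical units in the same order.
    """
    if len(reference) != len(estimated):
        raise ValueError("reference and estimated must have the same length")
    if not reference:
        raise ValueError("at least one lyrical unit is required")

    errors = []
    for (ref_on, ref_off), (est_on, est_off) in zip(reference, estimated):
        errors.append(abs(ref_on - est_on))    # displacement at the beginning
        errors.append(abs(ref_off - est_off))  # displacement at the end
    return sum(errors) / len(errors)
```

For example, with reference units `[(0.0, 1.0), (1.0, 2.0)]` and estimates `[(0.1, 1.2), (0.9, 2.0)]`, the four individual errors are 0.1, 0.2, 0.1 and 0.0 seconds, so the average is about 0.1 seconds.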
Both metrics are implemented here
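Independently of the linked implementation, the percentage of correct segments can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function name `percentage_correct_segments` is hypothetical, the units are given as `(onset, offset)` pairs in seconds in matching order, only the overlap between a reference unit and the estimate of the same unit counts as correct, and non-lyrics regions (such as silence between units) are not credited. The official evaluation may treat these cases differently.

```python
def percentage_correct_segments(reference, estimated, total_duration):
    """Percentage of the song duration covered by correctly labeled segments.

    Assumptions: reference and estimated are lists of (onset, offset)
    tuples for the same lyrical units in the same order, and
    total_duration is the song length in seconds.
    """
    if len(reference) != len(estimated):
        raise ValueError("reference and estimated must have the same length")
    if total_duration <= 0:
        raise ValueError("total_duration must be positive")

    correct = 0.0
    for (ref_on, ref_off), (est_on, est_off) in zip(reference, estimated):
        # Length of the time span where the estimate and the reference
        # agree on this unit; zero if the two intervals do not overlap.
        overlap = min(ref_off, est_off) - max(ref_on, est_on)
        correct += max(0.0, overlap)
    return 100.0 * correct / total_duration
```

With reference units `[(0, 2), (2, 4)]`, estimates `[(0, 1), (1, 4)]` and a 4-second song, the overlaps are 1 and 2 seconds, so 75% of the song is labeled correctly.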
Submissions to this task must conform to the format detailed below. Submissions should be packaged and contain at least two files: the algorithm itself and a README containing contact information and detailing, in full, the use of the algorithm.
Participating algorithms will have to read audio in the following format:
- Sample rate: 44.1 kHz
- Sample size: 16 bit
- Number of channels: 1 (mono)
- Encoding: WAV
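A submission may want to verify that an input file matches the audio format listed above before processing it. The following is a minimal sketch using Python's standard `wave` module. The function name `matches_task_format` is an assumption for this example and is not part of the task specification.

```python
import wave


def matches_task_format(source):
    """Return True if the WAV input is 44.1 kHz, 16-bit, mono.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        return (
            wav.getframerate() == 44100   # 44.1 kHz sample rate
            and wav.getsampwidth() == 2   # 2 bytes per sample = 16 bit
            and wav.getnchannels() == 1   # mono
        )
```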
The lyrics are in a .txt file in which words are separated by a space and lyrics lines are separated by a newline.
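The lyrics layout described above can be read with a few lines of Python. This is a sketch only: the function name `parse_lyrics` is hypothetical, and skipping empty lines is an assumption that the task description does not address.

```python
def parse_lyrics(text):
    """Split lyrics text into a list of lines, each a list of words."""
    lines = []
    for raw_line in text.splitlines():
        words = raw_line.split()  # words are separated by whitespace
        if words:                 # assumption: blank lines carry no lyrics
            lines.append(words)
    return lines
```

For example, `parse_lyrics("hello world\nfoo bar baz\n")` returns `[["hello", "world"], ["foo", "bar", "baz"]]`. The inner lists give the word-level units and the outer list gives the lyrics lines.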
Output File Format
The alignment output file format is a tab-delimited ASCII text format.
Three-column text file of the format
<onset_time(sec)>\t<offset_time(sec)>\t<word>\n
<onset_time(sec)>\t<offset_time(sec)>\t<word>\n
...
where \t denotes a tab, \n denotes the end of line. The < and > characters are not included. An example output file would look something like:
0.000	5.223	word1
5.223	15.101	word2
15.101	20.334	word3
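Producing this output format takes very little code. The sketch below builds the tab-delimited text from a list of `(onset, offset, word)` tuples. The function name `format_alignment` is an assumption, and the three decimal places are chosen only to match the example values; the task description does not prescribe a precision.

```python
def format_alignment(rows):
    """Render (onset_sec, offset_sec, word) tuples as tab-delimited lines."""
    return "".join(
        f"{onset:.3f}\t{offset:.3f}\t{word}\n"  # one line per word
        for onset, offset, word in rows
    )
```

The returned string can be written directly to the output path, for example with `open(output_path, "w", encoding="ascii").write(format_alignment(rows))`.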
Command line calling format
The submitted algorithm must take as arguments a .wav file, a .txt file, and the full output path and filename of the output file. The ability to specify the output path and file name is essential. Denoting the input .wav file path and name as %input_audio, the lyrics .txt file as %input_txt, and the output file path and name as %output, a program called foobar could be called from the command line as follows:
foobar %input_audio %input_txt %output
foobar -i %input_audio -it %input_txt -o %output
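For submissions written in Python, the flag-based calling convention can be handled with the standard `argparse` module. The following is a minimal sketch, not a required interface: the function name `build_parser` and the attribute names `input_audio`, `input_txt` and `output` are assumptions that mirror the placeholders used above.

```python
import argparse


def build_parser():
    """Parser for: foobar -i %input_audio -it %input_txt -o %output"""
    parser = argparse.ArgumentParser(prog="foobar")
    parser.add_argument("-i", dest="input_audio", required=True,
                        help="path to the input .wav file")
    parser.add_argument("-it", dest="input_txt", required=True,
                        help="path to the lyrics .txt file")
    parser.add_argument("-o", dest="output", required=True,
                        help="full path and file name of the output file")
    return parser
```

Calling `build_parser().parse_args()` in the program's entry point then yields an object with the three paths as attributes.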
A README file accompanying each submission should contain explicit instructions on how to run the program (as well as contact information, etc.). In particular, each command line to run should be specified, using %input_audio for the input sound file, %input_txt for the lyrics file, and %output for the resulting text file.
Please provide submissions as a binary or source code.
Time and hardware limits
Due to the potentially high number of participants in this and other audio tasks, hard limits on the runtime of submissions will be imposed. A hard limit of 24 hours will be imposed on analysis times. Submissions exceeding this limit may not receive a result.