2017:Automatic Lyrics-to-Audio Alignment

From MIREX Wiki
Revision as of 16:43, 31 May 2017

Description

The end goal of automatic lyrics-to-audio alignment is the synchronization between an audio recording of singing and its corresponding written lyrics. The start and end timestamps of lyrical units can be estimated at different granularities: phonemes, words, lyrics lines, phrases. This task requires alignment at the word level and at the sentence (lyrics line) level.

Task specific mailing list

Data

The evaluation dataset contains 11 songs of popular music, annotated with timestamps of the words and the sentences. The audio comes in two versions: the original with instrumental accompaniment, and an a cappella version with the singing voice only.

You can read in detail about how the dataset was made in "Recognition of Phonemes in A-cappella Recordings using Temporal Patterns and Mel Frequency Cepstral Coefficients". The dataset has been kindly provided by Jens Kofod Hansen.


Audio Formats

The data are monophonic sound files with the associated lyrical-unit boundaries (in csv-like .txt files):

  • CD-quality (PCM, 16-bit, 44100 Hz)
  • single channel (mono)
  • file duration up to 4 minutes (total time: 38 minutes)
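The exact layout of the csv-like .txt annotation files is not specified above; as a minimal sketch, the loader below assumes one lyrical unit per line with comma-separated start time, end time, and text. The function name and column order are my own assumptions — adjust the delimiter and columns to match the actual files.

```python
import csv

def load_annotations(path):
    """Read a csv-like annotation file into (start, end, text) tuples.

    Assumed columns per row: start time (s), end time (s), lyric text.
    Any commas inside the lyric text are re-joined.
    """
    units = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            start, end = float(row[0]), float(row[1])
            text = ",".join(row[2:]).strip()
            units.append((start, end, text))
    return units
```

The same tuples can then feed both evaluation metrics described below, one list for the reference annotations and one for the system output.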

Evaluation

Average absolute error/deviation. Initially used in Mesaros and Virtanen (2008) (http://www.cs.tut.fi/~mesaros/pubs/autalign_cr.pdf), the absolute error measures the time displacement between the actual timestamp and its estimate at the beginning and the end of each lyrical unit. The errors are then averaged over all boundaries. An error in absolute terms has the drawback that the perception of an error of the same duration differs depending on the tempo of the song.
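The metric above reduces to a mean absolute deviation over matched boundary timestamps. A minimal sketch (function and variable names are my own, not the reference implementation's):

```python
def average_absolute_error(ref_times, est_times):
    """Mean absolute deviation (seconds) between reference and
    estimated boundary timestamps of the lyrical units.

    ref_times and est_times are equal-length lists of timestamps,
    where the i-th estimate corresponds to the i-th reference.
    """
    if len(ref_times) != len(est_times):
        raise ValueError("reference and estimate must have the same length")
    return sum(abs(r - e) for r, e in zip(ref_times, est_times)) / len(ref_times)

# Example: word-onset timestamps in seconds
ref = [0.50, 1.20, 2.00, 3.10]
est = [0.55, 1.10, 2.20, 3.10]
print(average_absolute_error(ref, est))  # ~0.0875 s
```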

Percentage of correct segments. The perceptual dependence on tempo is mitigated by measuring the ratio of the total length of correctly labeled segments to the total duration of the song, a metric suggested by Fujihara et al. (2011), Figure 9 (https://www.researchgate.net/publication/224241940_LyricSynchronizer_Automatic_Synchronization_System_Between_Musical_Audio_Signals_and_Lyrics).
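One way to compute this ratio is to sum, per lyrical unit, the temporal overlap between the reference segment and its estimate, and divide by the song duration. The sketch below assumes paired (start, end) segments; names are my own, not those of the reference implementation.

```python
def percent_correct_segments(ref_segments, est_segments, song_duration):
    """Fraction of the song's duration where the estimated alignment
    agrees with the reference.

    ref_segments and est_segments are equal-length lists of
    (start, end) pairs in seconds; the i-th estimate corresponds
    to the i-th reference unit. Agreement time for each unit is
    the overlap of the two intervals.
    """
    overlap = sum(
        max(0.0, min(r_end, e_end) - max(r_start, e_start))
        for (r_start, r_end), (e_start, e_end)
        in zip(ref_segments, est_segments)
    )
    return overlap / song_duration

# Example: two lines of lyrics in a 4-second excerpt
ref = [(0.0, 2.0), (2.0, 4.0)]
est = [(0.0, 2.5), (2.5, 4.0)]
print(percent_correct_segments(ref, est, 4.0))  # 0.875
```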

Both metrics are implemented at https://github.com/georgid/AlignmentEvaluation.

Submission Format

Audio Format

Command line calling format

I/O format

Packaging submissions

Time and hardware limits

Submission opening date

Submission closing date

Potential Participants