2024:Music Audio Generation

From MIREX Wiki
Revision as of 22:08, 26 August 2024 by Junyan

Task Description

The MIREX 2024 Music Audio Generation Task challenges participants to develop models that generate high-quality, original music audio clips. The task aims to advance the state of the art in music generation by encouraging systems that produce coherent, aesthetically pleasing, and musically diverse outputs across genres and styles.

Participants will be required to generate music clips based on textual prompts or other conditioning information provided in the dataset. The generated audio will be evaluated based on its musical quality, creativity, adherence to the provided prompt, and overall listenability.

Dataset

Description

The MusicGen2024 dataset will serve as the benchmark for this task. This dataset is specially curated to facilitate the generation of music in response to specific prompts. It includes:

  • Audio Clips: A collection of diverse music clips across various genres, ranging from classical to electronic music, to help in training and evaluation.
  • Textual Prompts: Detailed prompts associated with each music clip, describing the desired musical characteristics such as mood, genre, instrumentation, and tempo.

The dataset is designed to support both the training of generative models and the evaluation of their outputs.

Description of Audio Files

The audio files in the MusicGen2024 dataset are selected to represent a broad spectrum of musical genres and styles. Each clip is provided in a high-quality format, ensuring that the nuances of musical elements are preserved. The dataset includes clips of varying lengths, with a focus on short to medium-length excerpts (10 to 30 seconds) to facilitate manageable training and evaluation cycles.

Description of Text

The textual prompts provided in the dataset are carefully crafted to guide the generation process. These prompts include specific instructions regarding the desired genre, mood, instrumentation, and other musical characteristics. They are designed to challenge the generative models to produce music that is not only coherent but also closely aligned with the given descriptions.

Description of Split

The MusicGen2024 dataset is divided into training, validation, and evaluation subsets. Participants must not use the evaluation subset for training or validation purposes to ensure a fair and unbiased assessment of model performance. The dataset split ensures a diverse representation of musical styles and genres in both the training and evaluation phases.

Baseline

Gen-MusicTransformer: Model Architecture

Gen-MusicTransformer employs a transformer-based architecture tailored for music generation tasks. The model is designed to handle sequential data, making it well-suited for generating coherent and contextually rich music clips.

  • Encoder: The encoder processes the input textual prompt, transforming it into a series of embeddings that capture the key aspects of the prompt, such as mood, genre, and instrumentation.
  • Decoder: The decoder is responsible for generating the music audio. It uses a stack of transformer blocks to predict the next log-mel spectrogram frame from the previous context. A vocoder then converts the resulting spectrogram into an audio waveform.
  • Conditioning: The model can be conditioned on additional inputs, such as specific musical motifs or rhythms, allowing for more controlled generation outputs.

Gen-MusicTransformer is pre-trained on a large corpus of music data and fine-tuned on the MusicGen2024 dataset to optimize its performance on the specific task of prompt-based music generation.
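
The decoder's next-frame prediction described above amounts to an autoregressive loop. The following is a minimal Python sketch of that loop only, not the baseline's actual code: `generate_frames`, `toy_predictor`, and the 4-bin frame size are hypothetical names and values chosen for illustration, and the vocoder step is omitted.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # one log-mel frame (hypothetical: a plain list of bin values)


def generate_frames(
    prompt_embedding: Sequence[float],
    predict_next_frame: Callable[[Sequence[float], List[Frame]], Frame],
    num_frames: int,
) -> List[Frame]:
    """Build a spectrogram autoregressively, one frame at a time."""
    frames: List[Frame] = []
    for _ in range(num_frames):
        # Each step sees the prompt embedding and every frame produced so far.
        frames.append(predict_next_frame(prompt_embedding, frames))
    return frames


def toy_predictor(
    prompt_embedding: Sequence[float], history: List[Frame], n_mels: int = 4
) -> Frame:
    """Hypothetical stand-in for the transformer blocks (assumption, not the real model)."""
    bias = sum(prompt_embedding) / len(prompt_embedding)
    previous = history[-1] if history else [0.0] * n_mels
    # Next frame = damped copy of the previous frame, shifted by the prompt.
    return [0.5 * value + bias for value in previous]


frames = generate_frames([1.0, 3.0], toy_predictor, num_frames=3)
```

In a real system the callable would wrap the trained decoder, and the returned frames would be passed to a vocoder to obtain a waveform.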

Metrics

The evaluation of the generated music will be based on a combination of objective and subjective metrics:

  • MOS (Mean Opinion Score): A subjective evaluation metric where human listeners rate the overall quality and aesthetic appeal of the generated music.
  • Inception Score (IS): An objective metric that evaluates the diversity and quality of the generated music, computed from the class predictions of a pre-trained music classification model. Higher values are better.
  • FAD (Fréchet Audio Distance): Measures the distance between the distribution of generated music and that of real music in an embedding space, capturing both quality and diversity. Lower values indicate a closer match.
  • Prompt Adherence Score: A metric designed to assess how well the generated music aligns with the provided textual prompts.
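
The two objective metrics above have standard closed forms, sketched here in standard-library Python. This is not the official evaluation code. `inception_score` assumes its input is a list of per-clip class-probability vectors from some classifier. `frechet_distance_diag` is a deliberate simplification: it treats embedding dimensions as independent (diagonal covariance), whereas FAD as usually computed uses full covariance matrices over embeddings from a pre-trained audio model.

```python
import math
from typing import List, Sequence, Tuple


def inception_score(probs: Sequence[Sequence[float]]) -> float:
    """exp of the mean KL divergence between each p(y|x) and the marginal p(y)."""
    n = len(probs)
    num_classes = len(probs[0])
    marginal = [sum(p[k] for p in probs) / n for k in range(num_classes)]
    mean_kl = 0.0
    for p in probs:
        # Classes with zero probability contribute nothing to the KL sum.
        mean_kl += sum(
            pk * math.log(pk / marginal[k]) for k, pk in enumerate(p) if pk > 0
        ) / n
    return math.exp(mean_kl)


def _mean_and_variance(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and population variance of a set of embeddings."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    variances = [sum((r[d] - means[d]) ** 2 for r in rows) / n for d in range(dims)]
    return means, variances


def frechet_distance_diag(
    real: Sequence[Sequence[float]], generated: Sequence[Sequence[float]]
) -> float:
    """Frechet distance between two Gaussians with diagonal covariance (simplified)."""
    mu_r, var_r = _mean_and_variance(real)
    mu_g, var_g = _mean_and_variance(generated)
    # With diagonal covariances the trace term reduces to (sqrt(v_r) - sqrt(v_g))^2.
    return sum(
        (mr - mg) ** 2 + (math.sqrt(vr) - math.sqrt(vg)) ** 2
        for mr, mg, vr, vg in zip(mu_r, mu_g, var_r, var_g)
    )
```

Confident, varied predictions push the Inception Score toward the number of classes, while identical uninformative predictions give a score of 1. The Fréchet distance is 0 when both embedding sets have the same mean and variance.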

Each metric will contribute to the final ranking, with MOS and Prompt Adherence Score weighted most heavily.
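
One way such a weighted ranking could work is sketched below. The task description gives no numeric weights or normalisation scheme, so everything specific here is an assumption made for illustration: the values in `WEIGHTS`, the metric keys, and the use of min-max normalisation across systems. The only detail taken from the metric definitions is that FAD is a distance, so lower values are treated as better.

```python
from typing import Dict, List

# Hypothetical weights: the task text only says MOS and prompt adherence count most.
WEIGHTS = {
    "mos": 0.35,
    "prompt_adherence": 0.35,
    "inception_score": 0.15,
    "fad": 0.15,
}
LOWER_IS_BETTER = {"fad"}


def rank_systems(scores: Dict[str, Dict[str, float]]) -> List[str]:
    """Return system names, best first, by weighted min-max-normalised score."""
    totals = {name: 0.0 for name in scores}
    for metric, weight in WEIGHTS.items():
        values = [system[metric] for system in scores.values()]
        lo, hi = min(values), max(values)
        for name, system in scores.items():
            # If all systems tie on a metric, give each the same credit.
            norm = 0.5 if hi == lo else (system[metric] - lo) / (hi - lo)
            if metric in LOWER_IS_BETTER:
                norm = 1.0 - norm
            totals[name] += weight * norm
    return sorted(totals, key=lambda name: totals[name], reverse=True)


example = {
    "system_a": {"mos": 4.0, "prompt_adherence": 0.8, "inception_score": 2.0, "fad": 3.0},
    "system_b": {"mos": 3.0, "prompt_adherence": 0.6, "inception_score": 2.5, "fad": 2.0},
}
ranking = rank_systems(example)
```

With these illustrative weights, a system that leads on MOS and prompt adherence outranks one that leads only on the two lower-weighted metrics.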

Download

The MusicGen2024 dataset, including both the audio clips and the corresponding textual prompts, will be made available for download. A download link will be posted here.

Rules

Participants may use external datasets and pre-trained models to develop their systems. However, using the MusicGen2024 evaluation split for training or validation is strictly prohibited. Participants must ensure that their submissions are original and that any external data they use does not overlap with the evaluation data.

Submission

Submissions will be assessed automatically using CodaBench.

Participants are required to submit the following:

  • Audio Files: A set of generated music clips corresponding to the prompts in the evaluation dataset.
  • PDF File: A detailed report describing the system architecture, training process, and any external data or models used.
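
Before uploading, it can help to confirm that every evaluation prompt has a matching generated clip. The sketch below is a minimal check of that kind. It assumes a hypothetical naming convention of `<prompt_id>.wav`, which this page does not specify, so the check should be adapted to whatever file layout the organisers announce.

```python
from pathlib import PurePath
from typing import Iterable, Set, Tuple


def check_submission(
    prompt_ids: Iterable[str], filenames: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (prompt IDs with no clip, clip IDs with no matching prompt)."""
    expected = set(prompt_ids)
    # Hypothetical convention: each clip is named "<prompt_id>.wav".
    submitted = {
        PurePath(name).stem
        for name in filenames
        if PurePath(name).suffix.lower() == ".wav"
    }
    return expected - submitted, submitted - expected


missing, unexpected = check_submission(
    ["p1", "p2", "p3"],
    ["p1.wav", "p3.WAV", "x9.wav", "report.pdf"],
)
```

Non-audio files such as the PDF report are ignored. Both returned sets being empty means the clip set and the prompt set match one to one.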

Each participant or team may submit up to three versions of their system. The final ranking will be based on the metrics outlined above.