Music Structure Analysis (MIREX 2025)
Important Note: MIREX 2025 will be held as a workshop of ISMIR 2025. Papers accepted and presented at MIREX 2025 will have the opportunity to be showcased in the ISMIR 2025 Late Breaking Demo Track.
Description
The aim of the MIREX Music Structure Analysis task is to identify and label key structural sections in musical audio. Musical form (e.g., intro, verse, chorus) is fundamental to music understanding and a crucial component of many music information retrieval applications. While traditional approaches focused on segmenting music into internally consistent but arbitrarily labeled sections (e.g., A, B, C), this task has since evolved.
Since 2020, a new paradigm has emerged, focusing on functional structure analysis. The goal is to segment the audio and assign a specific functional label to each segment from a predefined set of common musical functions. This task challenges systems to perform both accurate boundary detection and correct functional classification.
This task builds upon a history of structural segmentation evaluations, first run in MIREX 2009. Recent works driving this updated focus include:
- Wang, J. C., Hung, Y. N., & Smith, J. B. (2022, May). To catch a chorus, verse, intro, or anything else: Analyzing a song with structural functions. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 416-420). IEEE.
- Kim, T., & Nam, J. (2023, October). All-in-one metrical and functional structure analysis with neighborhood attentions on demixed audio. In 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (pp. 1-5). IEEE.
- Buisson, M., McFee, B., Essid, S., & Crayencour, H. C. (2024). Self-supervised learning of multi-level audio representations for music segmentation. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
For MIREX 2025, participants are required to segment musical audio and classify each segment into one of seven functional categories: 'intro', 'verse', 'chorus', 'bridge', 'inst' (instrumental), 'outro', or 'other'. The 'other' category can be used for segments that do not fit into the primary six functional labels, or for non-musical content if explicitly defined by the dataset annotations being mapped.
Data
Collections
The evaluation will utilize datasets previously established in MIREX. Annotations from these diverse collections will be mapped to the seven target functional labels for consistent evaluation.
- The MIREX 2009 Collection: 297 pieces, largely derived from the work of the Beatles.
- MIREX 2010 RWC collection: 100 pieces of popular music. This collection has two sets of ground-truth annotations: the first was originally distributed with the RWC dataset, and the second provides segment boundary annotations (see Pechuho et al., 2010 for details).
- MIREX 2012 dataset: Over 1,000 annotated pieces covering a range of musical styles, with the majority annotated by two independent annotators.
Participants should be aware that original labels in these datasets (e.g., 'verse1', 'solo', 'fade-out') will need to be mapped to the seven specified functional categories for evaluation. Guidelines for this mapping will be provided, or a standard mapping will be applied during evaluation.
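For illustration, a minimal mapping sketch in Python might look like the following; the raw labels shown and the specific mapping choices are assumptions for the example, not the official mapping.

    # Illustrative mapping from raw dataset labels to the seven target categories.
    # The raw labels and choices below are assumptions; the official mapping
    # provided for the evaluation may differ.
    RAW_TO_FUNCTIONAL = {
        "verse1": "verse",
        "verse2": "verse",
        "refrain": "chorus",
        "solo": "inst",
        "fade-out": "outro",
        "silence": "other",
    }

    def map_label(raw_label: str) -> str:
        """Map a raw annotation label to one of the seven functional categories."""
        return RAW_TO_FUNCTIONAL.get(raw_label.lower().strip(), "other")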
Audio Formats (Input to Algorithms)
Algorithms should be prepared to process audio with the following characteristics:
- Sample rate: 44.1 kHz
- Bit depth: 16 bit
- Number of channels: 1 (mono)
- Encoding: WAV
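For reference, a minimal sketch of loading a test file with these characteristics using librosa (the filename is hypothetical):

    import librosa

    # Load a (hypothetical) test file as 44.1 kHz mono, matching the evaluation audio format.
    audio, sr = librosa.load("track01.wav", sr=44100, mono=True)
    print(audio.shape, sr)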
Submission Format
Submissions will be handled via CodeBench. Participants must upload a single file, in the format detailed below, containing the segmentation results for all test audio files.
Output Data Format
The output must be a list of dictionaries in a text-based format (e.g., JSON parsable). Each dictionary in the list corresponds to one audio file and must contain two keys: 'id' (the identifier of the audio file, e.g., '1.wav') and 'result' (a list of segment predictions). Each segment prediction is a list containing two elements: a two-element list with the [start_time, end_time] of the segment in seconds, and the label string for that segment.
The labels must be one of the seven target functional categories: 'intro', 'verse', 'chorus', 'bridge', 'inst', 'outro', 'other'.
Example of the content of the submitted file:
    [
      {
        "id": "track01.wav",
        "result": [
          [[0.000, 15.500], "intro"],
          [[15.500, 45.230], "verse"],
          [[45.230, 75.800], "chorus"],
          [[75.800, 90.000], "outro"]
        ]
      },
      {
        "id": "track02.wav",
        "result": [
          [[0.000, 20.100], "verse"],
          [[20.100, 38.500], "chorus"],
          [[38.500, 55.000], "verse"],
          [[55.000, 72.600], "chorus"],
          [[72.600, 89.000], "bridge"],
          [[89.000, 105.000], "chorus"],
          [[105.000, 115.500], "outro"]
        ]
      }
    ]
Ensure that the end_time of each segment equals the start_time of the next segment, and that the segments cover the entire duration of the analyzed piece. The first segment must start at 0.0.
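As an illustration, the following sketch serializes per-track predictions into the required structure and writes a single JSON-parsable file; the track identifiers, times, and labels are made up for the example.

    import json

    # Illustrative predictions: {audio_id: [((start_time, end_time), label), ...]}
    predictions = {
        "track01.wav": [((0.0, 15.5), "intro"), ((15.5, 45.23), "verse"), ((45.23, 90.0), "outro")],
    }

    submission = []
    for audio_id, segments in predictions.items():
        submission.append({
            "id": audio_id,
            "result": [[[round(start, 3), round(end, 3)], label]
                       for (start, end), label in segments],
        })

    # One file containing the segmentation results for all test audio files.
    with open("submission.json", "w") as f:
        json.dump(submission, f, indent=2)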
Evaluation Procedures
Evaluation will focus on both the accuracy of the detected segment boundaries and the correctness of the assigned functional labels. The primary metrics are:
- Frame-Level Accuracy (ACC):
  - Both the system output and the ground truth are converted into time series of labels at a fine temporal resolution (e.g., 10 ms or 100 ms frames). Accuracy is the proportion of frames whose predicted label matches the ground-truth label, computed across the entire dataset. This metric evaluates the overall correctness of segment labels and their temporal extents.
- Boundary Retrieval Hit Rate F-Measures (HR.5F and HR3F):
  - These metrics assess the system's ability to correctly identify segment boundaries.
  - A predicted boundary is counted as a hit if it falls within a tolerance window of a ground-truth boundary.
  - Two tolerance windows are used: 0.5 seconds for finer precision, and 3.0 seconds for coarser, more perceptually relevant boundaries.
  - Based on these hits, precision (P), recall (R), and F-measure (F1) are computed for boundary detection at both tolerance levels.
  - The reported metrics are HR.5F (F-measure with a 0.5 s tolerance) and HR3F (F-measure with a 3 s tolerance). A sketch of computing both metric families follows this list.
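As an illustration of how these metrics can be computed, here is a minimal sketch using mir_eval for the boundary hit rates and plain NumPy for frame-level accuracy; the intervals, labels, and 100 ms frame hop are assumptions for the example, and the official evaluation code may differ.

    import numpy as np
    import mir_eval

    # Illustrative ground-truth and estimated segmentations for one track
    # (intervals in seconds, contiguous and starting at 0, as required above).
    ref_intervals = np.array([[0.0, 15.5], [15.5, 45.0], [45.0, 90.0]])
    ref_labels = ["intro", "verse", "chorus"]
    est_intervals = np.array([[0.0, 14.0], [14.0, 46.0], [46.0, 90.0]])
    est_labels = ["intro", "verse", "chorus"]

    # Boundary retrieval: precision, recall, F-measure at 0.5 s and 3 s tolerances.
    p05, r05, hr_05f = mir_eval.segment.detection(ref_intervals, est_intervals, window=0.5)
    p3, r3, hr_3f = mir_eval.segment.detection(ref_intervals, est_intervals, window=3.0)

    def frame_labels(intervals, labels, duration, hop=0.1):
        """Sample a label sequence on a fixed frame grid (100 ms hop assumed here)."""
        times = np.arange(0.0, duration, hop)
        out = np.empty(len(times), dtype=object)
        for (start, end), label in zip(intervals, labels):
            out[(times >= start) & (times < end)] = label
        return out

    duration = ref_intervals[-1, 1]
    acc = np.mean(frame_labels(ref_intervals, ref_labels, duration)
                  == frame_labels(est_intervals, est_labels, duration))
    print(f"HR.5F={hr_05f:.3f}  HR3F={hr_3f:.3f}  ACC={acc:.3f}")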
Baseline
The method described in Kim, T., & Nam, J. (2023), "All-in-one metrical and functional structure analysis with neighborhood attentions on demixed audio" (WASPAA 2023), will serve as the baseline for this task. Participants are encouraged to develop systems that surpass this baseline.
Relevant Development Collections
While the MIREX datasets will be used for evaluation, participants may find the following publicly available annotated corpora useful for development. Please note that the annotations in these corpora will also need to be mapped to the 7-class functional labeling scheme if used for training models for this task.
- Jouni Paulus's structure analysis page links to a corpus of 177 Beatles songs (zip file). The TUTstructure07 dataset, containing 557 songs, is also listed here.
- Ewald Peiszer's thesis page links to a portion of his corpus: 43 non-Beatles pop songs (including 10 J-pop songs) (zip file).
Together, the downloadable portions of these corpora provide over 200 annotated songs that can be adapted for development purposes.
Time and Hardware Limits
Due to the nature of the CodeBench platform and the potentially high number of participants, limits on the runtime and computational resources for submissions may be imposed. Specific details regarding these limits will be provided closer to the submission deadline. A general guideline is that analysis should be computationally feasible. For reference, a hard limit of 24 hours for total analysis time over the evaluation dataset was imposed in previous iterations, and a similar constraint might apply.