2008:Audio Music Mood Classification
Revision as of 22:09, 11 August 2008
2008 AMC EVALUATION SCENARIO OVERVIEW
This section is put here to clarify what will happen for this year's run of the Audio Mood Classification (AMC) task.
- We will operate the AMC task as a classic train-test classification task.
- We will n-fold the runs with n to be determined by the size of the final data set, number of participants, etc.
- We will hand-craft the n-fold test-train split lists.
- We will NOT be doing post-run human mood judgments this year using the Evalutron 6000.
- Audio files: 30 sec., 22kHz, mono, 16 bit
Do take a look at the Audio Genre Classification task wiki as we are basing the underlying structure of this task on Audio Genre. In fact, an Audio Genre submission should work out of the box with Audio Mood Classification. Note: we really want folks to do a FEATURE EXTRACTION phase first against all the files and then have these features cached some place for re-use during the TRAIN-TEST phase. This way we can really speed up the n-fold processing. Thus, like GENRE, we need to pass three input files to your algos:
1. Feature extraction list file
The list file passed for feature extraction will be a simple ASCII list file. This file will contain one path per line with no header line.
2. Training list file
The list file passed for model training will be a simple ASCII list file. This file will contain one path per line, followed by a tab character and the mood label, again with no header line.
E.g. <example path and filename>\t<mood classification>
3. Test (classification) list file
The list file passed for testing classification will be a simple ASCII list file identical in format to the Feature extraction list file. This file will contain one path per line with no header line.
Classification output files
Participating algorithms should produce a simple ASCII list file identical in format to the Training list file. This file will contain one path per line, followed by a tab character and the MOOD label, again with no header line. E.g.:
<example path and filename>\t<mood classification>
The path to which this list file should be written must be accepted as a parameter on the command line.
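To make the file handling concrete, here is a small Python sketch (the function names are ours, not part of the spec) that parses a training list and writes a results file in the formats described above:

```python
def read_training_list(path):
    """Parse a headerless, tab-separated list of <audio path>\t<mood label> lines."""
    pairs = []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                clip, label = line.split("\t", 1)
                pairs.append((clip, label))
    return pairs

def write_results(path, predictions):
    """Write one <audio path>\t<mood label> line per test clip, no header."""
    with open(path, "w") as f:
        for clip, label in predictions:
            f.write(f"{clip}\t{label}\n")
```

Note that the output format is identical to the training list format, so the same parser can be reused to check a results file.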
Participants
If you think there is a slight chance that you might consider participating, please add your name and email address here.
- Haiba Wang, haiba_access@yahoo.cn
- IMIRSEL, xiaohu@illinois.edu
- Michael Mandel, mim (at) ee.columbia.edu
Introduction
In music psychology and music education, the emotional component of music has been recognized as most strongly associated with music expressivity (e.g., Juslin et al. 2006, #Related Papers). Music information behavior studies (e.g., Cunningham, Jones and Jones 2004; Vignoli 2004; Cunningham, Bainbridge and Falconer 2006, #Related Papers) have also identified music mood/emotion as an important criterion used by people in music seeking and organization. Several experiments have been conducted in the MIR community to classify music by mood (e.g., Lu, Liu and Zhang 2006; Pohle, Pampalk and Widmer 2005; Mandel, Poliner and Ellis 2006; Feng, Zhuang and Pan 2003, #Related Papers). Please note: the MIR community tends to use the word "mood" while music psychologists prefer "emotion". We follow the MIR tradition and use "mood" hereafter.
However, evaluation of music mood classification is difficult because music mood is a very subjective notion. Each of the aforementioned experiments used different mood categories and different datasets, making comparison across previous work virtually impossible. A contest on music mood classification in MIREX will help build the first community-available test set and valuable ground truth.
This is the first time MIREX has attempted a music mood classification evaluation. There are many issues involved in this evaluation task, so let us start discussing them on this wiki. If needed, we will set up a mailing list devoted to the discussion.
Mood Categories
IMIRSEL has derived a set of 5 mood clusters from the AMG mood repository (Hu & Downie 2007, #Related Papers). The mood clusters effectively reduce the diverse mood space to a tangible set of categories, yet remain rooted in the social-cultural context of pop music. Therefore, we propose to use the 5 mood clusters as the categories in this year's audio mood classification contest. Each cluster is a collection of AMG mood labels which collectively define the cluster:
- Cluster_1: passionate, rousing, confident, boisterous, rowdy
- Cluster_2: rollicking, cheerful, fun, sweet, amiable/good natured
- Cluster_3: literate, poignant, wistful, bittersweet, autumnal, brooding
- Cluster_4: humorous, silly, campy, quirky, whimsical, witty, wry
- Cluster_5: aggressive, fiery, tense/anxious, intense, volatile, visceral
At this moment, IMIRSEL and Cyril Laurier at the Music Technology Group of Barcelona have manually validated the mood clusters and the exemplar songs in each cluster. Please see #Exemplar Songs in Each Category for details.
We are still seeking additional songs across different genres to enrich this set. During this process, the cluster with the least cross-listener consistency may be dropped, or two clusters that are often confused with each other may be combined.
Exemplar Songs in Each Category
Exemplar songs for each mood cluster are manually selected by multiple human assessors. The purpose is to further clarify the perceptual identities of the mood clusters.
There are 190 candidate songs in the intersection of the AMG mood repository and the USPOP collection at IMIRSEL, and each of these songs has a single, unanimous mood cluster label assigned by AMG editors. The mood labels by AMG editors are an important benchmark which can help us reach cross-listener consistency on such a subjective task. So far, 6 human assessors have listened to the 190 songs and assigned cluster labels to them. 50 songs were labeled unanimously by all 6 human assessors, 42 songs by 5 of the 6, and another 40 songs by 4 of the 6.
The advantages of the exemplar songs are twofold: 1. they will help people better understand what kind of mood each cluster refers to; 2. they can possibly be taken as training data for the algorithms (see #Training Set).
Note on lyrics: when labeling the songs, the human assessors were asked to ignore lyrics. As this contest focuses on music audio, lyrics should not be taken into consideration.
Two Evaluation Scenarios
1. Evaluation on a closed groundtruth set. As in traditional classification problems, both training and testing data are labeled well before the contest. Pros: evaluation metrics are more rigorous; supports cross-validation. Cons: the training/testing set is limited.
2. Training on a labeled set, but testing on an unlabeled audio pool. As in the audio similarity and retrieval contest, each algorithm returns a list of candidates in each mood category, then human assessors make judgments on the returned candidates. Pros: the testing pool can be arbitrarily big; the training set is bigger as well (it can be the whole groundtruth set of scenario 1). Cons: innovative but limited evaluation metrics (see below).
For both scenarios, this is a single-label classification contest, and thus each song can only be classified into one mood cluster.
We will go for scenario 1.
Groundtruth Set
IMIRSEL is preparing a ground-truth set of audio clips selected from the USPOP collection described above and the APM collection (www.apmmusic.com). The bibliographic information of the exemplar songs has been released above to help participants reach agreement on the meanings of the mood categories.
The APM audio set has been pre-labeled with the 5 mood clusters according to the metadata provided by APM, and covers a variety of genres: each category covers about 7 major genres (with 20-30 tracks each) and a few minor genres. To make the problem more interesting, the distribution among major genres within each category is made as even as possible.
To make sure the mood labels are correct, this APM audio collection will be subjected to human validation before the contest. We prepared a set of 1250 audio clips (250 per category). The audio clips whose mood category assignments reach agreement among at least 2 out of 3 human assessors will form the ground truth set. We are aiming for at least 120 audio clips in each mood category.
After the human validation of this audio set, participating algorithms/models will be trained and tested within IMIRSEL.
Audio format: 30 second clips, 22.05kHz, mono, 16bit, WAV files
Human Validation
Subjective judgments by human assessors will be collected for the above-mentioned APM audio set using Evalutron 6000, a web-based system developed by IMIRSEL.
Each audio clip is 30 seconds long; 3 human judges will listen to it and choose which mood category it belongs to. If 2 of the 3 judges agree on its category, the clip is selected into the groundtruth set.
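As an illustration, the 2-of-3 agreement rule could be implemented as follows (a hypothetical sketch; `select_groundtruth` and its input layout are our own, not part of Evalutron 6000):

```python
from collections import Counter

def select_groundtruth(judgments, min_agree=2):
    """Keep clips whose most common judged category reaches min_agree votes.

    judgments: dict mapping clip id -> list of category labels, one per judge.
    Returns a dict of clip id -> agreed category.
    """
    selected = {}
    for clip, labels in judgments.items():
        category, votes = Counter(labels).most_common(1)[0]
        if votes >= min_agree:
            selected[clip] = category
    return selected
```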
Evaluation Metrics
Metrics frequently used in classification problems include accuracy, precision, recall and the F-measure (combining precision and recall). The single most important metric is accuracy, which allows direct system comparison:
Accuracy = # of correctly classified songs / # of all songs.
Accuracy can be calculated over all clips pooled together regardless of cluster (micro average) or per cluster and then averaged across clusters (macro average).
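To make the two averages concrete, here is a small Python sketch (the function names are ours) computing pooled and per-cluster accuracy from ground-truth and predicted label mappings:

```python
def micro_accuracy(truth, pred):
    """Pooled accuracy: correctly classified clips / all clips."""
    correct = sum(1 for clip, label in truth.items() if pred.get(clip) == label)
    return correct / len(truth)

def macro_accuracy(truth, pred):
    """Accuracy computed per cluster, then averaged over clusters."""
    clusters = {}
    for clip, label in truth.items():
        clusters.setdefault(label, []).append(clip)
    per_cluster = [
        sum(1 for c in clips if pred.get(c) == label) / len(clips)
        for label, clips in clusters.items()
    ]
    return sum(per_cluster) / len(per_cluster)
```

The two figures differ when the clusters are unbalanced: macro averaging weights every cluster equally, while micro averaging weights every clip equally.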
Significance of differences among systems will be tested, possibly using:
- a) McNemar's test
McNemar's test (Dietterich, 1997) is a statistical procedure that can validate the significance of differences between two classifiers. It was used in the Audio Genre Classification and Audio Artist Identification contests in MIREX 2005.
- b) Friedman's test
Friedman's test is used to detect differences in treatments across multiple test attempts (http://en.wikipedia.org/wiki/Friedman_test). It was used in the Audio Similarity, Audio Cover Song, and Query by Singing/Humming contests in MIREX 2006.
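As a sketch of how McNemar's test could be applied here (the helper below is ours and uses the common continuity-corrected statistic; the result would be compared against a chi-square distribution with 1 degree of freedom, critical value about 3.84 at p = 0.05):

```python
def mcnemar_statistic(correct_a, correct_b):
    """Continuity-corrected McNemar statistic from two per-clip
    correctness vectors (lists of booleans over the same test clips).

    b = clips system A got right and B got wrong; c = the reverse.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    if b + c == 0:
        return 0.0  # the systems never disagree
    return (abs(b - c) - 1) ** 2 / (b + c)
```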
In addition, run time can be recorded and compared.
Important Dates
- Human Validation for Groundtruth Set: August 1 - August 15
- Algorithm Submission Deadline: August 25
Packaging your Submission
- Be sure that your submission follows the #Submission_Format outlined below.
- Be sure that your submission accepts the proper #Input_File format
- Be sure that your submission produces the proper #Output_File format
- Be sure to follow the [Best Coding Practices for MIREX]
- Be sure to follow the MIREX 2008 Submission Instructions
- In the README file that is included with your submission, please answer the following additional questions:
- Approximately how long will the submission take to process ~1000 wav files?
- Approximately how much scratch disk space will the submission need to store any feature/cache files?
- Any special notice regarding running your algorithm
Note that the information that you place in the README file is extremely important in ensuring that your submission is evaluated properly.
Submission Format
A submission to the Audio Music Mood Classification evaluation is expected to follow the Best Coding Practices for MIREX and must conform to the following for execution:
One Call Format
The one call format is appropriate for systems that perform all phases of the classification (typically feature extraction, training and testing) in one step. A submission should be an executable program that takes 4 arguments:
- path/to/fileContainingListOfTrainingAudioClips - the path to the list of training audio clips (see #File Formats below)
- path/to/fileContainingListOfTestingAudioClips - the path to the list of testing audio clips (see #File Formats below)
- path/to/cacheDir - a directory where the submission can place temporary or scratch files. Note that the contents of this directory can be retained across runs, so if, for whatever reason, the submission needs to be restarted, the submission could make use of the contents of this directory to eliminate the need for reprocessing some inputs.
- path/to/output/Results - the file where the output classification results should be placed. (see #File Formats below)
Example:
doAMC "path/to/fileContainingListOfTrainingAudioClips" "path/to/fileContainingListOfTestingAudioClips" "path/to/cacheDir" "path/to/output/Results"
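A minimal Python skeleton honoring the one call format might look like the following (the placeholder classifier only illustrates the calling convention; a real submission would extract features from the training clips and train a model):

```python
import sys

def main(argv):
    """Skeleton for the one call format: doAMC trainList testList cacheDir results."""
    train_list, test_list, cache_dir, results_path = argv[1:5]
    # Read the testing list: one audio path per line, no header.
    with open(test_list) as f:
        test_clips = [line.strip() for line in f if line.strip()]
    # Write one <path>\t<mood label> line per clip (placeholder label here).
    with open(results_path, "w") as out:
        for clip in test_clips:
            out.write(f"{clip}\tCluster_1\n")

if __name__ == "__main__" and len(sys.argv) >= 5:
    main(sys.argv)
```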
Two Call Format
The two call format is appropriate for systems that perform training and testing separately. A submission should consist of two executable programs:
- trainAMC - this takes 3 arguments:
- path/to/fileContainingListOfTrainingAudioClips - the path to the list of training audio clips (see #File Formats below)
- path/to/trainingCacheDir - a directory where the submission can place temporary or scratch files. Note that the contents of this directory can be retained across runs, so if, for whatever reason, the submission needs to be restarted, the submission could make use of the contents of this directory to eliminate the need for reprocessing some inputs.
- path/to/trainedClassificationModel - the file where the classification model should be placed
- testAMC - this takes 4 arguments:
- path/to/trainedClassificationModel
- path/to/fileContainingListofTestingAudioClips - the path to the list of testing audio clips (see #File Formats below)
- path/to/testingCacheDir - a directory where the submission can place temporary or scratch files.
- path/to/output/Results - the file where the output classification results should be placed. (see #File Formats below)
Example:
trainAMC "path/to/fileContainingListOfTrainingAudioClips" "path/to/trainingcacheDir" "path/to/trainedClassificationModel"
testAMC "path/to/trainedClassificationModel" "path/to/fileContainingListofTestingAudioClips" "path/to/testingCacheDir" "path/to/output/Results"
Matlab format
Matlab will also be supported in the form of functions in the following formats:
Matlab One call format
doMyMatlabAMC('path/to/fileContainingListOfTrainingAudioClips','path/to/fileContainingListOfTestingAudioClips','path/to/cacheDir','path/to/output/Results')
Matlab Two call format
doMyMatlabTrainAMC('path/to/fileContainingListOfTrainingAudioClips','path/to/trainingcacheDir','path/to/trainedClassificationModel')
doMyMatlabTestAMC('path/to/trainedClassificationModel','path/to/fileContainingListofTestingAudioClips','path/to/testingCacheDir','path/to/output/Results')
File Formats
Input Files
The input training list file format will be of the form:
path/to/training/audio/file/000001.wav\tCluster_3
path/to/training/audio/file/000002.wav\tCluster_5
path/to/training/audio/file/000003.wav\tCluster_2
...
path/to/training/audio/file/00000N.wav\tCluster_1
"\t" stands for tab.
The input testing list file format will be of the form:
path/to/testing/audio/file/000010.wav
path/to/testing/audio/file/000020.wav
path/to/testing/audio/file/000030.wav
...
path/to/testing/audio/file/0000N0.wav
Output File
The only output will be a file containing classification results in the following format:
Example Classification Results 0.1 (replace this line with your system name)
path/to/testing/audio/file/000010.wav\tCluster_3
path/to/testing/audio/file/000020.wav\tCluster_1
path/to/testing/audio/file/000030.wav\tCluster_5
...
path/to/testing/audio/file/0000N0.wav\tCluster_2
"\t" indicates tab. All audio clips should have one and only one mood cluster label.
Evaluation Scenario 2
Training Set
Under evaluation scenario 2, the training set would be the whole ground truth set in scenario 1 (see #Groundtruth Set).
Unlabeled Song Pool
Under evaluation scenario 2, the pool of testing audio to be classified is drawn from the same collections as the training set, i.e. USPOP and APM. We will make sure the audio covers a variety of genres in each mood cluster, which will make the contest harder and more interesting.
We will randomly select a certain number (say, 1000) of songs from the collections as the audio pool. This number should make the contest interesting enough, but not too hard, and the songs need to cover all 5 mood clusters.
Classification Results
Each algorithm will return the top X songs in each cluster.
This is a single-label classification contest, and thus each song can only be classified into one mood cluster.
Note: unlike traditional classification problems where all testing samples have ground truth available, this scenario does not have a fully labeled testing set. Instead, we use a "pooling" approach as in TREC and last year's audio similarity and retrieval contest. This approach collects the top X results from each algorithm and asks human assessors to make judgments on this set of collected results, while assuming all other samples are irrelevant or incorrect. This approach cannot measure an absolute "recall" metric, but it is valid for comparing relative performance among participating algorithms.
The actual value of X depends on human assessment protocol and number of available human assessors (see next section #Human Assessment).
Human Assessment
Subjective judgments by human assessors will be collected for the pooled results using Evalutron 6000, a web-based system developed by IMIRSEL. (A walkthrough of Evalutron 6000 for this task is shown here: Evalutron6000_Walkthrough_For_Audio_Mood_Classification.)
How many judgments and assessors
Each algorithm returns X songs for each of the 5 mood clusters. Suppose there are Y algorithms; in the worst case, each cluster will have X*Y songs to be judged, or 5*X*Y songs in total. Suppose each song needs Z sets of ears; then there will be 5*X*Y*Z judgments in total. When making a judgment, a human assessor will listen to the 30-second clip of a song and label it with one of the 5 mood clusters.
Human evaluators will be drawn from the participating labs, IMIRSEL, and volunteers on the MIREX lists. Suppose we can get W evaluators; then each evaluator will evaluate S = (5*X*Y*Z) / W songs.
At this moment, there are 10 potential participants on the Wiki, so let's say Y = 6. Suppose each candidate song will be evaluated by 3 judges (Z = 3), and suppose we can get 20 assessors (W = 20):
- If X = 20, number of judgments for each assessor: S = 90
- If X = 10, S = 45
- If X = 30, S = 135
- If X = 50, S = 225
- If X = 15, S = 67.5
- …
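The load formula above is simple enough to check in a few lines of Python (the parameter defaults follow the assumed values Y = 6, Z = 3, W = 20; the function name is ours):

```python
def judgments_per_assessor(x, y=6, z=3, w=20, clusters=5):
    """S = (clusters * X * Y * Z) / W, the per-assessor judging load."""
    return clusters * x * y * z / w
```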
In the audio similarity contest last year, each assessor made 205 judgments on average. As judging mood is trickier, we may need to place less of a burden on our assessors.
To eliminate possible bias, we will try to equally distribute candidates returned by each algorithm among human assessors.
Scoring
Each algorithm is graded by the number of votes its candidate songs win from the judges. For example, if a song A is judged to be in Cluster_1 by 2 assessors and in Cluster_2 by 1 assessor, then an algorithm classifying A into Cluster_1 scores 2 on this song, while an algorithm classifying A into Cluster_2 scores 1. An algorithm's final score is the sum of its scores on all the songs it submits. Since each algorithm submits the same number of songs (5*X), the one that wins the most votes from the judges wins the contest.
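The vote-counting scheme can be sketched as follows (the data layout and function name are ours, for illustration only):

```python
from collections import Counter

def algorithm_score(submissions, judge_votes):
    """Sum, over an algorithm's submitted (song, cluster) pairs, the number
    of judges who voted that song into that cluster.

    submissions: list of (song, cluster) pairs the algorithm returned.
    judge_votes: dict mapping song -> list of cluster labels voted by judges.
    """
    score = 0
    for song, cluster in submissions:
        score += Counter(judge_votes.get(song, [])).get(cluster, 0)
    return score
```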
Evaluation Metrics
The algorithm score described in the last section is a metric that facilitates direct comparison.
In addition, metrics frequently used in classification problems include accuracy, precision, recall and the F-measure (combining precision and recall). As mentioned above, the pooling approach yields only a relative recall measure; therefore, the single most important metric is accuracy:
The original definition of accuracy is: Accuracy = # of correctly classified songs / # of all songs.
According to the above human assessment method, "correctly classified songs" in this scenario are songs classified as the majority vote of the judges and, in the case of ties, songs classified as any of the tied votes. For example, suppose each song has 3 judges. If a song is labeled as Cluster_1 by at least 2 judges, then this song will be counted as correct for algorithms classifying it into Cluster_1; if a song is labeled as Cluster_1, Cluster_2 and Cluster_3 once each, then this song will be counted as correct for algorithms classifying it into Cluster_1, Cluster_2 or Cluster_3.
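The majority-or-tie rule can be expressed compactly (a sketch; the function name is ours):

```python
from collections import Counter

def correct_labels(judge_labels):
    """Return the set of labels counted as 'correct' for a song:
    the majority label, or all tied top labels when there is no majority."""
    counts = Counter(judge_labels)
    top = max(counts.values())
    return {label for label, n in counts.items() if n == top}
```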
Accuracy can be calculated over all clips pooled together regardless of cluster (micro average) or per cluster and then averaged across clusters (macro average).
Significance of differences among systems will be tested, possibly using:
- a) McNemar's test
- b) Friedman's test
In addition, run time can be recorded and compared.
Challenging Issues
- Mood-changeable pieces: some pieces may start in one mood but end in another.
We will use 30-second clips instead of whole songs. The clips will be extracted automatically from the middle of each song, which is more likely to be representative.
- Multiple-label classification: it is possible that one piece can have two or more correct mood labels, but as a start, we strongly suggest holding a less confusing contest and leaving this challenge to future MIREXs. So, for this year, this is a single-label classification problem.
Moderators
- J. Stephen Downie (IMIRSEL, University of Illinois, USA) - [1]
- Xiao Hu (IMIRSEL, University of Illinois, USA) -[2]
- Cyril Laurier (Music Technology Group, Barcelona, Spain) -[3]
Related Papers
- Dietterich, T. (1997). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7), 1895-1924.
- Hu, Xiao and J. Stephen Downie (2007). Exploring mood metadata: Relationships with genre, artist and usage metadata. In the Eighth International Conference on Music Information Retrieval (ISMIR 2007), Vienna, September 23-27, 2007.
- Juslin, P. N., Karlsson, J., Lindström, E., Friberg, A. and Schoonderwaldt, E. (2006). Play It Again With Feeling: Computer Feedback in Musical Communication of Emotions. Journal of Experimental Psychology: Applied, Vol. 12, No. 2, 79-95.
- Vignoli (ISMIR 2004) Digital Music Interaction Concepts: A User Study
- Cunningham, Jones and Jones (ISMIR 2004) Organizing Digital Music For Use: An Examination of Personal Music Collections.
- Cunningham, Bainbridge and Falconer (ISMIR 2006) More of an Art than a Science': Supporting the Creation of Playlists and Mixes.
- Lu, Liu and Zhang (2006), Automatic Mood Detection and Tracking of Music Audio Signals. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 1, January 2006. Part of this paper appeared in ISMIR 2003: http://ismir2003.ismir.net/papers/Liu.PDF
- Pohle, Pampalk, and Widmer (CBMI 2005) Evaluation of Frequently Used Audio Features for Classification of Music into Perceptual Categories. It separates "mood" and "emotion" as two classification dimensions, which are mostly combined in other studies.
- Mandel, Poliner and Ellis (2006) Support vector machine active learning for music retrieval. Multimedia Systems, Vol. 12(1), August 2006.
- Feng, Zhuang and Pan (SIGIR 2003) Popular music retrieval by detecting mood
- Li and Ogihara (ISMIR 2003) Detecting emotion in music
- Hilliges, Holzer, Klüber and Butz (2006) AudioRadar: A metaphorical visualization for the navigation of large music collections. In Proceedings of the International Symposium on Smart Graphics 2006, Vancouver, Canada. It summarized implicit problems in traditional genre/artist-based music organization.
- Juslin, P. N., & Laukka, P. (2004). Expression, perception, and induction of musical emotions: A review and a questionnaire study of everyday listening. Journal of New Music Research, 33(3), 217-238.
- Yang, Liu, and Chen (ACMMM 2006) Music emotion classification: A fuzzy approach