Difference between revisions of "2018:Patterns for Prediction Results"

From MIREX Wiki
m (Training and Test Datasets)
m
 
(28 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 
== Introduction ==
 
== Introduction ==
  
THIS PAGE IS UNDER CONSTRUCTION!
+
'''In brief''':
  
The task: ...
+
We look for
 +
 
 +
(1) Algorithms that take an excerpt of music as input (the prime), and output a predicted continuation of the excerpt.
 +
 
 +
(2) Additionally or alternatively, algorithms that take a prime and one or more continuations as input, and output the likelihood that each continuation is the genuine extension of the prime.
 +
 
 +
'''In more detail''': One facet of human nature comprises the tendency to form predictions about what will happen in the future (Huron, 2006). Music, consisting of complex temporally extended sequences, provides an excellent setting for the study of prediction, and this topic has received attention from fields including but not limited to psychology (Collins, Tillmann, et al., 2014; Janssen, Burgoyne and Honing, 2017; Schellenberg, 1997; Schmukler, 1989), neuroscience (Koelsch et al., 2005), music theory (Gjerdingen, 2007; Lerdahl & Jackendoff, 1983; Rohrmeier & Pearce, 2018), music informatics (Conklin & Witten, 1995; Cherla et al., 2013), and machine learning (Elmsley, Weyde, & Armstrong, 2017; Hadjeres, Pachet, & Nielsen, 2016; Gjerdingen, 1989; Roberts et al., 2018; Sturm et al., 2016). In particular, we are interested in the way exact and inexact repetition occurs over the short, medium, and long term in pieces of music (Margulis, 2014; Widmer, 2016), and how these repetitions may interact with "schematic, veridical, dynamic, and conscious" expectations (Huron, 2006) in order to form a basis for successful prediction.
 +
 
 +
We call for algorithms that may model such expectations so as to predict the next musical events based on given, foregoing events (the prime). We invite contributions from all fields mentioned above (not just pattern discovery researchers), as different approaches may be complementary in terms of predicting correct continuations of a musical excerpt. We would like to explore these various approaches to music prediction in a MIREX task. For subtask (1) above (see "In brief"), the development and test datasets will contain an excerpt of a piece up until a cut-off point, after which the algorithm is supposed to generate the next N musical events up until 10 quarter-note beats, and we will quantitatively evaluate the extent to which an algorithm's continuation corresponds to the genuine continuation of the piece. For subtask (2), in addition to containing a prime, the development and test datasets will also contain continuations of the prime, one of which will be genuine, and the algorithm should rate the likelihood that each continuation is the genuine extension of the prime, which again will be evaluated quantitatively.
 +
 
 +
What is the relationship between pattern discovery and prediction? The last five years have seen an increasing interest in algorithms that discover or generate patterned data, leveraging methods beyond typical (e.g., Markovian) limits (Collins & Laney, 2017; MIREX Discovery of Repeated Themes & Sections task; Janssen, van Kranenburg and Volk, 2017; Ren et al., 2017; Widmer, 2016). One of the observations to emerge from the above-mentioned MIREX pattern discovery task is that an algorithm that is "good" at discovering patterns ought to be extendable to make "good" predictions for what will happen next in a given music excerpt (Meredith, 2013). Furthermore, evaluating the ability to predict may provide a stronger (or at least complementary) evaluation of an algorithm's pattern discovery capabilities, compared to evaluating its output against expert-annotated patterns, where the notion of "ground truth" has been debated (Meredith, 2013).
  
 
== Contribution ==
 
== Contribution ==
Line 9: Line 19:
 
...
 
...
  
 +
For a more detailed introduction to the task, please see [[2018:Patterns for Prediction]].
  
[[File:mozartK282Mvt2.png|500px]]
+
== Datasets and Algorithms ==
  
'''Figure 1.''' Pattern discovery v segmentation. (A) Bars 1-12 of Mozart’s Piano Sonata in E-flat major K282 mvt.2, showing some ground-truth themes and repeated sections; (B-D) Three linear segmentations. Numbers below the staff in Fig. 1A and below the segmentation in Fig. 1D indicate crotchet beats, from zero for bar 1 beat 1.
+
The Patterns for Prediction Development Dataset (PPDD-Jul2018) has been prepared by processing a randomly selected subset of the [http://colinraffel.com/projects/lmd/ Lakh MIDI Dataset] (LMD, Raffel, 2016). It has audio and symbolic versions crossed with monophonic and polyphonic versions. The audio is generated from the symbolic representation, so it is not "expressive". The symbolic data is presented in CSV format. For example,
  
 +
20,64,62,0.5,0
 +
20.66667,65,63,0.25,0
 +
21,67,64,0.5,0
 +
...
  
For a more detailed introduction to the task, please see [[2018:Patterns for Prediction]].
+
would be the start of a prime where the first event had ontime 20 (measured in quarter-note beats -- equivalent to bar 6 beat 1 if the time signature were 4-4), MIDI note number (MNN) 64, estimated morphetic pitch number 62 (see [http://tomcollinsresearch.net/research/data/mirex/ppdd/mnn_mpn.pdf p. 352] from Collins, 2011 for a diagrammatic explanation; for more details, see Meredith, 1999), duration 0.5 in quarter-note beats, and channel 0. Re-exports to MIDI are also provided, mainly for listening purposes. We also provide a descriptor file containing the original Lakh MIDI Dataset id, the BPM, time signature, and a key estimate. The audio dataset contains all these files, plus WAV files. Therefore, the audio and symbolic variants are identical to one another, apart from the presence of WAV files. All other variants are non-identical, although there may be some overlap, as they were all chosen from LMD originally.
  
== Training and Test Datasets ==
+
The provenance of the Patterns for Prediction Test Dataset (PPTD) will '''not''' be disclosed, but it shares simiarlity with LMD and not from LMD, if you are concerned about overfitting.
  
...
+
There are small (100 pieces), medium (1,000 pieces), and large (10,000 pieces) variants of each dataset, to cater to different approaches to the task (e.g., a point-set pattern discovery algorithm developer may not want/need as many training examples as a neural network researcher). Each prime lasts approximately 35 sec (according to the BPM value in the original MIDI file) and each continuation covers the subsequent 10 quarter-note beats. We would have liked to provide longer primes (as 35 sec affords investigation of medium- but not really long-term structure), but we have to strike a compromise between ideal and tractable scenarios.
  
 +
Submissions to the symMono and symPoly variants of the tasks are listed in Table 1. There were no submissions to the audMono or audPoly variants of the tasks this year. The task captains prepared a first-order Markov model (MM) over a state space of measure beat and key-centralized MIDI note number. This enabled evaluation of the implicit subtask, and can also serve as a point of comparison for the explicit task. It should be noted, however, that this model had access to the full song/piece – '''not just the prime''' – so it is at an advantage compared to EN1 and FC1 in the explicit task.
  
 
{| border="1" cellspacing="0" style="text-align: left; width: 800px;"
 
{| border="1" cellspacing="0" style="text-align: left; width: 800px;"
Line 36: Line 52:
 
|-
 
|-
 
! EN1
 
! EN1
| Algo name here ||  style="text-align: center;" |  [https://www.music-ir.org/mirex/abstracts/2018/EN1.pdf PDF] || [http://ericpnichols.com/ Eric Nichols]
+
| Seq2SeqP4P ||  style="text-align: center;" |  [https://www.music-ir.org/mirex/abstracts/2018/EN1.pdf PDF] || [http://ericpnichols.com/ Eric Nichols]
 
         |-
 
         |-
 
! FC1
 
! FC1
| Algo name here ||  style="text-align: center;" | [https://www.music-ir.org/mirex/abstracts/2018/FC1.pdf PDF] || [https://scholar.google.com/citations?user=rpZVNKYAAAAJ&hl=en Florian Colombo]
+
| BachProp ||  style="text-align: center;" | [https://www.music-ir.org/mirex/abstracts/2018/FC1.pdf PDF] || [https://scholar.google.com/citations?user=rpZVNKYAAAAJ&hl=en Florian Colombo]
 
|-
 
|-
         ! MM
+
         ! MM1
| Markov model  ||  style="text-align: center;" | N/A || Intended as 'baseline'
+
| First-order Markov model  ||  style="text-align: center;" | Task captains || For purposes of comparison
 
|-
 
|-
 
         |- style="background: green;"
 
         |- style="background: green;"
Line 51: Line 67:
 
|-
 
|-
 
! FC1
 
! FC1
| Algo name here ||  style="text-align: center;" | [https://www.music-ir.org/mirex/abstracts/2018/FC1.pdf PDF] || [https://scholar.google.com/citations?user=rpZVNKYAAAAJ&hl=en Florian Colombo]
+
| BachProp ||  style="text-align: center;" | [https://www.music-ir.org/mirex/abstracts/2018/FC1.pdf PDF] || [https://scholar.google.com/citations?user=rpZVNKYAAAAJ&hl=en Florian Colombo]
 
|-
 
|-
         ! MM
+
         ! MM1
| Markov model  ||  style="text-align: center;" | N/A || Intended as 'baseline'
+
| First-order Markov model  ||  style="text-align: center;" | Task captains || For purposes of comparison
 
|-
 
|-
 
|}
 
|}
  
'''Table 1.''' Algorithms submitted to Patterns for Prediction 2018.
+
'''Table 1. Algorithms submitted to Patterns for Prediction 2018. Seg2SegPvP and BachProp are models based on LSTM networks.'''
  
 
== Results ==
 
== Results ==
  
Next to testing how well different algorithms compare when measured with the metrics introduced in earlier forms of this track, the goal of this year's run of drts was also to investigate alternative evaluation measures. Next to establishment, occurrence and three-layer measures, which determine the success of an algorithm to find annotated patterns, we also evaluated coverage and lossless compression of the algorithms, i.e., to what extent a piece is covered, or can be compressed, by discovered patterns.
+
We measure the performance of an algorithm to 1) predict a continuation, given a prime (explicit task), and 2) decide which of two versions is the true or foil continuation, given a prime (implicit task). To evaluate performance at the explicit task, we compare the true continuation to the generated continuation, and measure how many pitches and inter-onset intervals (with relation to the last onset of the prime) are correctly predicted at various time intervals after the last note of the prime. To evaluate performance at the implicit task, we measure accuracy as the number of correct decisions, divided by the total amount of decisions. (For mathematical definitions of the various metrics, please see [[2018:Patterns_for_Prediction#Evaluation_Procedure]].)
 
 
(For mathematical definitions of the various metrics, please see [[2017:Discovery_of_Repeated_Themes_&_Sections#Evaluation_Procedure]].)
 
  
 
===SymMono===
 
===SymMono===
VM1, successful in the last years' editions of drts, achieves overall highest results for the establishment metrics (cf. Figures 1-3), i.e., it finds a great number of the annotated patterns. It is only outperformed by other measures on piece 2 of the ground truth, where DM1 and DM3, and the new entry CS7 achieve higher results for establishment F1 score. CS7 is overall successful with respect to finding occurrences of patterns (cf. Figures 4-6), comparable to successful results of previous years. (NB: OL1 did not run on piece 5 of the ground truth, which is why values are missing.) The three-layer measures (Figures 7-9) show varying degrees of agreement with the ground truth for the ground truth pieces, and again CS7 and VM1 compare favourably to previous submissions.
+
For the implicit task, the two LSTM based models make their best predictions close to the cut-off point (i.e., last event of the prime). As the onset time after the cut-off point increases, the Markov Model outperforms the LSTM based models, with the exception of recall of inter-onset interval (cf. Figure 4), where FC1 consistently performs better than the Markov Model. Possibly in consequence of the poorer pitch performance, also fewer relevant pitch-ioi pairs were detected by the LSTM models as onset time after cutoff point increases.
 
 
The new measure coverage shows that VM1, DM1 and DM3 find patterns in almost all notes of the ground truth pieces (cf. Figure 10); other algorithms share this tendency for piece 2 and 5 of the ground truth, but in other pieces, most notably piece 4, CS7 and other algorithms seem to have a very sparse coverage of the piece with patterns. An abundance of overlapping patterns lead to poor lossless compression scores, and CS7 seems to find few overlapping patterns, as its score at lossless compression is overall highest (cf. Figure 11), notably also for piece 2 where it achieved high coverage, too: this means that it found most notes of the piece as patterns, and these patterns can be used to compress the piece very successfully.
 
  
Runtimes were not evaluated this year, as the comparison of the machines of new submissions to runtimes from previous years would not have been very conclusive. The new submission, CS7, completed analysis of the ground truth pieces in a few minutes on a 2 GHz Intel Core i5 machine.
+
For the explicit task, only FC1 was submitted. It outperforms chance level significantly, at 0.87, i.e., picking the correct continuation in almost 90% of the cases (See Table 1).
  
 
===SymPoly===
 
===SymPoly===
DM1, the most successful measure for polyphonic discovery, again compares favourably in the establishment, occurrence and three-layer metrics this year. The new submission, CS3, outperforms DM1-DM3 in piece 1 on establishment measures (cf. Figures 12-14) , and in piece 2 on occurrence precision (cf. Figure 15).
+
Only one LSTM model was submitted to SymPoly (FC1), with results comparable to SymMono (see Figures 10-18, Table 1).
 
 
Coverage shows again that DM1 and DM3 find patterns in almost all notes of the ground truth pieces (Figure 21). CS3, NF1 and DM2 (which was optimized for precision metrics) show lower coverage, CS3 lowest overall. CS3 achieves overall highest values in lossless compression (Figure 22).
 
  
Runtimes were not evaluated this year, as the comparison of the machines of new submissions to runtimes from previous years would not have been very conclusive. The new submission, CS3, completed analysis of the ground truth pieces in a few minutes on a 2 GHz Intel Core i5 machine.
 
 
==Discussion==
 
The new compression evaluation measures are not highly correlated with the metrics measuring retrieval of annotated patterns. This may be caused by the fact that lossless compression is lower for algorithms which find overlapping patterns: human annotators, and also some pattern discovery algorithms, may find valid overlapping patterns, as patterns may be hierarchically layered (e.g., motifs which are part of themes). We will add new, prediction based measures, and new ground truth pieces to the task next year.
 
 
Berit Janssen, Iris Ren, Tom Collins, Anja Volk.
 
 
==Figures==
 
==Figures==
===SymMono===
+
===symMono===
 
+
====Explicit task: generate music given a prime====
[[File:2017_Mono_R_est.png|600px]]
 
 
 
'''Figure 1.''' Establishment recall averaged over each piece/movement. Establishment recall answers the following question. On average, how similar is the most similar algorithm-output pattern to a ground-truth pattern prototype?
 
 
 
[[File:2017_Mono_P_est.png|600px]]
 
 
 
'''Figure 2.''' Establishment precision averaged over each piece/movement. Establishment precision answers the following question. On average, how similar is the most similar ground-truth pattern prototype to an algorithm-output pattern?
 
 
 
[[File:2017_Mono_F1_est.png|600px]]
 
 
 
'''Figure 3.''' Establishment F1 averaged over each piece/movement. Establishment F1 is an average of establishment precision and establishment recall.
 
 
 
[[File:2017_Mono_R_occ_75.png|600px]]
 
 
 
'''Figure 4.''' Occurrence recall (<math>c = .75</math>) averaged over each piece/movement. Occurrence recall answers the following question. On average, how similar is the most similar set of algorithm-output pattern occurrences to a discovered ground-truth occurrence set?
 
  
[[File:2017_Mono_P_occ_75.png|600px]]
+
[[File:2018_mono_R_pitch.png|600px]]
  
'''Figure 5.''' Occurrence precision (<math>c = .75</math>) averaged over each piece/movement. Occurrence precision answers the following question. On average, how similar is the most similar discovered ground-truth occurrence set to a set of algorithm-output pattern occurrences?
+
'''Figure 1.''' Recall of generated pitches after cutoff point.
  
[[File:2017_Mono_F1_occ75.png|600px]]
+
[[File:2018_mono_P_pitch.png|600px]]
  
'''Figure 6.''' Occurrence F1 (<math>c = .75</math>) averaged over each piece/movement. Occurrence F1 is an average of occurrence precision and occurrence recall.
+
'''Figure 2.''' Precision of generated pitches after cutoff point.
  
[[File:2017_Mono_R3.png|600px]]
+
[[File:2018_mono_F1_pitch.png|600px]]
  
'''Figure 7.''' Three-layer recall averaged over each piece/movement. Rather than using <math>|P \cap Q|/\max\{|P|, |Q|\}</math> as a similarity measure (which is the default for establishment recall), three-layer recall uses <math>2|P \cap Q|/(|P| + |Q|)</math>, which is a kind of F1 measure.
+
'''Figure 3.''' F1 measure of generated pitches after cutoff point.
  
[[File:2017_Mono_P3.png|600px]]
+
[[File:2018_mono_R_ioi.png|600px]]
  
'''Figure 8.''' Three-layer precision averaged over each piece/movement. Rather than using <math>|P \cap Q|/\max\{|P|, |Q|\}</math> as a similarity measure (which is the default for establishment precision), three-layer precision uses <math>2|P \cap Q|/(|P| + |Q|)</math>, which is a kind of F1 measure.
+
'''Figure 4.''' Recall of generated inter-onset intervals after cutoff point.
  
[[File:2017_Mono_TLF1.png|600px]]
+
[[File:2018_mono_P_ioi.png|600px]]
  
'''Figure 9.''' Three-layer F1 (TLF) averaged over each piece/movement. TLF is an average of three-layer precision and three-layer recall.
+
'''Figure 5.''' Precision of generated inter-onset intervals after cutoff point.
  
[[File:Mono_Coverage.png|600px]]
+
[[File:2018_mono_F1_ioi.png|600px]]
  
'''Figure 10.''' Coverage of the discovered patterns of each piece/movement. Coverage measures the fraction of notes of a piece covered by discovered patterns.
+
'''Figure 6.''' F1 measure of generated inter-onset intervals after cutoff point.
  
[[File:2017_Mono_LC.png|600px]]
+
[[File:2018_mono_R_pairs.png|600px]]
  
'''Figure 11.''' Lossless compression achieved by representing each piece/movement in terms of patterns discovered by a given algorithm. Next to patterns and their repetitions, also the uncovered notes are represented, such that the complete piece could be reconstructed from the compressed representation.
+
'''Figure 7.''' Recall of generated pitch-ioi pairs after cutoff point.
  
===SymPoly===
+
[[File:2018_mono_P_pairs.png|600px]]
  
[[File:2017_Poly_R_est.png|600px]]
+
'''Figure 8.''' Precision of generated pitch-ioi pairs after cutoff point.
  
'''Figure 12.''' Establishment recall averaged over each piece/movement. Establishment recall answers the following question. On average, how similar is the most similar algorithm-output pattern to a ground-truth pattern prototype?
+
[[File:2018_mono_F1_pitch.png|600px]]
  
[[File:2017_Poly_P_est.png|600px]]
+
'''Figure 9.''' F1 measure of generated pitch-ioi pairs after cutoff point.
  
'''Figure 13.''' Establishment precision averaged over each piece/movement. Establishment precision answers the following question. On average, how similar is the most similar ground-truth pattern prototype to an algorithm-output pattern?
+
===Explicit Task: Polyphonic===
  
[[File:2017_Poly_F1_est.png|600px]]
+
[[File:2018_poly_R_pitch.png|600px]]
  
'''Figure 14.''' Establishment F1 averaged over each piece/movement. Establishment F1 is an average of establishment precision and establishment recall.
+
'''Figure 10.''' Recall of generated pitches after cutoff point.
  
[[File:2017_Poly_R_occ_75.png|600px]]
+
[[File:2018_poly_P_pitch.png|600px]]
  
'''Figure 15.''' Occurrence recall (<math>c = .75</math>) averaged over each piece/movement. Occurrence recall answers the following question. On average, how similar is the most similar set of algorithm-output pattern occurrences to a discovered ground-truth occurrence set?
+
'''Figure 11.''' Precision of generated pitches after cutoff point.
  
[[File:2017_Poly_P_occ_75.png|600px]]
+
[[File:2018_poly_F1_pitch.png|600px]]
  
'''Figure 16.''' Occurrence precision (<math>c = .75</math>) averaged over each piece/movement. Occurrence precision answers the following question. On average, how similar is the most similar discovered ground-truth occurrence set to a set of algorithm-output pattern occurrences?
+
'''Figure 12.''' F1 measure of generated pitches after cutoff point.
  
[[File:2017_Poly_F1_occ_75.png|600px]]
+
[[File:2018_poly_R_ioi.png|600px]]
  
'''Figure 17.''' Occurrence F1 (<math>c = .75</math>) averaged over each piece/movement. Occurrence F1 is an average of occurrence precision and occurrence recall.
+
'''Figure 13.''' Recall of generated inter-onset intervals after cutoff point.
  
[[File:2017_Poly_R3.png|600px]]
+
[[File:2018_poly_P_ioi.png|600px]]
  
'''Figure 18.''' Three-layer recall averaged over each piece/movement. Rather than using <math>|P \cap Q|/\max\{|P|, |Q|\}</math> as a similarity measure (which is the default for establishment recall), three-layer recall uses <math>2|P \cap Q|/(|P| + |Q|)</math>, which is a kind of F1 measure.
+
'''Figure 14.''' Precision of generated inter-onset intervals after cutoff point.
  
[[File:2017_Poly_P3.png|600px]]
+
[[File:2018_poly_F1_ioi.png|600px]]
  
'''Figure 19.''' Three-layer precision averaged over each piece/movement. Rather than using <math>|P \cap Q|/\max\{|P|, |Q|\}</math> as a similarity measure (which is the default for establishment precision), three-layer precision uses <math>2|P \cap Q|/(|P| + |Q|)</math>, which is a kind of F1 measure.
+
'''Figure 15.''' F1 measure of generated inter-onset intervals after cutoff point.
  
[[File:2017_Poly_TLF1.png|600px]]
+
[[File:2018_poly_R_pairs.png|600px]]
  
'''Figure 20.''' Three-layer F1 (TLF) averaged over each piece/movement. TLF is an average of three-layer precision and three-layer recall.
+
'''Figure 16.''' Recall of generated pitch-ioi pairs after cutoff point.
  
[[File:2017_Poly_Coverage.png|600px]]
+
[[File:2018_poly_P_pairs.png|600px]]
  
'''Figure 21.''' Coverage of the discovered patterns of each piece/movement. Coverage measures the fraction of notes of a piece covered by discovered patterns.
+
'''Figure 17.''' Precision of generated pitch-ioi pairs after cutoff point.
  
[[File:2017_Poly_LC.png|600px]]
+
[[File:2018_poly_F1_pairs.png|600px]]
  
'''Figure 22.''' Lossless compression achieved by representing each piece/movement in terms of patterns discovered by a given algorithm. Next to patterns and their repetitions, also the uncovered notes are represented, such that the complete piece could be reconstructed from the compressed representation.
+
'''Figure 18.''' F1 measure of generated pitch-ioi pairs after cutoff point.
  
 
==Tables==
 
==Tables==
===SymMono===
+
====Implicit task: decide which of two continuations is the true one, given the prime====
[https://www.dropbox.com/s/ajhvidekc4xl8vp/2017_drts_mono_patterns.csv?dl=0 Click to download SymMono pattern retrieval results table]
+
{| border="1" cellspacing="0" style="text-align: left; width: 280px;"
 
+
|-
[https://www.dropbox.com/s/ipt2j4jw0qmkvkh/2017_drts_mono_compression.csv?dl=0 Click to download SymMono compression results table]
+
! width="120" | Algorithm
 
+
! width="80" | Monophonic
===SymPoly===
+
! width="80" | Polyphonic
[https://www.dropbox.com/s/h8b731nlu8v1re0/2017_drts_poly_compression.csv?dl=0 Click to download SymPoly pattern retrieval results table]
+
      |-
 
+
      |-
[https://www.dropbox.com/s/2q0rj40szi2ybjn/2017_drts_poly_patterns.csv?dl=0 Click to download SymPoly compression results table]
+
      ! FC1
 +
      | 0.87 || 0.92
 +
      |-
 +
      |-
 +
      ! Sig. > chance
 +
      | 0.54 || 0.54
 +
      |-
 +
|}

Latest revision as of 04:13, 5 November 2019

Introduction

In brief:

We look for

(1) Algorithms that take an excerpt of music as input (the prime), and output a predicted continuation of the excerpt.

(2) Additionally or alternatively, algorithms that take a prime and one or more continuations as input, and output the likelihood that each continuation is the genuine extension of the prime.

In more detail: One facet of human nature comprises the tendency to form predictions about what will happen in the future (Huron, 2006). Music, consisting of complex temporally extended sequences, provides an excellent setting for the study of prediction, and this topic has received attention from fields including but not limited to psychology (Collins, Tillmann, et al., 2014; Janssen, Burgoyne and Honing, 2017; Schellenberg, 1997; Schmukler, 1989), neuroscience (Koelsch et al., 2005), music theory (Gjerdingen, 2007; Lerdahl & Jackendoff, 1983; Rohrmeier & Pearce, 2018), music informatics (Conklin & Witten, 1995; Cherla et al., 2013), and machine learning (Elmsley, Weyde, & Armstrong, 2017; Hadjeres, Pachet, & Nielsen, 2016; Gjerdingen, 1989; Roberts et al., 2018; Sturm et al., 2016). In particular, we are interested in the way exact and inexact repetition occurs over the short, medium, and long term in pieces of music (Margulis, 2014; Widmer, 2016), and how these repetitions may interact with "schematic, veridical, dynamic, and conscious" expectations (Huron, 2006) in order to form a basis for successful prediction.

We call for algorithms that may model such expectations so as to predict the next musical events based on given, foregoing events (the prime). We invite contributions from all fields mentioned above (not just pattern discovery researchers), as different approaches may be complementary in terms of predicting correct continuations of a musical excerpt. We would like to explore these various approaches to music prediction in a MIREX task. For subtask (1) above (see "In brief"), the development and test datasets will contain an excerpt of a piece up until a cut-off point, after which the algorithm is supposed to generate the next N musical events up until 10 quarter-note beats, and we will quantitatively evaluate the extent to which an algorithm's continuation corresponds to the genuine continuation of the piece. For subtask (2), in addition to containing a prime, the development and test datasets will also contain continuations of the prime, one of which will be genuine, and the algorithm should rate the likelihood that each continuation is the genuine extension of the prime, which again will be evaluated quantitatively.

What is the relationship between pattern discovery and prediction? The last five years have seen an increasing interest in algorithms that discover or generate patterned data, leveraging methods beyond typical (e.g., Markovian) limits (Collins & Laney, 2017; MIREX Discovery of Repeated Themes & Sections task; Janssen, van Kranenburg and Volk, 2017; Ren et al., 2017; Widmer, 2016). One of the observations to emerge from the above-mentioned MIREX pattern discovery task is that an algorithm that is "good" at discovering patterns ought to be extendable to make "good" predictions for what will happen next in a given music excerpt (Meredith, 2013). Furthermore, evaluating the ability to predict may provide a stronger (or at least complementary) evaluation of an algorithm's pattern discovery capabilities, compared to evaluating its output against expert-annotated patterns, where the notion of "ground truth" has been debated (Meredith, 2013).

Contribution

...

For a more detailed introduction to the task, please see 2018:Patterns for Prediction.

Datasets and Algorithms

The Patterns for Prediction Development Dataset (PPDD-Jul2018) has been prepared by processing a randomly selected subset of the Lakh MIDI Dataset (LMD, Raffel, 2016). It has audio and symbolic versions crossed with monophonic and polyphonic versions. The audio is generated from the symbolic representation, so it is not "expressive". The symbolic data is presented in CSV format. For example,

20,64,62,0.5,0
20.66667,65,63,0.25,0
21,67,64,0.5,0
...

would be the start of a prime where the first event had ontime 20 (measured in quarter-note beats -- equivalent to bar 6 beat 1 if the time signature were 4-4), MIDI note number (MNN) 64, estimated morphetic pitch number 62 (see p. 352 from Collins, 2011 for a diagrammatic explanation; for more details, see Meredith, 1999), duration 0.5 in quarter-note beats, and channel 0. Re-exports to MIDI are also provided, mainly for listening purposes. We also provide a descriptor file containing the original Lakh MIDI Dataset id, the BPM, time signature, and a key estimate. The audio dataset contains all these files, plus WAV files. Therefore, the audio and symbolic variants are identical to one another, apart from the presence of WAV files. All other variants are non-identical, although there may be some overlap, as they were all chosen from LMD originally.

The provenance of the Patterns for Prediction Test Dataset (PPTD) will not be disclosed, but it shares simiarlity with LMD and not from LMD, if you are concerned about overfitting.

There are small (100 pieces), medium (1,000 pieces), and large (10,000 pieces) variants of each dataset, to cater to different approaches to the task (e.g., a point-set pattern discovery algorithm developer may not want/need as many training examples as a neural network researcher). Each prime lasts approximately 35 sec (according to the BPM value in the original MIDI file) and each continuation covers the subsequent 10 quarter-note beats. We would have liked to provide longer primes (as 35 sec affords investigation of medium- but not really long-term structure), but we have to strike a compromise between ideal and tractable scenarios.

Submissions to the symMono and symPoly variants of the tasks are listed in Table 1. There were no submissions to the audMono or audPoly variants of the tasks this year. The task captains prepared a first-order Markov model (MM) over a state space of measure beat and key-centralized MIDI note number. This enabled evaluation of the implicit subtask, and can also serve as a point of comparison for the explicit task. It should be noted, however, that this model had access to the full song/piece – not just the prime – so it is at an advantage compared to EN1 and FC1 in the explicit task.

Sub code Submission name Abstract Contributors
Task Version symMono
EN1 Seq2SeqP4P PDF Eric Nichols
FC1 BachProp PDF Florian Colombo
MM1 First-order Markov model Task captains For purposes of comparison
Task Version symPoly
FC1 BachProp PDF Florian Colombo
MM1 First-order Markov model Task captains For purposes of comparison

Table 1. Algorithms submitted to Patterns for Prediction 2018. Seg2SegPvP and BachProp are models based on LSTM networks.

Results

We measure the performance of an algorithm to 1) predict a continuation, given a prime (explicit task), and 2) decide which of two versions is the true or foil continuation, given a prime (implicit task). To evaluate performance at the explicit task, we compare the true continuation to the generated continuation, and measure how many pitches and inter-onset intervals (with relation to the last onset of the prime) are correctly predicted at various time intervals after the last note of the prime. To evaluate performance at the implicit task, we measure accuracy as the number of correct decisions, divided by the total amount of decisions. (For mathematical definitions of the various metrics, please see 2018:Patterns_for_Prediction#Evaluation_Procedure.)

SymMono

For the implicit task, the two LSTM based models make their best predictions close to the cut-off point (i.e., last event of the prime). As the onset time after the cut-off point increases, the Markov Model outperforms the LSTM based models, with the exception of recall of inter-onset interval (cf. Figure 4), where FC1 consistently performs better than the Markov Model. Possibly in consequence of the poorer pitch performance, also fewer relevant pitch-ioi pairs were detected by the LSTM models as onset time after cutoff point increases.

For the explicit task, only FC1 was submitted. It outperforms chance level significantly, at 0.87, i.e., picking the correct continuation in almost 90% of the cases (See Table 1).

SymPoly

Only one LSTM model was submitted to SymPoly (FC1), with results comparable to SymMono (see Figures 10-18, Table 1).

Figures

symMono

Explicit task: generate music given a prime

2018 mono R pitch.png

Figure 1. Recall of generated pitches after cutoff point.

2018 mono P pitch.png

Figure 2. Precision of generated pitches after cutoff point.

2018 mono F1 pitch.png

Figure 3. F1 measure of generated pitches after cutoff point.

2018 mono R ioi.png

Figure 4. Recall of generated inter-onset intervals after cutoff point.

2018 mono P ioi.png

Figure 5. Precision of generated inter-onset intervals after cutoff point.

2018 mono F1 ioi.png

Figure 6. F1 measure of generated inter-onset intervals after cutoff point.

2018 mono R pairs.png

Figure 7. Recall of generated pitch-ioi pairs after cutoff point.

2018 mono P pairs.png

Figure 8. Precision of generated pitch-ioi pairs after cutoff point.

2018 mono F1 pitch.png

Figure 9. F1 measure of generated pitch-ioi pairs after cutoff point.

Explicit Task: Polyphonic

2018 poly R pitch.png

Figure 10. Recall of generated pitches after cutoff point.

2018 poly P pitch.png

Figure 11. Precision of generated pitches after cutoff point.

2018 poly F1 pitch.png

Figure 12. F1 measure of generated pitches after cutoff point.

2018 poly R ioi.png

Figure 13. Recall of generated inter-onset intervals after cutoff point.

2018 poly P ioi.png

Figure 14. Precision of generated inter-onset intervals after cutoff point.

2018 poly F1 ioi.png

Figure 15. F1 measure of generated inter-onset intervals after cutoff point.

2018 poly R pairs.png

Figure 16. Recall of generated pitch-ioi pairs after cutoff point.

2018 poly P pairs.png

Figure 17. Precision of generated pitch-ioi pairs after cutoff point.

2018 poly F1 pairs.png

Figure 18. F1 measure of generated pitch-ioi pairs after cutoff point.

Tables

Implicit task: decide which of two continuations is the true one, given the prime

Algorithm Monophonic Polyphonic
FC1 0.87 0.92
Sig. > chance 0.54 0.54