2018:Patterns for Prediction

From MIREX Wiki

Description

In brief: (1) Algorithms that take an excerpt of music as input (the prime), and output a predicted continuation of the excerpt.

(2) Additionally or alternatively, algorithms that take a prime and one or more continuations as input, and output the likelihood that each continuation is the genuine extension of the prime.

Your task captains are Iris Yuping Ren (yuping.ren.iris), Berit Janssen (berit.janssen), and Tom Collins (tomthecollins all at gmail.com). Feel free to copy in all three of us if you have questions/comments.

The submission deadline is August 25th. With the deadline being so close, we intend this task description and datasets provided below to help stimulate discourse that will lead to wide participation in 2019.

Relation to the pattern discovery task: The Patterns for Prediction task is an offshoot of the Discovery of Repeated Themes & Sections task (2013-2017). We hope to run the former (Patterns for Prediction) task and pause the latter (Discovery of Repeated Themes & Sections). In future years we may run both.

In more detail: One facet of human nature comprises the tendency to form predictions about what will happen in the future (Huron, 2006). Music, consisting of complex temporally extended sequences, provides an excellent setting for the study of prediction, and this topic has received attention from fields including but not limited to psychology (Collins, Tillmann, et al., 2014; Janssen, Burgoyne and Honing, 2017; Schellenberg, 1997; Schmukler, 1989), neuroscience (Koelsch et al., 2005), music theory (Gjerdingen, 2007; Lerdahl & Jackendoff, 1983; Rohrmeier & Pearce, 2018), music informatics (Conklin & Witten, 1995; Cherla et al., 2013), and machine learning (Elmsley, Weyde, & Armstrong, 2017; Hadjeres, Pachet, & Nielsen, 2016; Gjerdingen, 1989; Roberts et al., 2018; Sturm et al., 2016). In particular, we are interested in the way exact and inexact repetition occurs over the short, medium, and long term in pieces of music (Margulis, 2014; Widmer, 2016), and how these repetitions may interact with "schematic, veridical, dynamic, and conscious" expectations (Huron, 2006) in order to form a basis for successful prediction.

We call for algorithms that may model such expectations so as to predict the next musical events based on given, foregoing events (the prime). We invite contributions from all fields mentioned above (not just pattern discovery researchers), as different approaches may be complementary in terms of predicting correct continuations of a musical excerpt. We would like to explore these various approaches to music prediction in a MIREX task. For subtask (1) above (see "In brief"), the development and test datasets will contain an excerpt of a piece up until a cut-off point, after which the algorithm is supposed to generate the next N musical events up until 10 quarter-note beats, and we will quantitatively evaluate the extent to which an algorithm's continuation corresponds to the genuine continuation of the piece. For subtask (2), in addition to containing a prime, the development and test datasets will also contain continuations of the prime, one of which will be genuine, and the algorithm should rate the likelihood that each continuation is the genuine extension of the prime, which again will be evaluated quantitatively.

What is the relationship between pattern discovery and prediction? The last five years have seen an increasing interest in algorithms that discover or generate patterned data, leveraging methods beyond typical (e.g., Markovian) limits (Collins & Laney, 2017; MIREX Discovery of Repeated Themes & Sections task; Janssen, van Kranenburg and Volk, 2017; Ren et al., 2017; Widmer, 2016). One of the observations to emerge from the above-mentioned MIREX pattern discovery task is that an algorithm that is "good" at discovering patterns ought to be extendable to make "good" predictions for what will happen next in a given music excerpt (Meredith, 2013). Furthermore, evaluating the ability to predict may provide a stronger (or at least complementary) evaluation of an algorithm's pattern discovery capabilities, compared to evaluating its output against expert-annotated patterns, where the notion of "ground truth" has been debated (Meredith, 2013).

Data

The Patterns for Prediction Development Dataset (PPDD-Jul2018) has been prepared by processing a randomly selected subset of the Lakh MIDI Dataset (LMD, Raffel, 2016). It has audio and symbolic versions crossed with monophonic and polyphonic versions. The audio is generated from the symbolic representation, so it is not "expressive". The symbolic data is presented in CSV format. For example,

20,64,62,0.5,0
20.66667,65,63,0.25,0
21,67,64,0.5,0
...

would be the start of a prime where the first event had ontime 20 (measured in quarter-note beats -- equivalent to bar 6 beat 1 if the time signature were 4-4), MIDI note number (MNN) 64, estimated morphetic pitch number 62 (see p. 352 from Collins, 2011 for a diagrammatic explanation; for more details, see Meredith, 1999), duration 0.5 in quarter-note beats, and channel 0. Re-exports to MIDI are also provided, mainly for listening purposes. We also provide a descriptor file containing the original Lakh MIDI Dataset id, the BPM, time signature, and a key estimate. The audio dataset contains all these files, plus WAV files. Therefore, the audio and symbolic variants are identical to one another, apart from the presence of WAV files. All other variants are non-identical, although there may be some overlap, as they were all chosen from LMD originally.

The provenance of the Patterns for Prediction Test Dataset (PPTD) will not be disclosed, but it is not from LMD, if you are concerned about overfitting.

There are small (100 pieces), medium (1,000 pieces), and large (10,000 pieces) variants of each dataset, to cater to different approaches to the task (e.g., a point-set pattern discovery algorithm developer may not want/need as many training examples as a neural network researcher). Each prime lasts approximately 35 sec (according to the BPM value in the original MIDI file) and each continuation covers the subsequent 10 quarter-note beats. We would have liked to provide longer primes (as 35 sec affords investigation of medium- but not really long-term structure), but we have to strike a compromise between ideal and tractable scenarios.

Here are the PPDD-Jul2018 variants for download:

("Large" datasets were compressed using the p7zip package, installed via "brew install p7zip on Mac".)

Some examples

This prime finishes with two G’s followed by a D above. Looking at the piano roll or listening to the linked file, we can see/hear that this pitch pattern, in the exact same rhythm, has happened before (see bars 17-18 transition in the piano roll). Therefore, we and/or an algorithm, might predict that the first note of the continuation will follow the pattern established in the previous occurrence, returning to G 1.5 beats later.

This is another example where a previous occurrence of a pattern might help predict the contents of the continuation. Not all excerpts contain patterns (in fact, one of the motivations for running the task is to interrogate the idea that patterns are abundant in music and always informative in terms of predicting what comes next). This one, for instance, does not seem to contain many clues for what will come next. And finally, this one might not contain any obvious patterns, but other strategies (such as schematic or tonal expectations) might be recruited in order to predict the contents of the continuation.

Preparation of the data

Preparation of the monophonic datasets was more involved than the polyphonic datasets: for both, we imported each MIDI file, quantised it using a subset of the Farey sequence of order 6 (Collins, Krebs, et al., 2014), and then excerpted a prime and continuation at a randomly selected time. For the monophonic datasets, we filtered for:

  • channels that contained at least 20 events in the prime;
  • channels that were at least 80% monophonic at the outset, meaning that at least 80% of their segments (Pardo & Birmingham, 2002) contained no more than one event;
  • channels where the maximum inter-ontime interval in the prime was no more than 8 quarter-note beats.
  • we then "skylined" these channels (independently) so that no two events had the same start time (maximum MNN chosen in event of a clash), and double-checked that they still contained at least 20 events;
  • one suitable channel was then selected at random, and the prime appears in the dataset if the continuation contained at least 10 events.

If any of the above could not be satisfied for the given input, we skipped this MIDI file.

For the polyphonic data, we applied the minimum note criteria of 20 in the prime and 10 in the continuation, as well as the prime maximum inter-ontime interval of 8, but it was not necessary to measure monophony or perform skylining.

Audio files were generated by importing the corresponding CSV and descriptor files and using a sample bank of piano notes from the Google Magenta NSynth dataset (Engel et al., 2017) to construct and export the waveform.

The foil continuations were generated using a Markov model of order 1 over the whole texture (polyphonic) or channel (monophonic) in question, and there was no attempt to nest this generation process in any other process cognisant of repetitive or phrasal structure. See Collins and Laney (2017) for details of the state space and transition matrix.

Submission Format

In terms of input representations, we will evaluate 4 largely independent versions of the task: audio, monophonic; audio, polyphonic; symbolic, monophonic; symbolic, polyphonic. Participants may submit algorithms to 1 or more of these versions, and should list these versions clearly in their readme. Irrespective of input representation, all output for subtask (1) should be in "ontime", "MNN" CSV files. The CSV may contain other information, but "ontime" and "MNN" should be in the first two columns, respectively. All output for subtask (2) should be an indication whether of the two presented continuations, "A" or "B" is judged by the algorithm to be genuine. This should be one CSV file for an entire dataset, with first column "id" referring to the file name of a prime-continuation pair, second column "A" containing a likelihood value in [0, 1] for the genuineness of the continuation in folder A, and column "B" similarly for the continuation in folder B.

All submissions should be statically linked to all dependencies and include a README file including the following information:

  • input representation(s), should be 1 or more of "audio, monophonic"; "audio, polyphonic"; "symbolic, monophonic"; "symbolic, polyphonic";
  • subtasks you would like your algorithm to be evaluated on, should be "1", "2", or "1 and 2" (see first sentences of 2018:Patterns_for_Prediction#Description for a reminder);
  • command line calling format for all executables and an example formatted set of commands;
  • number of threads/cores used or whether this should be specified on the command line;
  • expected memory footprint;
  • expected runtime;
  • any required environments and versions, e.g. Python, Java, Bash, MATLAB.

Example Command Line Calling Format

Python:

python <your_script_name.py> -i <input_folder> -o <output_folder>

Evaluation Procedure

In brief: For subtask (1), we match the algorithmic output with the original continuation and compute a match score (see implementation at GitHub). For subtask (2), we count up how many times an algorithm judged the genuine continuation as most likely.

The input excerpt ends with a final note event: <math>(x_0, y_0, z_0)</math>, where <math>x_0</math> is ontime (start time measured in quarter-note beats starting with 0 for bar 1 beat 1), <math>y_0</math> is MNN, and <math>z_0</math> is duration (also measured in quarter-note beats).

The algorithm predicts the continuations: <math>(\hat{x}_1, \hat{y}_1, \hat{z}_1)</math>, <math>(\hat{x}_2, \hat{y}_2, \hat{z}_2)</math>, ..., <math>(\hat{x}_{n^\prime}, \hat{y}_{n^\prime}, \hat{z}_{n^\prime})</math>, where <math>\hat{x}_i</math> are predicted ontimes, <math>\hat{y}_i</math> are predicted MNNs, and <math>\hat{z}_i</math> are predicted durations. The true continuations are notated <math>(x_1, y_1, z_1), (x_2, y_2, z_2),..., (x_n, y_n, z_n)</math>. The predicted continuation ontimes are strictly increasing, that is <math>x_0 < \hat{x}_1 < \cdots < \hat{x}_{n^\prime}</math>, and so are the true continuation ontimes, that is <math>x_0 < x_1 < \cdots < x_n</math>.

IOI

This stands for inter-ontime interval 1. It evaluates whether the algorithm's prediction for the time between the excerpt ending (x_0) and the continuation beginning (x_1) is correct. The metric IOI takes the value 1 if <math>\hat{x}_1 = x_1</math>, and takes the value 0 otherwise.

Pitch

This metric evaluates whether the algorithm's prediction (<math>\hat{y}_1</math>) for the continuation's first MNN (<math>y_1</math>) is correct.

IOI_4

Let <math>P = \{x_1,\ldots, x_n\}</math> be the set of true continuation in the first four beats following the end of the excerpt, and <math>Q = \{\hat{x}_1,\ldots, \hat{x}_{n^\prime}\}</math> be the corresponding set predicted by an algorithm. Then the precision of the algorithm is <math>\mathrm{Prec}(P, Q) = |P \cap Q|/|Q|</math>, the recall of the algorithm is <math>\mathrm{Rec}(P, Q) = |P \cap Q|/|P|</math>, and IOI_4 is defined as the typical F1 ratio of precision and recall, IOI_4 = 2*Prec(P, Q)*Rec(P, Q)/(Prec(P, Q) + Rec(P, Q)). These intersections will probably be calculated "up to translation", meaning that a correct but time- or pitch-shifted solution would not be punished.

IOI_10

...is defined in exactly the same way as IOI_4, but for ten beats (or 2.5 measures in 4-4 time) following the end of the prime.

Pitch_4 and Pitch_10

...are defined in the same ways as IOI_4 and IOI_10 respectively, but applied to the MNN sets <math>P = \{y_1,\ldots, y_n\}</math> and <math>Q = \{\hat{y}_1,\ldots, \hat{y}_{n^\prime}\}</math>. (Strictly speaking these may contain repeated elements, so the unique elements would be determined before calculating Prec, Rec, and F1.)

Combo_4 and Combo_10

In addition to evaluating rhythmic and pitch capacities independently, the metrics Combo_4 and Combo_10 capture the joint ioi-pitch predictive capabilities of algorithms, by applying the above definitions to the sets <math>P = \{(x_1, y_1),\ldots, (x_n, y_n)\}</math> and <math>Q = \{(\hat{x}_1, \hat{y}_1),\ldots, (\hat{x}_{n^\prime}, \hat{y}_{n^\prime})\}</math>.

Polyphonic Version

The polyphonic version of the task will be evaluated in the same way as the monophonic version of the task. Only the Pitch metric needs to change, because the true continuation's first event may consist of several MNNs, <math>P = \{y_{1,1},\ldots, y_{1,m}\}</math>, as may the algorithm's prediction, <math>Q = \{\hat{y}_1,\ldots, \hat{y}_{n^\prime}\}</math>. We will apply the concepts of precision, recall, and F1 to <math>P</math> and <math>Q</math> here, as above. While the above definitions have focused on the first predicted events and events in time windows of 4 and 10 quarter-note beats in length, we will probably also produce graphs with a sliding time window length, to more accurately pinpoint changes in performance.

Entropy

Some existing work in this area (e.g., Conklin & Witten, 1995; Pearce & Wiggins, 2006; Temperley, 2007) evaluates algorithm performance in terms of entropy. If we have time to collect human listeners' judgments of likely (or not) continuations for given excerpts, then we will be in a position to compare the entropy of listener-generated distributions with the corresponding algorithm distributions. This would open up the possibility of entropy-based metrics, but we consider this of secondary importance to the metrics outlined above.

Questions (Q), Answers (A), and Comments (C)

Q. Instead of evaluating continuations, have you considered evaluating an algorithm's ability to predict content between two timepoints, or before a timepoint?

A. Yes we considered including this also, but opted not to for sake of simplicity. Furthermore, these alternatives do not have the same intuitive appeal as predicting future events.

Q. Why do some files sound like they contain a drum track rendered on piano?

A. Some of the MIDI files import as a single channel, but upon listening to them it is evident that they contain multiple instruments. For the sake of simplicity, we removed percussion channels where possible, but if everything was squashed down into a single channel, there was not much we could do.

C. to_the_sun--at--gmx.com writes: "This is exactly what I'm interested in! I have an open-source project called The Amanuensis (https://github.com/to-the-sun/amanuensis) that uses an algorithm to predict where in the future beats are likely to fall.

"Amanuensis constructs a cohesive song structure, using the best of what you give it, looping around you and growing in real-time as you play. All you have to do is jam and fully written songs will flow out behind you wherever you go.

"My algorithm right now is only rhythm-based and I'm sure it's not sophisticated enough to be entered into your contest, but I would be very interested in the possibility of using any of the algorithms that are, in place of mine in The Amanuensis. Would any of your participants be interested in some collaboration? What I can bring to the table would be a real-world application for these algorithms, already set for implementation."

Q. I'm interested in performing this task on the symbolic dataset, but I don't have an audio-based algorithm. It was unclear to me if the inputs are audio, symbolic, both, or either.

A. We have clarified, at the top of 2018:Patterns_for_Prediction#Submission_Format, that submissions in 1-4 representational categories are acceptable. It's also OK, say, for an audio-based algorithm to make use of the descriptor file in order to determine beat locations. (You could do this by looking at the <math>u = \mathrm{bpm}</math> value, and then you would know that the main beats in the WAV file are at <math>0, 60/u, 2 \cdot 60/u,\ldots</math> sec.)

Time and Hardware Limits

A total runtime limit of 72 hours will be imposed on each submission.

Seeking Contributions

  • We would like to evaluate against real (not just synthesized-from-MIDI) audio versions. If you have a good idea of how we might make this available to participants, let us know. We would be happy to acknowledge individuals and/or companies for helping out in this regard.
  • More suggestions/comments/ideas on the task is always welcome!

Acknowledgments

Thank you to Anja Volk, Darrell Conklin, Srikanth Cherla, David Meredith, Matevz Pesek, and Gissel Velarde for discussions!

References

  • Cherla, S., Weyde, T., Garcez, A., and Pearce, M. (2013). A distributed model for multiple-viewpoint melodic prediction. In In Proceedings of the International Society for Music Information Retrieval Conference (pp. 15-20). Curitiba, Brazil.
  • Conklin, D., and Witten, I. H. (1995). Multiple viewpoint systems for music prediction. Journal of New Music Research, 24(1), 51-73.
  • Elmsley, A., Weyde, T., & Armstrong, N. (2017). Generating time: Rhythmic perception, prediction and production with recurrent neural networks. Journal of Creative Music Systems, 1(2).
  • Engel, J., Resnick, C., Roberts, A., Dieleman, S., Eck, D., Simonyan, K., & Norouzi, M. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. https://arxiv.org/abs/1704.01279
  • Gjerdingen, R. O. (1989). Using connectionist models to explore complex musical patterns. Computer Music Journal, 13(3), 67-75.
  • Gjerdingen, R. (2007). Music in the galant style. New York, NY: Oxford University Press.
  • Hadjeres, G., Pachet, F., & Nielsen, F. (2016). Deepbach: A steerable model for Bach chorales generation. arXiv preprint arXiv:1612.01010.
  • Huron, D. (2006). Sweet Anticipation. Music and the Psychology of Expectation. Cambridge, MA: MIT Press.
  • Janssen, B., Burgoyne, J. A., & Honing, H. (2017). Predicting variation of folk songs: A corpus analysis study on the memorability of melodies. Frontiers in Psychology, 8, 621.
  • Janssen, B., van Kranenburg, P., & Volk, A. (2017). Finding occurrences of melodic segments in folk songs employing symbolic similarity measures. Journal of New Music Research, 46(2), 118-134.
  • Koelsch, S., Gunter, T. C., Wittfoth, M., & Sammler, D. (2005). Interaction between syntax processing in language and in music: an ERP study. Journal of Cognitive Neuroscience, 17(10), 1565-1577.
  • Lerdahl, F., and Jackendoff, R. (1983). "A generative theory of tonal music. Cambridge, MA: MIT Press.
  • Margulis, E. H. (2014). On repeat: How music plays the mind. New York, NY: Oxford University Press.
  • Meredith, D. (1999). The computational representation of octave equivalence in the Western staff notation system. In Proceedings of the Cambridge Music Processing Colloquium. Cambridge, UK.
  • Meredith, D. (2013). COSIATEC and SIATECCompress: Pattern discovery by geometric compression. In Proceedings of the 10th Annual Music Information Retrieval Evaluation eXchange (MIREX'13). Curitiba, Brazil.
  • Pardo, B., & Birmingham, W. P. (2002). Algorithms for chordal analysis. Computer Music Journal, 26(2), 27-49.
  • Pearce, M. T., & Wiggins, G. A. (2006). Melody: The influence of context and learning. Music Perception, 23(5), 377–405.
  • Raffel, C. (2016). "Learning-based methods for comparing sequences, with applications to audio-to-MIDI alignment and matching". PhD Thesis.
  • Ren, I.Y., Koops, H.V, Volk, A., Swierstra, W. (2017). In search of the consensus among musical pattern discovery algorithms. In Proceedings of the International Society for Music Information Retrieval Conference (pp. 671-678). Suzhou, China.
  • Roberts, A., Engel, J., Raffel, C., Hawthorne, C., & Eck, D. (2018). A hierarchical latent vector model for learning long-term structure in music. In Proceedings of the International Conference on Machine Learning (pp. 4361-4370). Stockholm, Sweden.
  • Rohrmeier, M., & Pearce, M. (2018). Musical syntax I: theoretical perspectives. In Springer Handbook of Systematic Musicology (pp. 473-486). Berlin, Germany: Springer.
  • Schellenberg, E. G. (1997). Simplifying the implication-realization model of melodic expectancy. Music Perception, 14(3), 295-318.
  • Schmuckler, M. A. (1989). Expectation in music: Investigation of melodic and harmonic processes. Music Perception, 7(2), 109-149.
  • Sturm, B.L., Santos, J.F., Ben-Tal., O., & Korshunova, I. (2016), Music transcription modelling and composition using deep learning. In Proceedings of the International Conference on Computer Simulation of Musical Creativity. Huddersfield, UK.
  • Temperley, D. (2007). Music and probability. Cambridge, MA: MIT Press.
  • Widmer, G. (2017). Getting closer to the essence of music: The con espressione manifesto. ACM Transactions on Intelligent Systems and Technology (TIST), 8(2), 19.