1 Introduction

The development of computer models of music expression has been an active field of research for over 30 years, with a wide range of approaches [9]. Most models proposed by the research community share the trait of being designed with the goal of improving computer performance. Our motivation, on the other hand, is to use smart technologies to improve the tools available for learning to play music and, in particular, to help musicians improve their expressive performance skills. In our envisioned scenario [12], a computer system powered by a meaningful expressive performance model could give musicians expressive directions during performance, or visual feedback on a recording, based on information extracted from a musical score. In an effort to enable such a scenario, this paper presents a machine learning model for predicting the dynamics of an ensemble based on high-level features extracted from a score-like symbolic representation of the musical piece. The proposed method focuses on modeling long-term dynamics variations, so that a musician can follow the modeled dynamics suggestions during performance.

1.1 Related Work

Several computer models of expression have been successful in generating convincing performances, particularly of classical piano pieces, as could be witnessed in the RENCON competition [8]. More recent are the automatic compositions produced in the context of project Magenta [7], but the nature of that model makes composition and performance inherently inseparable, whereas our learning scenario primarily requires producing performances of already composed and well-established pieces. An approach more closely related to our own is seen in the YQX system [16], which predicts timing, dynamics and articulation variations in classical piano pieces, and in [6], which describes a system for predicting ornamentations in jazz guitar melodies. In both cases, melodic lines play an important role, characterized by their Narmour Implication/Realization model classes [11]. In our case, the same type of information is presented to our machine-learning algorithm using a different representation based on pitch curve coefficients. We differ from both, however, by predicting phrase-level instead of note-level expression. Most applicable to our desired scenario are the models reported in [3], which, as in our case, are able to output predictions of expressive parameters based on score information; a notable difference is that those models use dynamics markings from the score as a starting point, whereas ours seeks to generate predictions without using any input indication of expression.

2 Materials and Methods

2.1 Materials

An adequate dataset for the intended model design required a wide variety of melodic themes in both audio and synchronized symbolic representation. Corpora of solo piano pieces such as MAESTRO [7] were not optimal for the problem, since we were interested in mapping the relationship between melody and harmony in the modeled features, and these elements tend to be fully blended in piano parts. The MusicNet dataset [14] provides audio-to-score synchronization as well as the necessary melodic diversity, and still allows a clear distinction between main melodic lines and harmonies thanks to the abundance of chamber music pieces with individual instrument parts; it was thus chosen for the task. To distinguish the main melody from the harmony, violin parts were treated as melody and all other instruments as harmony. Only the subset of pieces containing a violin was used, resulting in 122 pieces and a total of 874 min of recordings. To estimate the dynamics performed by the ensembles, the momentary loudness in windows of 0.1 s according to the EBU R128 standard [4] was computed with the help of the Essentia library [1].
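As a minimal sketch of this loudness extraction step, the snippet below uses Essentia's standard-mode LoudnessEBUR128 algorithm, which reports momentary loudness once per hop; the function name and file path are illustrative, not part of the original pipeline.

```python
# Sketch of the momentary-loudness extraction (EBU R128, 0.1 s windows).
# Assumes Essentia's Python bindings; helper name and path are illustrative.
import essentia.standard as es

def momentary_loudness(audio_path, hop=0.1):
    # AudioLoader returns stereo audio, which LoudnessEBUR128 expects.
    audio, sr, channels, _, _, _ = es.AudioLoader(filename=audio_path)()
    loudness = es.LoudnessEBUR128(hopSize=hop, sampleRate=sr)
    momentary, short_term, integrated, lra = loudness(audio)
    return momentary  # one loudness value per 0.1 s window
```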

2.2 Methods

The designed model consists of a feed-forward neural network trained to predict the dynamics curve of a musical motif, that is, a short phrase of roughly one or two bars. An important aspect of the modeling is that each training instance represents a motif rather than a single note. This design decision is motivated by two beliefs: first, that musicians plan and execute their expressive movements considering a horizon of a few notes rather than momentarily focusing on each one; and second, that in our music learning scenario, performance suggestions based on model outputs are best visualized and interpreted at that level of granularity. As a consequence of training the model on motifs, it is necessary to determine musically relevant motif boundaries in the pieces as well as appropriate features for this representation. Motif boundary detection is done by applying the LBDM algorithm [2] to estimate boundary probabilities and recursively dividing the piece until one of two conditions is met: either no boundary probability is two standard deviations larger than the rest, or the resulting segment has fewer than 10 notes. Table 1 summarizes the input features used for training. Piece keys and modes were estimated from pitch profiles as detailed in [13]. The output features of the model should represent the dynamics of the motif and its variation over time. We summarize that information by approximating the performance loudness curve extracted with Essentia by a parabola, fit using the least-squares method. This is consistent with the observation by Todd [10] and other researchers [5, 15] that dynamics variations tend to follow a quadratic profile. Given this approximation, the task of the neural network is to predict the three coefficients that define the dynamics curve.
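The sketch below illustrates the recursive segmentation and the parabola fit under stated assumptions: it presumes a precomputed array of LBDM boundary strengths (one per note), treats "two standard deviations larger than the rest" as a z-score test within the segment, and normalizes motif time to [0, 1] for the fit. Helper names and the exact stopping rule are assumptions, not the authors' code.

```python
# Sketch of motif segmentation (LBDM-based) and dynamics-curve fitting.
import numpy as np

def split_motifs(indices, strengths, min_notes=10):
    """Recursively split a run of note indices until no boundary stands out
    by two standard deviations or a resulting segment would be too short."""
    indices = np.asarray(indices)
    if len(indices) <= min_notes:
        return [indices]
    # Candidate boundaries: the LBDM strength before each note except the first.
    seg = strengths[indices[1:]]
    z = (seg - seg.mean()) / (seg.std() + 1e-9)
    if z.max() < 2.0:                      # no boundary 2 std devs above the rest
        return [indices]
    cut = int(np.argmax(z)) + 1
    if cut < min_notes or len(indices) - cut < min_notes:
        return [indices]
    return (split_motifs(indices[:cut], strengths, min_notes)
            + split_motifs(indices[cut:], strengths, min_notes))

def dynamics_coefficients(loudness_segment):
    """Least-squares parabola fit; the three coefficients are the regression
    targets of the neural network."""
    t = np.linspace(0.0, 1.0, num=len(loudness_segment))  # normalized motif time
    return np.polyfit(t, loudness_segment, deg=2)          # highest order first
```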

To facilitate the optimization task, some data conditioning was performed. Loudness measurements of each piece were normalized to zero mean and unit variance to eliminate differences caused by inconsistent recording conditions. Motifs with fewer than 4 notes and outliers (z-score above 10 in any feature) were discarded, all nominal features were converted to one-hot format, and all numeric features were standardized. The resulting dataset had around 10,000 instances, which were divided into training and test sets containing 90% and 10% of the instances, respectively.
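A minimal sketch of this conditioning is shown below. The paper does not name a preprocessing library, so pandas and scikit-learn are assumed here, and the column names (including the motif-length column) are hypothetical.

```python
# Preprocessing sketch: filtering, one-hot encoding, standardization, split.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def prepare(df, numeric_cols, nominal_cols, target_cols):
    # Discard very short motifs ("n_notes" is a hypothetical column name)
    # and outliers (|z| > 10 in any numeric feature).
    df = df[df["n_notes"] >= 4]
    z = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()
    df = df[(z.abs() <= 10).all(axis=1)]

    features = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("nom", OneHotEncoder(handle_unknown="ignore", sparse_output=False),
         nominal_cols),  # sparse_output requires scikit-learn >= 1.2
    ])
    X = features.fit_transform(df)
    y = df[target_cols].to_numpy()
    # 90% / 10% train-test split as described above.
    return train_test_split(X, y, test_size=0.1, random_state=0)
```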

The feed-forward network was implemented in the PyTorch framework and built with two hidden layers of 25 nodes each, using ReLU as the activation function and standard mean-squared error as the loss function. Training was run for 1800 epochs of stochastic gradient descent optimization with batches of 100 instances, a learning rate of 0.2 and a momentum of 0.1. The learning rate was decreased by a factor of 10 every 600 epochs. All parameters were cross-validated using a subdivision of the training set prior to the final training round.
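The following sketch reproduces the described architecture and training schedule in PyTorch; the input dimensionality, tensor handling and random shuffling are assumptions, since those details are not reported.

```python
# PyTorch sketch of the described network: 2 hidden layers of 25 ReLU units,
# MSE loss, SGD (lr=0.2, momentum=0.1), batches of 100, 1800 epochs,
# learning rate divided by 10 every 600 epochs.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def build_model(in_dim):
    return nn.Sequential(
        nn.Linear(in_dim, 25), nn.ReLU(),
        nn.Linear(25, 25), nn.ReLU(),
        nn.Linear(25, 3),   # the three parabola coefficients
    )

def train(X_train, y_train):
    model = build_model(X_train.shape[1])
    data = TensorDataset(torch.as_tensor(X_train, dtype=torch.float32),
                         torch.as_tensor(y_train, dtype=torch.float32))
    loader = DataLoader(data, batch_size=100, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=0.2, momentum=0.1)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=600, gamma=0.1)
    loss_fn = nn.MSELoss()
    for epoch in range(1800):
        for xb, yb in loader:
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()
        sched.step()
    return model
```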

Table 1. Input features of the model.

3 Results and Discussion

3.1 Results

Table 2 shows the obtained correlation coefficients for each of the dynamics coefficients predicted for all instances in the test set. Examples of the loudness curve, ground-truth quadratic approximation and predicted curve for three motifs can be seen in Fig. 1. Table 3 provides some perspective on the accuracy of the modeled dynamics by indicating root-mean-square errors for a deadpan prediction, for the ground-truth approximations, and for the model’s output.

Table 2. Correlation coefficients for output features.
Fig. 1. Comparison of loudness values measured in performance, their ideal (ground-truth) approximation, and model output for three motifs.

3.2 Discussion

The Pearson correlation coefficients obtained for the training set show that the model is sufficiently powerful to capture the complex relationships present in this scenario, but the lower correlation values seen in the test set indicate that some overfitting occurred and that the meaningful correlations detected in the data only partially explain the observed dynamics. The deadpan-level (\(E_d\)) and ground-truth-level (\(E_g\)) errors in Table 3 can be seen as lower and upper bounds on accuracy, indicating that this modeling approach offers a potential reduction in prediction error of up to \(E_d - E_g = 2.17\) dB. The 3.39 dB error obtained with our predictions implies an error reduction of \(E_d - 3.39 = 0.47\) dB compared to the deadpan baseline, which corresponds to \(0.47/2.17 = 21.65\%\) of the potential reduction. This is consistent with the correlation coefficient values and shows that the prediction of coefficients translates well into the prediction of dynamics levels.
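For clarity, the relative improvement quoted above can be written as

\[
\frac{E_d - E_m}{E_d - E_g} = \frac{0.47\ \text{dB}}{2.17\ \text{dB}} \approx 0.22,
\]

where \(E_m = 3.39\) dB denotes the RMS error of the model's predictions reported in Table 3.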

Table 3. RMS error of loudness-level predictions (dB).

The prediction examples highlighted in Fig. 1 illustrate some relevant conclusions. Most of the short-term variation in loudness levels occurs at note boundaries due to note articulation and, in terms of perceived dynamics, can be understood as noise. The quadratic approximation (labeled ground-truth) provides a cleaner and more intuitive visualization of the variation of loudness within a phrase, and in most individually inspected cases represents it quite well. The leftmost example is an exception: it shows a case in which the phrase boundaries chosen by the algorithm do not seem to match the performer's choice, hence the silence during the phrase and the poor results even for the proposed ground-truth approximation. In many observed cases, as shown in the middle and rightmost graphs, the predicted curve is robust, especially with respect to the quadratic (\(x^2\)) and linear (\(x\)) coefficients, since some variation in their predictions does not affect the character of the interpretation. It is reasonable to assume that, despite the difference between ground-truth and predicted values in such cases, performances executed according to instructions from the latter could be considered just as pleasing.

Improvements to the proposed approach currently under consideration include adding information related to the repetition of motifs, detecting key modulations or modal harmony in pieces, augmenting the training set with multiple different divisions of motifs per piece, and experimenting with different treatments of time-series data, such as training long short-term memory networks.