1 Introduction

Musicians introduce deviations from the score when performing a musical piece in order to achieve a particular expressive intention. Computational expressive music performance modelling (CEMPM) aims to characterise such deviations using computational techniques (e.g. machine learning). In this context, CEMPM formulates hypotheses about the expressive devices musicians use when performing (consciously or unconsciously), which can be empirically verified on measured performance data. Empirical models are often obtained from the quantitative analysis of musical performances, based on measurements of timing, dynamics, and articulation (e.g. Shaffer et al. 1985; Clarke 1985; Gabrielsson 1987; Palmer 1996a; Repp 1999; Goebl 2001, to name a few). A state-of-the-art review is presented in Gabrielsson (2003). Computational models have been implemented as rule-based models (Friberg et al. 2000; the KTH model), mathematical models (Todd 1992), and structure-level models (Mazzola 2002).

Machine learning techniques have been used to predict performance variations in timing, articulation, and energy (e.g. Widmer 2002), as well as to model concrete expressive intentions (e.g. mood, musical style, performer). Most of the literature focuses on classical piano music (e.g. Widmer 2002), where the piano keys work as ON/OFF switching devices (e.g. MIDI pianos), which simplifies data acquisition, i.e. the conversion of performance data into machine-readable data. Some exceptions can be found in jazz saxophone music, where case-based reasoning (Arcos et al. 1998) and inductive logic programming (Ramirez et al. 2011) have been used. Jazz guitar expressive performance modelling has been studied by Giraldo and Ramirez (2016a, b), with special emphasis on melodic ornamentation.

However, few studies have addressed the classical guitar, in particular the intrinsic variations performers introduce when no specific expressive intentions are prescribed. In this study, we present a machine learning approach in which CEMPM techniques are applied to study the expressive variations that nine different guitarists introduce when performing the same musical piece, for which no performance indications are provided. We study the correlations among the performers' variations in timing and energy. We extract features from the score to obtain a predictive model for each musician, and then cross-validate the models among performers.

2 Materials and Methods

For this study we obtained recordings of nine professional guitarists performing the same musical piece. The piece was written for classical guitar and was composed specifically for this study. The musicians did not know the piece beforehand, and no particular expressive or performance indications were provided (either written or verbal). The performers were thus free to introduce expressive variations according to their own taste and criteria. Musicians were also allowed to practise the piece for as long as they wanted before recording, until they were satisfied with their interpretation. The recordings took place at different studios/institutions and were collected by the Department of Music of the Faculty of Arts of the University of Quebec in Montreal (UQAM), Canada.

2.1 Framework

The general framework of the project is depicted in Fig. 1.

Fig. 1. Framework and data processing flow.

Data Processing. The musical score was created in MusicXML format, from which we obtained machine-readable (MIDI-type) information for each note, i.e. its onset (in seconds), duration (in seconds), pitch, and velocity (which refers to loudness). We used the score as the deadpan performance (i.e. a robotic, inexpressive performance).
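A minimal sketch of this step is given below, using the music21 library; the file name, the fixed 120 BPM tempo, and the flat velocity are illustrative assumptions, not the study's exact setup.

```python
# Sketch: extract MIDI-type note data from the MusicXML score.
from music21 import converter

BPM = 120                        # assumed constant score tempo; a real tempo map would replace this
sec_per_quarter = 60.0 / BPM

score = converter.parse("score.xml")   # hypothetical file name
notes = []
for n in score.flatten().getElementsByClass("Note"):
    notes.append({
        "onset_s": float(n.offset) * sec_per_quarter,        # score onset in seconds
        "duration_s": float(n.quarterLength) * sec_per_quarter,
        "pitch": n.pitch.midi,                               # MIDI pitch number
        "velocity": 64,                                      # flat dynamics = deadpan performance
    })
```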

In a second stage we obtained machine-readable data of the performance in MIDI-type format. This process was performed in a semi-automatic fashion, using score-informed Non-negative Matrix Factorisation (NMF). The NMF method decomposes an input spectrogram \(X \in \mathbb{R}^{K \times N}\), with K frequency bins and N frames, as:

$$\begin{aligned} X = WH \end{aligned}$$
(1)

where \(W \in \mathbb{R}^{K \times R}\) contains the spectral bases for each of the R pitches and \(H \in \mathbb{R}^{R \times N}\) is the pitch activity matrix across time. The number R of pitches and the initial weights of the W and H matrices were initialised based on the information in the score (for an overview see Clarke 1985). Later, manual correction was performed over the spectrum. Finally, energy information (i.e. velocity) was obtained from the RMS value, calculated over the audio waveform between the obtained note boundaries.
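As an illustration, the sketch below implements score-informed NMF with the standard Euclidean multiplicative updates (Lee and Seung); the exact update rules and initialisation used in the study may differ.

```python
import numpy as np

def score_informed_nmf(X, W0, H0, n_iter=200, eps=1e-10):
    """Factorise a magnitude spectrogram X (K x N) as X ~= WH (Eq. 1).

    W0 (K x R): initial spectral bases, e.g. harmonic templates for each
    of the R pitches in the score. H0 (R x N): initial pitch activations,
    set to zero wherever the score says a pitch cannot sound; zeros stay
    zero under multiplicative updates, which is what makes the
    factorisation score-informed."""
    W, H = W0.astype(float).copy(), H0.astype(float).copy()
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update pitch activations
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update spectral bases
    return W, H
```

Note boundaries can then be read off the rows of H (e.g. by thresholding each pitch activation) and refined by the manual correction step described above.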

Similarly, we performed automatic beat extraction (Zapata et al. 2014), followed by manual correction to obtain the beat information (in seconds) over the audio signal.
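The study uses the multi-feature beat tracker of Zapata et al. (2014); as an illustration of the input and output of this step only, the sketch below substitutes librosa's default beat tracker (the file name is an assumption).

```python
import librosa

# Sketch: automatic beat extraction, to be followed by manual correction.
y, sr = librosa.load("performance.wav")
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)  # beat positions in seconds
```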

Data-set Creation. Feature extraction from the score was performed by extracting local information about each note (e.g. pitch, duration) as well as the context in which the note occurs (e.g. previous/next interval, metrical strength, harmonic/melodic analysis); for an overview see Giraldo and Ramirez (2016a, b). A total of 27 descriptors were extracted for each note. Later, the deviations in tempo, measured in Beats Per Minute (BPM) and Inter Onset Interval (IOI), were calculated for each note/performer as the difference between the theoretical BPM/IOI values in the score and the corresponding values in the performance. Finally, we obtained data-sets for each of the nine performers and for each of the three performance deviations considered (i.e. energy, BPM, and IOI deviations). A total of 27 data-sets were obtained, where each instance consists of the feature set extracted for a note, and the considered deviation is the value to be predicted.
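The sketch below shows one plausible way to compute the per-note timing deviations; the function name is ours and the exact definitions used in the study may differ (the percentage form follows Fig. 2).

```python
import numpy as np

def timing_deviations(score_onsets, perf_onsets, score_bpm):
    """Per-note IOI and local-BPM deviations between score and performance.

    Both inputs are onset times in seconds for the same note sequence,
    aligned note-by-note."""
    ioi_score = np.diff(score_onsets)
    ioi_perf = np.diff(perf_onsets)
    ioi_dev = ioi_perf - ioi_score                        # IOI deviation in seconds
    bpm_perf = score_bpm * ioi_score / ioi_perf           # local tempo implied by each performed IOI
    bpm_dev = 100.0 * (bpm_perf - score_bpm) / score_bpm  # % BPM deviation, as plotted in Fig. 2
    return ioi_dev, bpm_dev
```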

Machine Learning Modelling. Each of the nine performer data-sets was used as both train and test set in an all-vs-all cross-validation fashion. This consisted of obtaining a predictive model for each performer (i.e. all performer data-sets were used as train sets), applying each model to all the performers (i.e. all performer data-sets were used as test sets), and finally obtaining a model evaluation for each train/test pair.
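A sketch of the all-vs-all scheme is given below, using the correlation-based evaluation described in the next paragraph. scikit-learn's MLPRegressor stands in for the one-hidden-layer ANN (the hidden-layer size is an assumption), and X[i], y[i] are the prepared feature matrix and deviation targets for performer i.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def all_vs_all(X, y, n_performers=9):
    """Train a model on each performer's data-set and evaluate it on every
    performer's data-set, returning the matrix of correlation coefficients
    (rows: train performer, columns: test performer) behind Fig. 3."""
    cc = np.zeros((n_performers, n_performers))
    for i in range(n_performers):
        model = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000)  # one hidden layer; size assumed
        model.fit(X[i], y[i])
        for j in range(n_performers):
            pred = model.predict(X[j])
            cc[i, j] = np.corrcoef(pred, y[j])[0, 1]  # CC between predicted and actual deviations
    return cc
```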

Evaluation. A preliminary evaluation consisted of obtaining the correlations of the actual per-note deviations among all performers. Later, at the machine learning stage, the performance of the predictive models was assessed by computing the Correlation Coefficient (CC) between the values predicted by the model and the actual values in the test set. The algorithms considered were Support Vector Regression (SVR, with radial kernel), Regression Trees (RT, with pruning), and Artificial Neural Networks (ANN, fully connected with one hidden layer); CCs from preliminary tests are presented in Table 1. Given that the ANN outperformed the other algorithms in predicting all three expressive deviations considered, in this paper we report the CCs obtained with the ANN.

Table 1. Mean Correlation Coefficient (CC) comparison among models.
Fig. 2. BPM percentage of deviation among nine performers for each consecutive note.

3 Results

Figure 2 shows the measured deviations, as a percentage of BPM, for each consecutive note and each performer. A correspondence of peaks and valleys (with different amplitudes/degrees of deviation) can be noticed among performers. Figure 3 presents the scaled graph of the correlation coefficients obtained using ANNs. The numbers on the vertical axis indicate the performer data-sets (numbered from 1 to 9) used as train sets, whereas the horizontal axis represents the performer data-sets used as test sets. At each intersection, the colour map represents the correlation coefficient obtained using that pair of train/test data-sets. As expected, the diagonal shows higher correlations, representing the performance on the train set (i.e. train and test set are the same performer). Higher correlations can be found for the BPM and IOI deviations. Also, a similar pattern can be observed in the CCs obtained among performers. This might indicate that the majority of the performers introduce similar timing variations based on the information provided by the score. This tendency can be observed in Fig. 2 as well (e.g. in the ritardando introduced by most performers at the end of the piece). In contrast, lower correlations were obtained for the energy deviation models, which might indicate that decisions about the loudness of a note are less consistent among performers. However, other external factors, such as different recording conditions (e.g. the use of a different guitar, or recordings being made at different studios), might bias this result.

Fig. 3. Scaled graph of the correlations obtained for each performer model, for each of the three expressive deviations considered (from left to right: BPM, IOI, and energy). The vertical axis corresponds to the performer data used as train set (from 1 to 9), and the horizontal axis corresponds to the performer data used as test set (from 1 to 9).

4 Conclusion

In this paper we have presented a machine learning approach, based on computational modelling of expressive music performance, to study the correlations among the intrinsic expressive deviations that musicians introduce when performing a musical piece. We obtained recordings of the same musical piece by nine professional guitarists, in which no expressive indications were given and the performers freely chose the expressive actions to perform. We extracted descriptors from the score, and measured the deviations introduced by each performer in terms of BPM, IOI, and energy. We obtained machine learning models using ANNs for each performer, and cross-validated the models among interpreters based on the CC. Preliminary results indicate that performers take similar actions in terms of timing deviations, whereas lower correlations were obtained for energy deviations.