1 Introduction

Linear discriminant analysis (LDA) is a popular strategy still widely used (Graf et al., 2024; McLaughlin and Su, 2023; Zhu et al., 2022) for classification and dimensionality reduction. The origins trace back to Fisher’s discriminant (Fisher, 1936) and its multiclass extension (Rao, 1948). The main assumption of LDA is that samples of class \(i=1,2,\dots ,K\) follow a multivariate normal distribution of mean \(\mu _{i} \in \mathbb {R}^d\) and shared covariance matrix \(\Sigma \in \mathbb {R}^{d\times d}\) of full rank, denoted \(\mathcal {N}_{d}(\mu _{i}, \Sigma )\). Then the Bayes optimal rule for classification is applied to predict the class \(i(\varvec{x})\) of a new sample \(\varvec{x} \in \mathbb {R}^d\)

$$\begin{aligned} i(\varvec{x})&\triangleq \mathop {\mathrm {arg\,max}}\limits _{i \in \{1,2,\dots ,K\}} \mathbb {P}(C=i | X=\varvec{x})\;. \end{aligned}$$

By Bayes' theorem, \(\mathbb {P}(C=i | X=\varvec{x}) \propto \mathcal {N}_{d}(\varvec{x}; \mu _{i}, \Sigma )\pi _i\), where \(\pi _i\) is the prior probability of class i. Taking logarithms, reordering, and ignoring terms that are constant across classes, one obtains a decision function which is linear in \(\varvec{x}\)

$$\begin{aligned} i(\varvec{x})&= \mathop {\mathrm {arg\,max}}\limits _{i \in \{1,2,\dots ,K\}} (\mu _{i}^\top \Sigma ^{-1})\varvec{x}-\frac{1}{2}\mu _{i}^\top \Sigma ^{-1}\mu _{i}+\log \pi _i\;. \end{aligned}$$
(1)

Parameters \((\mu _{i})_{i \le K}\), \(\Sigma\), \((\pi _i)_{i \le K}\) can be inferred by maximizing the joint likelihood on a set of training samples.
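For illustration, a minimal sketch of the decision rule (1), assuming the parameters have already been estimated from training data (the helper lda_predict is hypothetical and not part of any library):

```python
# Minimal sketch of the LDA decision rule in Eq. (1), with estimated parameters.
import numpy as np

def lda_predict(x, means, Sigma, priors):
    """x: d-dimensional sample; means: list of K class means mu_i;
    Sigma: shared d x d covariance; priors: list of K prior probabilities pi_i."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = [
        mu @ Sigma_inv @ x - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)
        for mu, pi in zip(means, priors)
    ]
    return int(np.argmax(scores))  # predicted class index (0-based)
```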

The advent of functional linear discriminant analysis (FLDA) (James and Hastie, 2001) makes it possible to handle cases where \(\varvec{x}\) is not simply a \(d\)–dimensional vector, but the output of a time–dependent multivariate function f associated with sample s: \(\varvec{x}_s=f_s(t) \in \mathbb {R}^{d}\). Note that FLDA is restricted to the univariate case, i.e.  when \(d=1\). In practice, one has of course no access to the true function \(f_s: \mathbb {N}^* \rightarrow \mathbb {R}^{d}\), but only to some of its values \(f_s(t^s_1), f_s(t^s_2), \dots , f_s(t^s_{n_s})\) at a small number \(n_s\) of time points \(t^s_1\), \(t^s_2\), \(\dots\), \(t^s_{n_s}\). Such short time series appear in many practical scenarios.

Fig. 1 Example of a short multivariate time series with irregular sampling intervals and missing data. There are 3 classes with 1 sample in each class. Each colour corresponds to a class, and each plot represents a single feature. Some features or time points might be missing across samples. For instance, Feature 2 is not measured for the sample in Class 2, and Feature 1 and Feature 2 are not measured at the same time points for the samples in Class 1 and Class 2

For instance, think of medical check-ups, clinical trials, or psychological studies, where patients (samples) from different conditions (classes: diseased, treated with a drug, healthy controls, ...) are surveyed for several variables (features: electrocardiogram measurements, concentration of a specific protein, ...) sparsely over weeks or months. Ignoring that those values come from the same function and simply concatenating values across all time points leads to high–dimensional data that is strongly correlated across time points. This, in turn, causes numerical issues that prevent the estimation of the covariance matrix \(\Sigma\).

Another challenge is that sampling is irregular and sparse across samples (i.e.  the number of time points \(n_s\) is small, and both \(n_s\) and the sampling times \((t^s_i)_i\) differ across samples s). Moreover, different sets of features might be measured in two distinct samples, meaning that even the value of \(d\) can vary across samples. These challenges are exemplified in Figure 1. Notably, in this example, there is substantial missing data, such as Feature 2 in Class 1 and Class 2, making traditional padding and imputation approaches impractical (Bier et al., 2022).

1.1 Related work

A first line of related research is functional data analysis (Yao et al., 2005). Time–series observations from sample s are considered as noisy samples from the underlying function \(f_s\). This approach is particularly promising when one is interested in the overall shape of the data. For example, for medical longitudinal data (Wang et al., 2022) or data arising from psychological studies (Jebb et al., 2015), one is more interested in trends or the overall pattern of the entire cohort, rather than in individual fluctuations and noise sources. In many practical scenarios, the training data is collected from a large number of subjects at a small number of time points (Yoon et al., 2019). The grouping of subjects is based on the similarity of the curves or trends they display. The learned model can then be used as a classifier to predict class membership for data from new subjects not in the training data set. In such cases, conventional LDA as shown in (1) cannot be directly applied, and time–interval discretization is required. James and Hastie (2001) addressed this issue by proposing a model which approximates observations as a noisy linear combination of spline curves, and applies an LDA–like inference procedure to the linear combination coefficients. However, this model only holds for univariate functional data. Applying FLDA individually to each curve is suboptimal due to the high dimension of data points and the fact that FLDA is not designed to estimate inter–feature correlations. Statistical models tailored to multivariate functional data might overcome this issue while preserving interpretability. Similarly to our work, Gardner-Lubbe (2021) presented an extension of FLDA to multivariate data. However, this approach completely ignores the issue of missing features. An alternative method proposed by Fukuda et al. (2023) involves the regularization of the underlying time–dependent functions, under the assumptions of regular sampling times and non–missing features. It thus proves impractical in our use case.

In the last decade, machine learning models have also been developed to tackle multivariate time–series classification (TSC). One instance is the autoregressive integrated moving average model applied to multivariate data (marima) (Spliid, 2016). However, marima is only suitable for long time series. The best–performing approaches for short TSC are today ensemble–of–classifiers methods, such as HIVE–COTE 1.0 (Lines et al., 2016) and 2.0 (Middlehurst et al., 2021), and kernel methods like ROCKET (Dempster et al., 2020) and one of its most recent extensions, MultiRocket (Tan et al., 2022). HIVE–COTE algorithms combine the class probabilities returned by four different types of classifiers through a weighting step, which outputs the consensus probabilities, whereas ROCKET–like approaches apply a large number of random convolution–based transformations to create a massive set of features. A linear classifier is then trained on those transformed features. Nonetheless, the ensemble architecture of HIVE–COTE and the random transformations of ROCKET might hinder the interpretability of the class probability outputs. Moreover, ROCKET–like methods require padding or imputation for handling missing data, which worsens the issue of high–dimensional data points and further impedes interpretability.

Finally, another line of research focuses on the growth mixture model (Ram and Grimm, 2009), which describes unobserved subpopulations as a noisy time–dependent linear combination of two latent variables. This approach is indeed adept at recognizing trends in longitudinal data, but the original model is ill–equipped to handle missing data. The literature on the topic only partially solves this problem: the issue of irregular sampling time points has been tackled in the prior works reviewed by Lee and Harring (2023), but even so, the proposed advancements cannot handle multiple, highly correlated, and possibly missing features.

1.2 Contributions

We introduce a multivariate functional linear discriminant analysis algorithm called MUDRA to tackle these issues. Our approach readily extends the scope of functional linear discriminant analysis described by James and Hastie (2001) to multivariate and missing data.

MUDRA only requires that, for each time–series sample in the data set, if the sample has any measurement for a given feature F, then F is also measured at every time point at which the sample has measurements. A negative example would be a data set containing a sample s with two features, Feature 1 and Feature 2, where Feature 1 is measured at time points \(t^s_1\) and \(t^s_3\), and Feature 2 is only measured at time points \(t^s_2\) and \(t^s_3\). To make this example comply with our requirement, either Feature 1 should be measured at \(t^s_2\) and \(t^s_3\) only, or Feature 2 should be measured at \(t^s_1\) and \(t^s_3\) only.

By approaching the problem from the standpoint of reduced–rank regression, we were able to incorporate both time– and feature–based irregularity without imputation. MUDRA generates a low–dimensional representation, based on the PARAFAC (Bro, 1997) tensor decomposition algorithm, that is particularly efficient for data sets with missing data and yields interpretable patterns. It can also reconstruct the mean time–dependent function across all samples from a class. We experimentally show the superior predictive power of MUDRA compared to baselines for time–series classification on a synthetic data set and on the real–life “Articulary Word Recognition” data set, in the presence of a large amount of missing data and when a low dimension suffices to capture the information in the data. When there is a large amount of missing data, a common occurrence in real–life applications, MUDRA consistently outperforms ROCKET for both low and high dimensionality.

1.3 Notation

If X is an \(m\times n\) matrix, we will denote the rows and columns of X as \(\textbf{x}_{1,:}, \textbf{x}_{2,:},\ldots ,\textbf{x}_{m,:}\) and \(\textbf{x}_{:,1}, \textbf{x}_{:,2},\ldots ,\textbf{x}_{:,n}\), respectively. The \(ij^{th}\) element of X will be denoted by \(x_{ij}\). If the \(\varvec{x}_t\) are m–dimensional vectors, the \(i^{th}\) component of \(\varvec{x}_t\) is denoted by \(x_{t,i}\). A general pseudo–inverse of X is denoted by \(X^-\); recall that if X is square and non–singular, the general pseudo–inverse is equal to the inverse: \(X^-=X^{-1}\). We denote the identity matrix by I regardless of its dimension. The dimensionality of a vector \(\varvec{x}\) will be denoted by \(\dim (\varvec{x})\).

For an \(m\times n\) matrix X, vec(X) denotes the vectorization of X as an mn dimensional vector, formed by stacking all of the columns of X sequentially, i.e. 

$$\begin{aligned} vec(X) \triangleq \begin{bmatrix} \textbf{x}_{:,1}^\intercal & \textbf{x}_{:,2}^\intercal & \cdots & \textbf{x}_{:,n}^\intercal \end{bmatrix}^\intercal \;. \end{aligned}$$

The Kronecker product of two matrices \(A_{m\times n}\) and \(B_{p\times q}\) is an \({mp\times nq}\) matrix, defined by

$$\begin{aligned} A\otimes B \triangleq \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B\\ a_{21}B & a_{22}B & \cdots & a_{2n}B\\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B\\ \end{bmatrix}\;. \end{aligned}$$

We will use Kronecker products extensively to turn tensor computations into simpler matrix computations. A random \(m\times n\) matrix X is said to follow a matrix–variate normal distribution with mean matrix \(M_{m\times n}\), row covariance matrix \(\Sigma _{m\times m}\) and column covariance matrix \(\Psi _{n\times n}\), if \(vec(X)\sim \mathcal {N}_{mn}(vec(M), \Psi \otimes \Sigma )\) (Gupta and Nagar, 1999). It is denoted by \(X\sim \mathcal {N}_{m\times n}(M, \Sigma , \Psi )\).
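As an illustration of how Kronecker products turn tensor computations into matrix computations, the following minimal numpy sketch checks the identity \(vec(AXB)=(B^\intercal \otimes A)\,vec(X)\) and draws one sample from a matrix–variate normal distribution; all matrices here are synthetic.

```python
# Sketch of the vec/Kronecker identity and of sampling X ~ N_{m x n}(M, Sigma, Psi)
# via vec(X) ~ N(vec(M), Psi ⊗ Sigma).
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 2
A, X, B = rng.normal(size=(m, m)), rng.normal(size=(m, n)), rng.normal(size=(n, n))

vec = lambda M: M.reshape(-1, order="F")            # column-stacking vectorization
# vec(A X B) = (B^T ⊗ A) vec(X)
assert np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X))

# Sampling: if Sigma = L L^T and Psi = R R^T, then M + L Z R^T with Z i.i.d.
# standard normal has vec-covariance Psi ⊗ Sigma.
M = np.zeros((m, n))
Sigma = np.eye(m) + 0.5              # any positive definite row covariance
Psi = 2.0 * np.eye(n)                # any positive definite column covariance
L, R = np.linalg.cholesky(Sigma), np.linalg.cholesky(Psi)
Z = rng.normal(size=(m, n))
X_sample = M + L @ Z @ R.T           # one draw from N_{m x n}(M, Sigma, Psi)
```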

We will denote the variance-covariance matrix of a vector \(\varvec{y}\) as \(\mathbb {V}(\varvec{y})\). For a matrix \(\Sigma\), \(||\varvec{y}||_\Sigma\) will denote the Mahalanobis norm of \(\varvec{y}\) with covariance \(\Sigma\), defined as \(||\varvec{y}||^2_\Sigma \triangleq \varvec{y}^\intercal \Sigma ^-\varvec{y}\). Similarly, for another vector \(\varvec{x}\), \(\langle \varvec{x},\varvec{y}\rangle _\Sigma\) will denote the Mahalanobis inner product with covariance \(\Sigma\), defined as \(\langle \varvec{x},\varvec{y}\rangle _\Sigma \triangleq \varvec{x}^\intercal \Sigma ^-\varvec{y}\). The Mahalanobis inner product thus considers the correlation and variance between two vectors, which will be useful in finding linearly independent discriminants.

Formally, we introduce the available short multivariate time-series training data as follows. We consider a class \(\mathcal {C}_{i}\) of samples in which, for any sample j, \(F_{ij}\) features at \(T_{ij}\) time points have been measured: features \(f^{i j}_{1}\), \(f^{i j}_{2}\), \(\dots\), \(f^{i j}_{F_{ij}} \in \mathbb {N}^*\) at times \(t^{i j}_{1}\), \(t^{i j}_{2}\), \(\dots\), \(t^{i j}_{T_{ij}} \in \mathbb {N}^*\), for \(i=1,2,\dots ,K\) and \(j \in \mathcal {C}_{i}\). The matrix of measurements of dimension \(T_{ij} \times F_{ij}\) for sample j in class i is denoted \(Y_{ij}\). The set of distinct features measured across all samples is denoted \(F\), and the set of all sampling time points is T, that is

$$\begin{aligned} F\triangleq \bigcup _{i \le K} \bigcup _{j \in \mathcal {C}_{i}} \bigcup _{k \le F_{ij}} \{ f^{i j}_{k} \} \quad \text { and } \quad T\triangleq \bigcup _{i \le K} \bigcup _{j \in \mathcal {C}_{i}} \bigcup _{k \le T_{ij}} \{ t^{i j}_{k} \} \;. \end{aligned}$$

Note that, since data is sampled irregularly, samples are generally not observed at common time points or with common features. Thus, the matrices \(Y_{ij}\) will possibly have different dimensions. Finally, the number of samples in class i is \(|\mathcal {C}_{i}| = m_{i}\).

2 The MUDRA algorithm

In this section, we describe a model that extends FLDA (James and Hastie, 2001) to multivariate functional data, that is, a model of the observations as a noisy linear combination of spline functions which, for the first time, incorporates multiple features. We then propose an efficient parameter inference algorithm for this model and illustrate how to use it for classification and dimension reduction.

2.1 A novel multivariate model for time-dependent functional observations

2.1.1 Description of MUDRA

Similarly to the FLDA model introduced by James and Hastie (2001), we consider \(b>0\) B–spline functions of order 3 (Eilers and Marx, 1996), denoted \(s_1, s_2, \dots , s_b\), to generate a flexible space of functionals available to our model. Then, for the multivariate observations associated with sample j in class \(\mathcal {C}_{i}\), the spline matrix \(S_{ij}\) of dimensions \(T_{ij}\times b\) is constructed by evaluating the B–spline functions at each time point \(t^{i j}_{k}\)

$$\begin{aligned} S_{ij} \triangleq \begin{bmatrix} s_1(t^{i j}_{1}) & s_2(t^{i j}_{1}) & \cdots & s_b(t^{i j}_{1})\\ s_1(t^{i j}_{2}) & s_2(t^{i j}_{2}) & \cdots & s_b(t^{i j}_{2})\\ \vdots & \vdots & \ddots & \vdots \\ s_1(t^{i j}_{T_{ij}}) & s_2(t^{i j}_{T_{ij}}) & \cdots & s_b(t^{i j}_{T_{ij}})\\ \end{bmatrix}\;. \end{aligned}$$
(2)
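For illustration, a minimal sketch of how a spline matrix such as \(S_{ij}\) in (2) can be built with scipy, assuming cubic B–splines on a uniform open knot vector (the knot placement and spline degree here are assumptions for the sketch, not necessarily those of our implementation):

```python
# Evaluate b B-spline basis functions at given time points, mirroring Eq. (2).
import numpy as np
from scipy.interpolate import BSpline

def spline_design_matrix(time_points, n_basis=10, degree=3, t_min=0.0, t_max=1.0):
    # Open uniform knot vector: `degree` repeated boundary knots on each side
    inner = np.linspace(t_min, t_max, n_basis - degree + 1)
    knots = np.concatenate(([t_min] * degree, inner, [t_max] * degree))
    S = np.empty((len(time_points), n_basis))
    for k in range(n_basis):
        coeffs = np.zeros(n_basis)
        coeffs[k] = 1.0                  # isolate the k-th basis function s_k
        S[:, k] = BSpline(knots, coeffs, degree)(time_points)
    return S

S_ij = spline_design_matrix(np.array([0.1, 0.35, 0.8]), n_basis=6)  # 3 x 6 matrix
```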

We now describe how our model deviates from FLDA and integrates multiple features. Let C be the \(F\times F\) identity matrix. Then, we truncate and reorder the columns of C depending on the features actually observed for each sample. For sample \(j \in \mathcal {C}_{i}\), define \(C_{ij}\) as the \(F\times F_{ij}\) matrix formed by concatenating the \((f^{i j}_{1})^\text {th}\), \((f^{i j}_{2})^\text {th}\), \(\dots\), \((f^{i j}_{F_{ij}})^\text {th}\) columns of C. We also introduce the spline coefficient matrix \(\eta _{ij}\) of dimensions \(b\times F\). Similarly to the standard LDA model shown in (1),

$$\begin{aligned} \eta _{ij} = \mu _{i} + \gamma _{ij}\;, \end{aligned}$$

where \(\mu _{i} \in \mathbb {R}^{b \times F}\), \(\gamma _{ij}\sim \mathcal {N}_{b\times F}(0,\Sigma ,\Psi )\) is the autoregressive noise, and \(\Sigma\) and \(\Psi\) are both positive definite matrices. Finally, let \(\epsilon _{ij}\) be the random matrix of measurement errors on sample \(j \in \mathcal {C}_{i}\). We assume that it follows a matrix–variate normal distribution, i.e. \(\epsilon _{ij}\sim \mathcal {N}_{T_{ij}\times F_{ij}}(0,\sigma ^2I,I)\) for some \(\sigma >0\). We then propose the following model of the multivariate time-dependent observations from sample j in class i

$$\begin{aligned} Y_{ij} = S_{ij}(\mu _{i} + \gamma _{ij})C_{ij} + \epsilon _{ij}\;. \end{aligned}$$

We could stop here and infer the values of parameters \(\mu _{i}\) for \(i=1,2,\dots ,K\), \(\sigma\), \(\Sigma\), and \(\Psi\). However, incorporating reduced–rank regression into the model, through a supplementary rank constraint on the model parameters, leads to more interpretable models and easier dimension reduction (Izenman, 2008). Motivated by this reduced–rank assumption, we fix the rank r of our parameter space. Each class–specific mean parameter \(\mu _{i}\) satisfies

$$\begin{aligned} \mu _{i} = \lambda _0+\Lambda \alpha _i\xi \;, \end{aligned}$$

where \(\lambda _0\) is the \(b\times F\) matrix of spline coefficients for the population mean functional, the \((\alpha _i)_{i \le K}\) are the class–specific reduced–rank spline coefficient matrices of dimensions \(r \times r\), and \(\Lambda \in \mathbb {R}^{b \times r}\) and \(\xi \in \mathbb {R}^{r\times F}\) allow the full–rank spline coefficients of each class to be reconstructed from the \(\alpha _i\)’s. Thus, the final multivariate time–dependent model for \(i=1,2,\dots ,K\) and \(j=1,2,\dots ,m_i\) is

$$\begin{aligned} Y_{ij}=S_{ij}(\lambda _0+\Lambda \alpha _i\xi +\gamma _{ij})C_{ij}+\epsilon _{ij}\;, \end{aligned}$$
(3)

where \(\epsilon _{ij}\sim \mathcal {N}_{T_{ij}\times F_{ij}}(0,\sigma ^2I,I)\), \(\gamma _{ij}\sim \mathcal {N}_{b\times F}(0,\Sigma ,\Psi )\), with parameters \(\lambda _0\), \(\Lambda\), \((\alpha _i)_{i \le K}\), \(\xi\), \(\sigma\), \(\Sigma\), \(\Psi\) and hyperparameters b and r, where \(r \le b\). Model (3) handles fragmented curves and misaligned observation times, and accommodates measurement errors without imputation. The resulting decision function enables curve classification based on a representation of low dimension r. Class–specific characteristics are captured by the coefficients \((\alpha _i)_{i \le K}\).
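For illustration, the following minimal sketch simulates one observation \(Y_{ij}\) from model (3) with synthetic, hypothetical parameter values (the spline matrix is replaced by a random stand-in for brevity, and the centering constraint on the \(\alpha _i\)'s is not enforced):

```python
# Simulating one Y_ij from model (3) with hypothetical parameters.
import numpy as np

rng = np.random.default_rng(1)
b, F, r = 6, 4, 2                        # spline basis size, total features, rank
T_ij, observed = 3, [0, 2, 3]            # time points and observed feature indices
F_ij = len(observed)

lambda0 = rng.normal(size=(b, F))        # population mean spline coefficients
Lambda = np.linalg.qr(rng.normal(size=(b, r)))[0]     # Lambda^T Lambda = I
xi = np.linalg.qr(rng.normal(size=(F, r)))[0].T       # xi xi^T = I
alpha_i = rng.normal(size=(r, r))                     # class-specific coefficients
Sigma, Psi, sigma = np.eye(b), np.eye(F), 0.1

S_ij = rng.normal(size=(T_ij, b))        # stands in for the spline matrix of Eq. (2)
C_ij = np.eye(F)[:, observed]            # F x F_ij column-selection matrix

gamma_ij = np.linalg.cholesky(Sigma) @ rng.normal(size=(b, F)) @ np.linalg.cholesky(Psi).T
eps_ij = sigma * rng.normal(size=(T_ij, F_ij))
Y_ij = S_ij @ (lambda0 + Lambda @ alpha_i @ xi + gamma_ij) @ C_ij + eps_ij
```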

However, in this model, \(\lambda _0\), \(\Lambda\), the \(\alpha _i\)'s and \(\xi\) are interdependent, which makes parameter estimation harder. To resolve this issue, we additionally enforce that \(\Lambda\) and \(\xi ^\intercal\) have orthonormal columns and that the \((\alpha _i)_i\) are centered:

$$\begin{aligned} \Lambda ^\intercal \Lambda =I, \quad \xi \xi ^\intercal =I, \quad \text {and} \quad \sum _{i=1}^K m_i\alpha _i=0\;. \end{aligned}$$
(4)

As previously mentioned, fitting the model described in (3) boils down to estimating parameters \(\lambda _0,\Lambda , \alpha _i, \xi , \Sigma , \Psi\) and \(\sigma\). We describe in Algorithm 1 an expectation/conditional–maximization (ECM) algorithm (Meng and Rubin, 1993) for parameter inference. The ECM algorithm is an extension of the well–known expectation/maximization (EM) algorithm for the iterative maximization of the joint log–likelihood over the model parameters, which handles missing values in the E–step. This algorithm decouples the inference of most of the parameters across classes \(i=1,2,\dots ,K\) and samples \(j \in \mathcal {C}_{i}\). If \(\gamma _{ij}\sim \mathcal {N}_{b\times F}(0,\Sigma ,\Psi )\), then thanks to the properties of matrix–variate normal distributions, \(S_{ij}\gamma _{ij}C_{ij}\sim \mathcal {N}_{T_{ij}\times F_{ij}}(0,S_{ij}\Sigma S_{ij}^\intercal ,C_{ij}^\intercal \Psi C_{ij})\). Let us then denote \(\Sigma _{ij}' \triangleq S_{ij}\Sigma S_{ij}^\intercal\) and \(\Psi _{ij}' \triangleq C_{ij}^\intercal \Psi C_{ij}\), which are assumed to be positive definite matrices. Therefore, from (3),

$$\begin{aligned} \begin{aligned} vec(Y_{ij})&\sim \mathcal {N}_{T_{ij}F_{ij}}\left( vec(\mu ^Y_{i,j}),\Sigma ^Y_{ij}\right) \;, \end{aligned} \end{aligned}$$
(5)

where the mean matrix is \(\mu ^Y_{i,j} \triangleq S_{ij}(\lambda _0+\Lambda \alpha _i\xi )C_{ij}\) and the covariance matrix \(\Sigma ^Y_{ij} \triangleq \sigma ^2I_{T_{ij}F_{ij}}+\Psi _{ij}'\otimes \Sigma _{ij}'\). In this case, the joint log–likelihood \(\ell (\lambda _0, \Lambda , (\alpha _i)_i, \xi , \Sigma , \Psi , \sigma )\) on the parameters can be written as

$$\begin{aligned} & -\frac{1}{2}\sum _{i=1}^K\sum _{j=1}^{m_i} \big (vec(Y_{ij}-\mu ^Y_{i,j})^\intercal (\Sigma ^Y_{ij})^{-1}vec(Y_{ij}-\mu ^Y_{i,j}) +\log \det (\Sigma ^Y_{ij})+T_{ij}F_{ij}\log (2\pi )\big ). \end{aligned}$$
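For illustration, the contribution of a single sample to this joint log–likelihood can be evaluated directly from (5), as in the following minimal sketch (all inputs are assumed to be given; sample_loglik is a hypothetical helper, not part of our implementation):

```python
# Per-sample contribution to the joint log-likelihood, following Eq. (5).
import numpy as np
from scipy.stats import multivariate_normal

def sample_loglik(Y_ij, S_ij, C_ij, lambda0, Lambda, alpha_i, xi, Sigma, Psi, sigma):
    vec = lambda M: M.reshape(-1, order="F")
    mu_Y = S_ij @ (lambda0 + Lambda @ alpha_i @ xi) @ C_ij      # mean matrix
    Sigma_p = S_ij @ Sigma @ S_ij.T                             # Sigma'_ij
    Psi_p = C_ij.T @ Psi @ C_ij                                 # Psi'_ij
    cov_Y = sigma**2 * np.eye(mu_Y.size) + np.kron(Psi_p, Sigma_p)
    return multivariate_normal.logpdf(vec(Y_ij), mean=vec(mu_Y), cov=cov_Y)
```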
Algorithm 1 MUDRA Algorithm: Model Parameter Inference

Algorithm 1 aims at maximizing \(\ell (\lambda _0, \Lambda , (\alpha _i)_i, \xi , \Sigma , \Psi , \sigma )\) without computing any Kronecker–product inverses, so as to remain efficient. It relies on alternately optimizing each parameter. Most prominently, this algorithm features the tensor decomposition algorithm PARAFAC (Bro, 1997) at Line 11. The Sylvester equation at Line 8 of Algorithm 1 is solved using the algorithm described in Bartels and Stewart (1972), and the equation at Line 9 is solved with the generalized minimal residual (GMRES) method (Saad and Schultz, 1986). Further details are given in Appendix A.
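For illustration, the two scipy routines on which the implementation relies can be called as follows on toy systems; the actual systems solved at Lines 8 and 9 are those derived in Appendix A, not the placeholder matrices used here.

```python
# Toy calls to the Bartels-Stewart (Sylvester) solver and to GMRES.
import numpy as np
from scipy.linalg import solve_sylvester
from scipy.sparse.linalg import gmres, LinearOperator

rng = np.random.default_rng(2)
A, B, Q = rng.normal(size=(4, 4)), rng.normal(size=(3, 3)), rng.normal(size=(4, 3))
X = solve_sylvester(A, B, Q)               # Bartels-Stewart: solves A X + X B = Q
assert np.allclose(A @ X + X @ B, Q)

# GMRES solves M z = v without forming Kronecker-product inverses:
# a LinearOperator only needs matrix-vector products.
M = np.kron(np.eye(3), A @ A.T) + 0.1 * np.eye(12)
op = LinearOperator((12, 12), matvec=lambda z: M @ z)
v = rng.normal(size=12)
z, info = gmres(op, v)                     # info == 0 on convergence
```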

2.1.2 MUDRA as an extension to FLDA in a multivariate setting

(3) exactly retrieves FLDA when \(F_{ij}=|F|=1\) for all \(i=1,2,\dots ,K\) and \(j \in \mathcal {C}_{i}\) (without the additional constraints in (4)). Indeed, in that case, \(C=1\), and the class i–specific coefficients become \(\alpha _i \xi\).

2.1.3 Selection of hyperparameters in MUDRA

The selection of the two hyperparameters, b (number of spline functions) and r (parameter rank), has a large impact on the quality of the approximation. We assess the model’s performance on different values of r in the experimental study. For the selection of b, there is a tradeoff between the richness of the available space of functionals and the computation cost of the inference.

2.1.4 Theoretical time complexity of MUDRA

MUDRA mainly focuses on reliability in the face of missing data for time–series classification. However, ensuring a tractable algorithm for parameter inference is crucial. In each ECM iteration, there are (1) \(\sum _{i \le K} m_i\) calls to the Bartels-Stewart algorithm (Line 8), (2) \(K\) calls to the GMRES algorithm (Line 9), (3) one call to the alternating least squares (ALS) minimization procedure for the PARAFAC tensor decomposition (Line 11), and finally, (4) an iterative optimization step (Lines \(14-16\)).

1. One call to the Bartels-Stewart algorithm on sample \(j \in \mathcal {C}_{i}\) has an overall computational cost in \(\mathcal {O}(T_{ij}^3+F_{ij}^3+T_{ij}F_{ij}^2+T_{ij}^2F_{ij})\), by computing the Schur decompositions with the algorithm in Golub et al. (1979).

2. GMRES becomes increasingly expensive at each iteration: running k steps of the GMRES method has a time complexity in \(\mathcal {O}(b Fk^2)\). However, in theory, GMRES converges at step \(k \le b \times F\), so in the worst case, any call to GMRES has a time complexity in \(\mathcal {O}(b^2 F^2)\).

3. The PARAFAC decomposition is computed by alternately minimizing over w, \(\hat{\Lambda }\), \(\hat{\xi }\) and \(\varvec{c}\). The number of minimization steps across all parameters is limited to 100 in our implementation. Each single–parameter minimization problem is linear in that parameter, so the time complexity is mainly driven by the matrix inversion and product operations.

4. Theoretically, the iterative optimization step in Lines \(14-16\) converges, but no convergence rate is provided. In practice, we stop either after a maximum number of iterations (50 in our experiments) or once the value of \(\hat{\sigma }\) has converged, i.e., at step q of the iteration, \(|\hat{\sigma }_{q-1}^{-1}(\hat{\sigma }_{q-1}-\hat{\sigma }_q)| < 10^{-5}\). The remaining cost of this step mainly stems from matrix products.

We also compare the empirical runtimes of MUDRA and of one state–of–the–art algorithm, ROCKET (Dempster et al., 2020), in the experimental study in Section 3.

2.1.5 Convergence of MUDRA

The convergence of the MUDRA algorithm can be shown as a special case of the convergence of ECM algorithms (Meng and Rubin, 1993), as detailed in Appendix C.

2.1.6 Implementation of MUDRA

In our implementation of MUDRA, we resort to the implementations of the Bartels-Stewart algorithm and of GMRES available in the Python package scipy (Virtanen et al., 2020). Moreover, we use the canonical polyadic tensor decomposition (as opposed to the Tucker decomposition (Tucker, 1966)), computed with the PARAFAC decomposition function of the Python package Tensorly (Kossaifi et al., 2016).
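For illustration, a minimal usage sketch of the PARAFAC decomposition in Tensorly on a synthetic tensor (function names follow a recent Tensorly release and may differ slightly in older versions):

```python
# Canonical polyadic (PARAFAC) decomposition of a synthetic 3-way tensor.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(3)
tensor = tl.tensor(rng.normal(size=(6, 4, 5)))
cp = parafac(tensor, rank=2, n_iter_max=100)   # alternating least squares
weights, factors = cp                          # one factor matrix per mode
reconstruction = tl.cp_to_tensor(cp)           # low-rank approximation of `tensor`
```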

2.2 Classification and dimension reduction with MUDRA

Once the parameter values in (3) have been inferred by running Algorithm 1, an LDA–like approach to classification (1) can be applied to a new sample \(Y \in \mathbb {R}^{T' \times F'}\). Our model defines the density function \(f_i\) of class \(i=1,2,\dots ,K\) and the prior probability \(\pi _i\). In our experiments, we assume a uniform prior, that is, \(\pi _i = 1/K\) for all \(i=1,2,\dots ,K\), but other types of prior might also be considered. Then, the class i(Y) predicted for sample Y is

$$\begin{aligned} \mathop {\mathrm {arg\,max}}\limits _{i \le K}\; \log \mathbb {P}(C=i | X=Y) = \mathop {\mathrm {arg\,max}}\limits _{i \le K}\; \log f_i(Y) + \log \pi _i \;. \end{aligned}$$
(6)

Let us derive the expression of \(f_i\) from (5). Denote by \(S_Y\) the spline matrix in (2) associated with the time points of Y, by \(C_Y\) the corresponding matrix of reordered feature columns, and define \(M_Y \triangleq \sigma ^2I+(C_Y^\intercal \Psi C_Y)\otimes (S_Y\Sigma S_Y^\intercal )\) and

$$\begin{aligned} \begin{aligned} vec(\hat{\alpha }_Y) \triangleq&\big ((\xi C_Y\otimes \Lambda ^\intercal S_Y^\intercal )M_Y^-(C_Y^\intercal \xi ^\intercal \otimes S_Y\Lambda )\big )^{-\frac{1}{2}} (\xi C_Y\otimes \Lambda ^\intercal S_Y^\intercal )M_Y^{-1}vec(Y-S_Y\lambda _0C_Y)\;. \end{aligned} \end{aligned}$$
(7)

Then, ignoring terms that do not depend on i and leveraging the computations in Appendix B, \(\log f_i(Y)\) in (6) can be replaced with

$$\begin{aligned} \log f_i(Y)= & -\frac{1}{2}\big (||vec(Y - S_Y(\lambda _0+\Lambda \alpha _i\xi )C_Y)||^2_{M_Y} + \log \det (M_Y)+T_YF_Y\log (2\pi )\big )\\ \propto & -\frac{1}{2}||vec(Y - S_Y(\lambda _0+\Lambda \alpha _i\xi )C_Y)||^2_{M_Y}\\ \propto & -\frac{1}{2}||vec(\hat{\alpha }_Y)-\big ((\xi C_Y\otimes \Lambda ^\intercal S_Y^\intercal ) M_Y^-(C_Y^\intercal \xi ^\intercal \otimes S_Y\Lambda )\big )^{\frac{1}{2}}vec(\alpha _i)||^2\;. \end{aligned}$$

(B8) in Appendix B shows that the covariance of \(\hat{\alpha }_Y\) satisfies \(\mathbb {V}[vec(\hat{\alpha }_Y)]=I\). \(vec(\hat{\alpha }_Y)\) can be reshaped into an \(r\times r\) matrix, so we obtain an \(r^2\)–dimensional representation of Y.
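For illustration, a minimal sketch of computing this \(r^2\)–dimensional representation from (7), assuming all parameters, \(S_Y\) and \(C_Y\) are available (reduce_sample is a hypothetical helper; general pseudo-inverses are used throughout, as in (7)):

```python
# Low-dimensional representation of a new sample Y via Eq. (7).
import numpy as np
from scipy.linalg import fractional_matrix_power

def reduce_sample(Y, S_Y, C_Y, lambda0, Lambda, xi, Sigma, Psi, sigma, r):
    vec = lambda M: M.reshape(-1, order="F")
    M_Y = sigma**2 * np.eye(Y.size) + np.kron(C_Y.T @ Psi @ C_Y, S_Y @ Sigma @ S_Y.T)
    M_inv = np.linalg.pinv(M_Y)                         # general pseudo-inverse M_Y^-
    A = np.kron(xi @ C_Y, Lambda.T @ S_Y.T)             # (xi C_Y) ⊗ (Lambda^T S_Y^T)
    G = A @ M_inv @ A.T                                 # assumed invertible here
    vec_alpha = fractional_matrix_power(G, -0.5) @ A @ M_inv @ vec(Y - S_Y @ lambda0 @ C_Y)
    return np.real(vec_alpha).reshape((r, r), order="F")   # r x r representation
```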

3 Experimental study

In this section, we report the empirical results of MUDRA regarding predictive power, interpretability, and time efficiency. The overall experimental pipeline involves the extraction of features using an algorithm such as MUDRA or ROCKET (Dempster et al., 2020), followed by the classification of the samples based on the extracted features via a ridge regression–based classifier with balanced class weights, as shown in Figure 2. In our experiments, we require that the time series are aligned based on time elapsed since the first observation. However, if there is an experimental timeline, we can align the data points along that timeline and apply MUDRA straightforwardly. For example, if we know when our experiment starts, we can align our sample timelines even if the first observation is missing.
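For illustration, the classification stage of this pipeline can be sketched as follows, with random arrays standing in for MUDRA- or ROCKET-extracted features (the exact classifier settings used in our experiments may differ):

```python
# Ridge-regression-based classification of extracted features, with balanced weights.
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

rng = np.random.default_rng(4)
# Stand-ins for extracted features: (n_samples, n_features) arrays and class labels
X_train, y_train = rng.normal(size=(100, 9)), rng.integers(0, 3, size=100)
X_test = rng.normal(size=(40, 9))

clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10), class_weight="balanced")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```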

Fig. 2 Pipeline for the experimental study in Section 3. The red box (“MUDRA”) can be replaced with any other corresponding algorithm, for instance ROCKET (Dempster et al., 2020)

We selected the ROCKET algorithm (Dempster et al., 2020) as the baseline for MUDRA. A recent review by Ruiz et al. (2021) showed that ROCKET is currently the best-performing model, both in terms of efficiency and accuracy, for feature transformation in short–time–series classification. However, when data is missing, ROCKET cannot be applied directly. Inspired by Bier et al. (2022), we applied three forms of padding and imputation to the data set prior to running ROCKET (a minimal sketch of each scheme follows the list below):

1. “Imputation”–type padding: the recorded data points are aligned according to their sampling times; missing time points and missing features are set to 0.

2. “End”–type padding: all missing features are imputed with 0, all measured time points are concatenated, and zeros are appended at the end to obtain time series of the same length.

3. Last Observation Carried Forward (LOCF) imputation: the last observed non-missing value is used to fill in missing values at later time points.
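For illustration, minimal sketches of these three schemes, assuming each sample is stored as a (time × feature) array with NaN marking missing entries (the exact preprocessing code used in our experiments may differ in details):

```python
# Three padding/imputation schemes applied to a single sample before ROCKET.
import numpy as np

def imputation_padding(values, time_idx, n_time, n_feat):
    """'Imputation'-type: place observations on the full time grid, zeros elsewhere.
    values: (n_obs, n_feat) with NaN for unmeasured features; time_idx: grid indices."""
    out = np.zeros((n_time, n_feat))
    out[time_idx] = np.nan_to_num(values)
    return out

def end_padding(values, n_time):
    """'End'-type: zero-fill missing features, keep measured time points contiguous,
    then append zero rows up to the common length n_time."""
    filled = np.nan_to_num(values)
    pad = np.zeros((n_time - filled.shape[0], filled.shape[1]))
    return np.vstack([filled, pad])

def locf(values):
    """Last Observation Carried Forward along the time axis."""
    out = np.array(values, dtype=float)
    for t in range(1, out.shape[0]):
        missing = np.isnan(out[t])
        out[t, missing] = out[t - 1, missing]
    return np.nan_to_num(out)   # leading NaNs (no earlier observation) set to 0
```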

The experiments, along with the environment configuration and package versions used for ROCKET and MUDRA, are available in the GitHub repository (https://github.com/rbordoloi/MUDRA). The reported experiments were run on a CPU cluster (configuration: AMD EPYC 7763 64-core processor, 20 cores @2.40GHz, 8GB RAM). None of the algorithms used a GPU.

3.1 Time–series classification on a synthetic data set

We built a data set with two features, 12 time points, two classes, and 300 samples, as described in Appendix D. For this experiment, we set \(b=10\) spline basis functions as a tradeoff between performance and computational cost. To assess the predictive power of our model, we collected the \(\mathcal {L}^2\)-norm of the functional \(R^2\) values (Ramsay and Silverman, 2005) for each class as a goodness-of-fit measure, and the \(F_1\) scores for the corresponding classification task for multiple values of r. \(R^2\) values and \(F_1\) scores are reported in Figure 3a and Figure 3b, respectively.

Fig. 3 Synthetic experiments with \(b=10\). Left: Approximate \(\mathcal {L}^2\)-norm of the functional \(R^2\) values for \(r \in \{1,\ldots ,10\}\). Right: \(F_1\)–scores of ROCKET and MUDRA on the classification task on the synthetic data set for \(r \in \{1,\ldots ,10\}\)

We observe that the approximate \(\mathcal {L}^2\)-norms of the \(R^2\) values show a general increasing trend: an increase in the number of class-specific parameters results in a corresponding increase in the goodness-of-fit. Similarly, Figure 3b shows an increase in classification accuracy in terms of \(F_1\)–score as the output dimensionality increases. For low dimensionalities (\(r\le 4\)), MUDRA outperforms ROCKET (with Last Observation Carried Forward imputation) combined with a ridge regression–based classifier. Figure 4 shows that MUDRA is able to retrieve the true functionals in the synthetic data set.

Fig. 4 Top: curves of Feature 1 (left) and Feature 2 (right) across the three classes. Bottom: corresponding curves estimated by MUDRA with \(r=3\), \(b=7\)

3.2 Time–series dimension reduction on real-life data

We also applied the MUDRA algorithm to the real-world data set “Articulary Word Recognition” from Dau et al. (2019). This data set compiles measurements of the movements of the tongue and lips during speech in native English speakers pronouncing 25 words. The time-varying tongue and lip movements were measured by an electromagnetic articulograph with 12 sensors that return 3D coordinates at a sampling rate of 200 Hz during the pronunciation of each word. Each word represents one class, and the goal is to identify the time-dependent movements associated with each word (Wang et al., 2013). The training set contains 275 series/samples, and the testing set 300. In this use case, we restrict ourselves to 9 dimensions corresponding to the outputs from three sensors: one on the tongue tip, one on the upper lip, and the last on the lower lip (Shokoohi-Yekta et al., 2017).

Further details about the simulation of missing data are provided in Appendix D. We applied MUDRA or ROCKET to perform a dimension reduction on the data set, and the resulting features were fed to a ridge regression–based classifier. This procedure tests how informative the resulting representations are, which is a first step towards interpretability. We also classified the data directly through the Bayes–optimal classifier described in Section 2.2. Since ROCKET always reduces to an even number of dimensions, and MUDRA to \(r^2\) dimensions, to ensure fairness when comparing both algorithms, we set the output dimension of ROCKET to the smallest even number larger than \(r^2\). The two plots in Figure 5 report the \(F_1\) scores obtained for the classification tasks for multiple values of the output dimension r, on complete data (i.e.  without any missing data) and on missing data, where the procedure in Appendix D has been applied to the data set. Figure 5a shows that MUDRA competes with ROCKET for very low–dimensional representations, but ROCKET overtakes MUDRA as we increase the output dimension.

Figure 5b shows that MUDRA clearly outperforms ROCKET in all of these cases.

Fig. 5 ROCKET versus MUDRA with a ridge regression–based classifier on the “Articulary Word Recognition” data set with \(b=9\) and \(r \in [1,9]\). Left: MUDRA versus ROCKET on complete data. The blue curve corresponds to the Bayes–optimal rule in MUDRA, the orange one to the dimension reduction in MUDRA combined with a ridge regression classifier, and the green curve to ROCKET–transformed features fed to the classifier. Right: MUDRA versus ROCKET on missing data. The blue curve corresponds to the Bayes–optimal rule in MUDRA, the orange one to the dimension reduction in MUDRA combined with a ridge regression classifier, the green (resp., red) curve to ROCKET–transformed features with “imputation” (resp., “end”)–type padding fed to the classifier, and the purple curve to ROCKET–transformed features with LOCF imputation (Color figure online)

3.3 Empirical runtime of the MUDRA algorithm

To assess the empirical time complexity of MUDRA, we collected the runtimes over \(N=1,000\) iterations of both MUDRA and ROCKET combined with a ridge classifier on the real–life data set (without missing data, \(r=7\), \(b=9\)). Resulting runtimes are reported in Figure 6.

Fig. 6 Runtimes for ROCKET and MUDRA on the complete real–life data set for \(r=7\) and \(b=9\) (\(N=1{,}000\) iterations per algorithm)

As evidenced by Figure 6, the current implementation of MUDRA is approximately 12 times slower on average than ROCKET. Since ROCKET is a simple algorithm that shifts most of the burden of inference to the subsequent classification algorithm, it is extremely fast. However, as previously mentioned, it lacks some of the predictive power of MUDRA.

4 Conclusion

This work presents a novel approach to dealing with missing data in multivariate functional data, which has historically proven to be a challenging problem for short–time–series classification. Contemporary algorithms that achieve state-of-the-art performance typically resort to padding; however, this strategy proves suboptimal for very short time series with a large proportion of missing data. Our proposed algorithm, MUDRA, uniquely enables model training on selected fragments of a high–dimensional surface without necessitating regularization to reconstruct the entire surface for each sample. This eliminates the need for padding and imputation. Our algorithm outperforms the current state-of-the-art models in cases where a very low–dimensional representation is required and/or a large proportion of the data is missing. Such cases frequently arise in real-life data, especially in biomedical applications. MUDRA is also more interpretable than the state of the art in several ways. Firstly, the estimated class functionals can be recovered (see (3)), which is impossible with other methods for this problem. Additionally, since the model is based on a statistical functional regression approach, the estimated parameters are maximum likelihood estimators. This means that one could, in principle, construct confidence intervals and parametric hypothesis tests based on these estimates (see Appendix A). Finally, since the underlying parametric assumptions involve Gaussian distributions, one can also interpret the embeddings generated by the model as clusters and assess new data points based on their closeness to the various cluster centroids (see the experiments on the real-life “Articulary Word Recognition” data set in Section 3.2).

While our approach demonstrates considerable flexibility and efficiency, it has some shortcomings. Empirically, we showed that the MUDRA algorithm sacrifices some time efficiency in favor of interpretability. Improving the implementation of Algorithm 1 by resorting to faster routines (Nguyen et al., 2016; Song et al., 2022), switching the programming language to C or Fortran, or approximating the solutions of the equations (Wang et al., 2022) could tackle this issue.

On the theoretical side, one issue we could not solve is the irregular sampling of features within a single subject. Consider data being recorded for a sample \(Y_{ij}\) at time points \(t^{i j}_{1}\) and \(t^{i j}_{2}\). If the features recorded for \(Y_{ij}\) at \(t^{i j}_{1}\) are not the same as those at \(t^{i j}_{2}\), our model cannot train on that data point without resorting to the imputation or deletion of some of the features. Another significant limitation is the assumption of common covariance matrices \(\Sigma\) and \(\Psi\) for the autoregressive noise in each class, which stems from the LDA standpoint. This is sometimes not representative of real data, for example clinical data, as certain conditions can have a major impact on the variability of the levels of certain metabolites. A quadratic discriminant analysis (QDA) approach might have mitigated these issues. However, class-specific covariance matrices raise a number of computational problems in the ECM algorithm and do not lead to a relatively simple expression for the dimension reduction of a new sample. Tackling these shortcomings would improve the applicability of short–time–series classification algorithms to real-life use cases.