Abstract
The increasingly widespread availability of time-series data, especially for biomedical purposes, raises new methodological challenges, particularly regarding missing values. Functional linear discriminant analysis (FLDA) extends Linear Discriminant Analysis (LDA)-mediated multiclass classification and dimension reduction to data in the form of fragmented observations of a univariate function. For large multivariate and partially-observed data, there are two challenges: (i) statistical dependencies between different components of a multivariate function and (ii) heterogeneous sampling times with missing features. We here develop a multivariate version of FLDA, called MUDRA, to tackle these challenges and describe a computationally efficient expectation/conditional-maximisation (ECM) algorithm to infer its parameters without any tensor inversions. We assess its predictive power on the “Articulary Words” dataset and show its improvement over the state-of-the-art, especially in the case of missing data. This advancement in dimension reduction of multivariate functional data holds promise for enhancing classification accuracy in scenarios such as the analysis of partially observed short multivariate time series.
1 Introduction
Linear discriminant analysis (LDA) is a classical strategy that is still widely used (Graf et al., 2024; McLaughlin and Su, 2023; Zhu et al., 2022) for classification and dimensionality reduction. Its origins trace back to Fisher’s discriminant (Fisher, 1936) and its multiclass extension (Rao, 1948). The main assumption of LDA is that samples of class \(i=1,2,\dots ,K\) follow a multivariate normal distribution with mean \(\mu _{i} \in \mathbb {R}^d\) and shared covariance matrix \(\Sigma \in \mathbb {R}^{d\times d}\) of full rank, denoted \(\mathcal {N}_{d}(\mu _{i}, \Sigma )\). The Bayes-optimal rule for classification is then applied to predict the class \(i(\varvec{x})\) of a new sample \(\varvec{x} \in \mathbb {R}^d\)
Since \(\mathbb {P}(C=i | X=\varvec{x}) \propto \mathcal {N}_{d}(\varvec{x}; \mu _{i}, \Sigma )\pi _i\) by Bayes’ theorem, where \(\pi _i\) is the prior probability of class i, reordering and ignoring terms that are constant across classes yields a decision function which is linear in \(\varvec{x}\)
Parameters \((\mu _{i})_{i \le K}\), \(\Sigma\), \((\pi _i)_{i \le K}\) can be inferred by maximizing the joint likelihood on a set of training samples.
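For illustration, the Bayes-optimal LDA rule above can be written in a few lines of numpy; the following sketch is ours (not taken from any package), fits the maximum-likelihood parameters, and evaluates the linear discriminant scores:

import numpy as np

def lda_fit(X, y):
    # Maximum-likelihood estimates of the LDA parameters.
    # X: (n, d) training samples, y: (n,) integer class labels.
    classes = np.unique(y)
    pis = np.array([np.mean(y == k) for k in classes])           # priors pi_i
    mus = np.array([X[y == k].mean(axis=0) for k in classes])    # class means mu_i
    resid = np.concatenate([X[y == k] - mus[i] for i, k in enumerate(classes)])
    Sigma = resid.T @ resid / len(X)                             # pooled covariance
    return classes, pis, mus, Sigma

def lda_predict(x, classes, pis, mus, Sigma):
    # Linear scores delta_i(x) = x^T Sigma^-1 mu_i - 0.5 mu_i^T Sigma^-1 mu_i + log pi_i.
    Sinv_mu = np.linalg.solve(Sigma, mus.T)                      # (d, K)
    scores = x @ Sinv_mu - 0.5 * np.sum(mus.T * Sinv_mu, axis=0) + np.log(pis)
    return classes[np.argmax(scores)]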
Functional linear discriminant analysis (FLDA) (James and Hastie, 2001) extends this framework to the case where \(\varvec{x}\) is not simply a \(d\)–dimensional vector, but the output of a time–dependent function f associated with sample s: \(\varvec{x}_s=f_s(t) \in \mathbb {R}^{d}\). Note that FLDA is restricted to the univariate case, i.e. \(d=1\). In practice, one has of course no access to the true function \(f_s: \mathbb {N}^* \rightarrow \mathbb {R}^{d}\), but only to some of its values \(f_s(t^s_1), f_s(t^s_2), \dots , f_s(t^s_{n_s})\) at a small number \(n_s\) of time points \(t^s_1\), \(t^s_2\), \(\dots\), \(t^s_{n_s}\). Such short time series appear in many practical scenarios.
Fig. 1: Example of a short multivariate time series with irregular sampling intervals and missing data. There are 3 classes with 1 sample in each class. Each colour corresponds to a class, and each plot represents a single feature. Some features or time points might be missing across samples. For instance, Feature 2 is not measured for the sample in Class 2, and Features 1 and 2 are not measured at the same time points for the samples in Class 1 and Class 2
For instance, think of medical check-ups, clinical trials, or psychological studies, where patients (samples) from different conditions (classes: diseased, treated with a drug, healthy controls, ...) are surveyed for several variables (features: electrocardiogram measurements, concentration of a specific protein, ...) sparsely over weeks or months. Ignoring that those values come from the same function and simply concatenating values across all time points will lead to high–dimensional data strongly correlated across time points. This, in turn, would lead to numerical issues that prevent the estimation of the covariance matrix \(\Sigma\).
Another challenge is that sampling occurs at irregular, short intervals of time points across samples (i.e. different, small values of \(n_s\) and different \((t^s_i)_i\) across samples s). Moreover, different sets of features might be measured in two distinct samples, meaning that even the value of \(d\) can vary across samples. Those challenges are exemplified in Figure 1. Notably, in this example, there is substantial missing data, such as Feature 2 in Class 1 and Class 2, making traditional padding and imputation approaches impractical (Bier et al., 2022).
1.1 Related work
A first line of related research is functional data analysis (Yao et al., 2005). Time–series observations from sample s are considered as noisy samples from the underlying function \(f_s\). This approach is particularly promising when one is interested in the overall shape of the data. For example, for medical longitudinal data (Wang et al., 2022) or data arising from psychological studies (Jebb et al., 2015), one is more interested in trends or the overall pattern of the entire cohort, rather than in individual fluctuations and noise sources. In many practical scenarios, the training data is collected from a large number of subjects at a small number of time points (Yoon et al., 2019). The grouping of subjects is based on a similarity of the curves or trends displayed. The learned model can then be used as a classifier to predict class membership for data from new subjects not in the training data set. In such cases, conventional LDA as shown in (1) cannot be directly applied, and time interval discretization is required. James and Hastie (2001) addressed this issue by proposing a model which approximates observations as a noisy linear combination of spline curves, and applies an LDA–like inference procedure to the linear combination coefficients. However, this model only holds for univariate functional data. Applying FLDA individually to each curve is suboptimal due to the high dimension of data points and the fact that FLDA is not designed to estimate inter–feature correlations. Statistical models tailored to multivariate functional data might overcome these issues. Similarly to our work, Gardner-Lubbe (2021) presented an extension of FLDA to multivariate data. However, this approach completely ignores the issue of missing features. An alternative method proposed by Fukuda et al. (2023) involves the regularization of the underlying time–dependent functions, under the assumptions of regular sampling times and non–missing features. It thus proves impractical in our use case.
In the last decade, deep learning models have also been developed to tackle multivariate time–series classification (TSC). One instance is the autoregressive integrated moving average model applied to multivariate data (marima) (Spliid, 2016). However, marima is only suitable for long time series. The highest-performing approaches for short TSC are currently ensemble–of–classifiers methods, such as HIVE–COTE 1.0 (Lines et al., 2016) and 2.0 (Middlehurst et al., 2021), and kernel methods like ROCKET (Dempster et al., 2020) and one of its most recent extensions, MultiRocket (Tan et al., 2022). HIVE–COTE algorithms combine the class probabilities returned by four different types of classifiers through a weighting step, which outputs the consensus probabilities, whereas ROCKET–like approaches apply a large number of data transformations based on random convolutional kernels to create a massive number of features. A linear classifier is then trained on those transformed features. Nonetheless, the ensemble architecture of HIVE–COTE and ROCKET might hinder the interpretability of the class probability outputs. Moreover, ROCKET–like methods require padding or imputation for handling missing data, which worsens the issue of high-dimensional points and impedes interpretability again.
Finally, another line of research focuses on the growth mixture model (Ram and Grimm, 2009), which describes unobserved subpopulations via a noisy time–dependent linear combination of two latent variables. This approach is indeed adept at recognizing trends in longitudinal data, but the original model is ill–equipped to handle missing data. The literature on the topic only partially solves this problem: the issue of irregular sampling time points has been tackled in the prior works reviewed by Lee and Harring (2023), but even so, the proposed advancements cannot handle multiple, highly correlated and missing features.
1.2 Contributions
We introduce a multivariate functional linear discriminant analysis algorithm called MUDRA to tackle these issues. Our approach readily extends the scope of functional linear discriminant analysis described by James and Hastie (2001) to multivariate and missing data.
MUDRA only requires that, for each time-series sample in the data set, if the sample has any measurement of a given feature F, then F is measured at every time point at which the sample has measurements. A negative example would be a data set containing a sample s with two features, Feature 1 and Feature 2, where Feature 1 is measured at time points \(t^s_1\) and \(t^s_3\), and Feature 2 is only measured at time points \(t^s_2\) and \(t^s_3\). To make this example comply with our requirement, either Feature 1 should be measured at \(t^s_2\) and \(t^s_3\) only, or Feature 2 should be measured at \(t^s_1\) and \(t^s_3\) only.
By approaching the problem from the standpoint of reduced-rank regression, we are able to incorporate both time- and feature-based irregularity without imputation. MUDRA generates a low–dimensional representation, based on the PARAFAC (Bro, 1997) tensor decomposition algorithm, that is particularly efficient for data sets with missing data and yields interpretable patterns. It can also reconstruct the mean time–dependent function across all samples from a class. We experimentally show the superior predictive power of MUDRA compared to baselines for time–series classification on a synthetic data set and on the real-life “Articulary Word Recognition” data set, in the presence of a large amount of missing data and when a low dimension is enough to capture the information in the data. When there is a large amount of missing data, a common occurrence in real–life applications, MUDRA consistently outperforms ROCKET for both low and high dimensionality.
1.3 Notation
If X is an \(m\times n\) matrix, we will denote the rows and columns of X by \(\textbf{x}_{1,:}, \textbf{x}_{2,:},\ldots ,\textbf{x}_{m,:}\) and \(\textbf{x}_{:,1}, \textbf{x}_{:,2},\ldots ,\textbf{x}_{:,n}\) respectively. The \(ij^{th}\) element of X will be denoted by \(x_{ij}\). If the \(\varvec{x}_t\) are m–dimensional vectors, the \(i^{th}\) component of \(\varvec{x}_t\) is denoted by \(x_{t,i}\). A general pseudo–inverse of X is denoted by \(X^-\). Recall that if X is square and non–singular, the general pseudo–inverse is equal to the inverse: \(X^-=X^{-1}\). We denote the identity matrix by I, regardless of its dimension. The dimensionality of a vector \(\varvec{x}\) will be denoted by \(dim(\varvec{x})\).
For an \(m\times n\) matrix X, vec(X) denotes the vectorization of X as an mn dimensional vector, formed by stacking all of the columns of X sequentially, i.e.
The Kronecker product of two matrices \(A_{m\times n}\) and \(B_{p\times q}\) is an \({mp\times nq}\) matrix, defined by
We will use Kronecker products extensively to turn tensor computations into simpler matrix computations. A random \(m\times n\) matrix X is said to follow a matrix–variate normal distribution with mean matrix \(M_{m\times n}\), row covariance matrix \(\Sigma _{m\times m}\) and column covariance matrix \(\Psi _{n\times n}\), if \(vec(X)\sim \mathcal {N}_{mn}(vec(M), \Psi \otimes \Sigma )\) (Nagar and Gupta, 1999). It is denoted by \(X\sim \mathcal {N}_{m\times n}(M, \Sigma , \Psi )\).
We will denote the variance-covariance matrix of a vector \(\varvec{y}\) as \(\mathbb {V}(\varvec{y})\). For a matrix \(\Sigma\), \(||\varvec{y}||_\Sigma\) will denote the Mahalanobis norm of \(\varvec{y}\) with covariance \(\Sigma\), defined as \(||\varvec{y}||^2_\Sigma \triangleq \varvec{y}^\intercal \Sigma ^-\varvec{y}\). Similarly, for another vector \(\varvec{x}\), \(\langle \varvec{x},\varvec{y}\rangle _\Sigma\) will denote the Mahalanobis inner product with covariance \(\Sigma\), defined as \(\langle \varvec{x},\varvec{y}\rangle _\Sigma \triangleq \varvec{x}^\intercal \Sigma ^-\varvec{y}\). The Mahalanobis inner product thus considers the correlation and variance between two vectors, which will be useful in finding linearly independent discriminants.
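As an illustration of these conventions, the following numpy sketch (ours, with arbitrary example covariances) draws from a matrix-variate normal distribution via its Kronecker characterization and evaluates a Mahalanobis inner product:

import numpy as np

rng = np.random.default_rng(0)

def sample_matrix_normal(M, Sigma, Psi, rng):
    # Draw X ~ N_{m x n}(M, Sigma, Psi), i.e. vec(X) ~ N(vec(M), Psi ⊗ Sigma),
    # using X = M + A Z B^T with A A^T = Sigma and B B^T = Psi (Cholesky factors).
    A = np.linalg.cholesky(Sigma)
    B = np.linalg.cholesky(Psi)
    Z = rng.standard_normal(M.shape)
    return M + A @ Z @ B.T

def mahalanobis_inner(x, y, Sigma):
    # <x, y>_Sigma = x^T Sigma^- y, with a pseudo-inverse in place of the inverse.
    return x @ np.linalg.pinv(Sigma) @ y

M = np.zeros((3, 2))
Sigma = np.eye(3)
Psi = np.array([[1.0, 0.3], [0.3, 1.0]])
X = sample_matrix_normal(M, Sigma, Psi, rng)
vecX = X.ravel(order="F")   # vec(X): column stacking; its covariance is np.kron(Psi, Sigma)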
Formally, we introduce the available short multivariate time-series training data as follows. We consider a class \(\mathcal {C}_{i}\) of samples in which, for any sample j, \(F_{ij}\) features at \(T_{ij}\) time points have been measured: features \(f^{i j}_{1}\), \(f^{i j}_{2}\), \(\dots\), \(f^{i j}_{F_{ij}} \in \mathbb {N}^*\) at times \(t^{i j}_{1}\), \(t^{i j}_{2}\), \(\dots\), \(t^{i j}_{T_{ij}} \in \mathbb {N}^*\), for \(i=1,2,\dots ,K\) and \(j \in \mathcal {C}_{i}\). The matrix of measurements of dimension \(T_{ij} \times F_{ij}\) for sample j in class i is denoted \(Y_{ij}\). The set of distinct features measured across all samples is denoted \(F\), and the set of all sampling time points is T, that is
Note that, since the data is sampled irregularly, not all samples have been observed at common time points or with common features. Thus, the matrices \(Y_{ij}\) will in general have different dimensions. Finally, the number of samples in class i is \(|\mathcal {C}_{i}| = m_{i}\).
2 The MUDRA algorithm
In this section, we describe a model that extends FLDA (James and Hastie, 2001) to multivariate functional data, that is, a model of the observations as a noisy linear combination of spline functions which, for the first time, incorporates multiple features. Then, we propose an efficient parameter inference algorithm for this model and illustrate how to use it for classification and dimension reduction.
2.1 A novel multivariate model for time-dependent functional observations
2.1.1 Description of MUDRA
Similarly to the FLDA model introduced by James and Hastie (2001), we consider \(b>0\) B–spline functions of order 3 (Eilers and Marx, 1996) denoted \(s_1, s_2, \dots , s_b\) to generate the space of functionals available to our model in a flexible fashion. Then, for the multivariate observations associated with sample j in class \(\mathcal {C}_{i}\), its spline matrix \(S_{ij}\) of dimensions \(T_{ij}\times b\) is constructed by evaluating the B–spline functions at each time point \(t^{i j}_{k}\)
We describe now how our model deviates from FLDA and integrates multiple features. Let C be the \(F\times F\) identity matrix. Then, we truncate and reorder columns of C depending on the actual observed features for each sample. For sample \(j \in \mathcal {C}_{i}\), define \(C_{ij}\) as the \(F\times F_{ij}\) matrix formed by concatenating the \((f^{i j}_{1})^\text {th}\), \((f^{i j}_{2})^\text {th}\), \(\dots\), \((f^{i j}_{F_{ij}})^\text {th}\) columns of C. We also introduce the spline coefficient matrix \(\eta _{ij}\) of dimensions \(b\times F\). Similarly to the standard LDA model shown in (1),
where \(\mu _{i} \in \mathbb {R}^{b \times F}\), \(\gamma _{ij}\sim \mathcal {N}_{b\times F}(0,\Sigma ,\Psi )\) is the autoregressive noise, and \(\Sigma\) and \(\Psi\) are both positive definite matrices. Finally, let \(\epsilon _{ij}\) be the random matrix of measurement errors on sample \(j \in \mathcal {C}_{i}\). We assume that it follows a matrix–variate normal distribution, i.e. \(\epsilon _{ij}\sim \mathcal {N}_{T_{ij}\times F_{ij}}(0,\sigma ^2I,I)\) for some \(\sigma >0\). We then propose the following model of the multivariate time-dependent observations from sample j in class i
We could leave our model at that, and go on inferring the values of parameters \(\mu _{i}\) for \(i=1,2,\dots ,K\), \(\sigma\), \(\Sigma\), and \(\Psi\). However, incorporating a reduced–rank regression into the model, by adding a supplementary rank constraint on the model parameters, leads to more interpretable models and easier dimension reduction (Izenman, 2008). Motivated by this reduced rank assumption, we fix the row rank r of our parameter space. Each class–specific mean parameter \(\mu _{i}\) satisfies
where \(\lambda _0\) is the \(b\times F\) matrix of spline coefficients for the population mean functional, the \((\alpha _i)_{i \le K}\) are the class–specific reduced–rank spline coefficient matrices of dimensions \(r \times r\), and \(\Lambda \in \mathbb {R}^{b \times r}\) and \(\xi \in \mathbb {R}^{r\times F}\) allow the full–rank spline coefficients for each class to be reconstructed from the \(\alpha _i\)’s. Thus, the final multivariate time–dependent model for \(i=1,2,\dots ,K\) and \(j=1,2,\dots ,m_i\) is
where \(\epsilon _{ij}\sim \mathcal {N}_{T_{ij}\times F_{ij}}(0,\sigma ^2I,I)\), \(\gamma _{ij}\sim \mathcal {N}_{b\times F}(0,\Sigma ,\Psi )\), with parameters \(\lambda _0\), \(\Lambda\), \((\alpha _i)_{i \le K}\), \(\xi\), \(\sigma\), \(\Sigma\), \(\Psi\) and hyperparameters b and r, where \(r \le b\). Model (3) handles fragmented curves and misaligned observation times, and accommodates measurement errors without imputation. The resulting decision function enables curve classification based on a representation of low dimension r. Class–specific characteristics are captured by the coefficients \((\alpha _i)_{i \le K}\).
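For concreteness, the following numpy/scipy sketch assembles one observation according to model (3); all numerical values, the uniform clamped knot placement, and the cubic degree are illustrative choices of ours, and the parameters are drawn at random rather than estimated:

import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(0)
b, F, r = 10, 4, 2                 # number of basis splines, features, reduced rank

def spline_design_matrix(times, n_basis, degree=3, t_min=0.0, t_max=1.0):
    # T_ij x b matrix S_ij: each column is one B-spline basis function evaluated
    # at the (possibly irregular) sampling times of one sample.
    interior = np.linspace(t_min, t_max, n_basis - degree + 1)[1:-1]
    knots = np.concatenate([[t_min] * (degree + 1), interior, [t_max] * (degree + 1)])
    S = np.empty((len(times), n_basis))
    for j in range(n_basis):
        c = np.zeros(n_basis); c[j] = 1.0          # pick out the j-th basis spline
        S[:, j] = BSpline(knots, c, degree, extrapolate=False)(times)
    return np.nan_to_num(S)                        # values outside [t_min, t_max] -> 0

# one sample j of class i, observed at 3 time points for features {0, 2} only
times_ij = np.array([0.1, 0.45, 0.8])
S_ij = spline_design_matrix(times_ij, b)           # (T_ij, b)
C_ij = np.eye(F)[:, [0, 2]]                        # (F, F_ij): truncated/reordered columns of I_F

# illustrative parameter values (in MUDRA these are estimated by the ECM algorithm)
lambda0 = rng.standard_normal((b, F))              # population-mean spline coefficients
Lam     = rng.standard_normal((b, r))              # shared row factor Lambda
alpha_i = rng.standard_normal((r, r))              # class-specific reduced-rank coefficients
xi      = rng.standard_normal((r, F))              # shared column factor xi
mu_i = lambda0 + Lam @ alpha_i @ xi                # class-specific mean coefficients, (b, F)

gamma_ij = rng.standard_normal((b, F))             # stands in for gamma_ij ~ N_{b x F}(0, Sigma, Psi)
eps_ij   = 0.1 * rng.standard_normal((len(times_ij), C_ij.shape[1]))   # measurement error
Y_ij = S_ij @ (mu_i + gamma_ij) @ C_ij + eps_ij    # observed (T_ij, F_ij) fragment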
However, in this model, \(\lambda _0, \Lambda\), \(\alpha _i\) and \(\xi\)’s are interdependent, making parameter estimation harder. To solve this issue, we additionally enforce that \(\Lambda \text { and }\xi ^\intercal\) are orthogonal matrices and \((\alpha _i)_i\) are normalized:
As previously mentioned, fitting the model described in (3) boils down to estimating the parameters \(\lambda _0,\Lambda , \alpha _i, \xi , \Sigma , \Psi\) and \(\sigma\). We describe in Algorithm 1 an expectation/conditional–maximization (ECM) algorithm (Meng and Rubin, 1993) for parameter inference. The ECM algorithm is an extension of the well–known expectation/maximization (EM) algorithm for the iterative maximization of the joint log–likelihood over the model parameters, which handles missing values in the E–step. This algorithm decouples the inference of most of the parameters across classes \(i=1,2,\dots ,K\) and samples \(j \in \mathcal {C}_{i}\). If \(\gamma _{ij}\sim \mathcal {N}_{b\times F}(0,\Sigma ,\Psi )\), then thanks to the properties of matrix–variate normal distributions, \(S_{ij}\gamma _{ij}C_{ij}\sim \mathcal {N}_{T_{ij}\times F_{ij}}(0,S_{ij}\Sigma S_{ij}^\intercal ,C_{ij}^\intercal \Psi C_{ij})\). Let us then denote \(\Sigma _{ij}' \triangleq S_{ij}\Sigma S_{ij}^\intercal\) and \(\Psi _{ij}' \triangleq C_{ij}^\intercal \Psi C_{ij}\), which are assumed to be positive definite matrices. Therefore, from (3),
where the mean matrix is \(\mu ^Y_{i,j} \triangleq S_{ij}(\lambda _0+\Lambda \alpha _i\xi )C_{ij}\) and the covariance matrix \(\Sigma ^Y_{ij} \triangleq \sigma ^2I_{T_{ij}F_{ij}}+\Psi _{ij}'\otimes \Sigma _{ij}'\). In this case, the joint log–likelihood \(\ell (\lambda _0, \Lambda , (\alpha _i)_i, \xi , \Sigma , \Psi , \sigma )\) on the parameters can be written as
Algorithm 1 aims at maximizing \(\ell (\lambda _0, \Lambda , (\alpha _i)_i, \xi , \Sigma , \Psi , \sigma )\) without computing any Kronecker-product inverses, so as to remain efficient. It relies on alternately optimizing each parameter. Most prominently, this algorithm features the tensor decomposition algorithm PARAFAC (Bro, 1997) at Line 11. The Sylvester equation at Line 8 of Algorithm 1 is solved using the algorithm described in Bartels and Stewart (1972). The equation at Line 9 is solved by the generalized minimal residual (GMRES) method (Saad and Schultz, 1986). Further details are in Appendix A.
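These three numerical building blocks are available in standard Python libraries; the sketch below (ours, on arbitrary toy matrices) shows a Sylvester solve with scipy's Bartels-Stewart wrapper, a matrix-free GMRES solve of an equation of the form \(\sum _i A_iXB_i=C\), and a rank-r PARAFAC decomposition via the CPTensor interface of recent Tensorly versions (the mapping from factors to \(\hat{\Lambda }\), \(\hat{\xi }\) and the diagonal \(\hat{\alpha }_i\)'s follows Appendix A.2.1):

import numpy as np
from scipy.linalg import solve_sylvester
from scipy.sparse.linalg import LinearOperator, gmres
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(1)

# (Line 8) Sylvester equation A X + X B = Q, solved by the Bartels-Stewart algorithm;
# no Kronecker product is ever formed.
A = rng.standard_normal((5, 5)); A = A @ A.T + 5 * np.eye(5)
B = rng.standard_normal((3, 3)); B = B @ B.T + 5 * np.eye(3)
Q = rng.standard_normal((5, 3))
X = solve_sylvester(A, B, Q)

# (Line 9) sum_i A_i X B_i = C, solved matrix-free with GMRES: only the action
# x -> vec(sum_i A_i X B_i) is needed, not sum_i (B_i^T ⊗ A_i) itself.
A_list = [rng.standard_normal((5, 5)) + 5 * np.eye(5) for _ in range(2)]
B_list = [rng.standard_normal((3, 3)) + 5 * np.eye(3) for _ in range(2)]
C = rng.standard_normal((5, 3))

def matvec(x_vec):
    Xm = x_vec.reshape(5, 3, order="F")            # un-vectorize (column stacking)
    return sum(Ai @ Xm @ Bi for Ai, Bi in zip(A_list, B_list)).ravel(order="F")

op = LinearOperator((15, 15), matvec=matvec, dtype=float)
x_vec, info = gmres(op, C.ravel(order="F"))
X_gmres = x_vec.reshape(5, 3, order="F")

# (Line 11) rank-r PARAFAC/CP decomposition of the b x F x K tensor of centred class
# coefficients; the normalized factors give Lambda-hat, xi-hat^T and the class factor
# whose rows, weighted elementwise, form the diagonal alpha-hat_i.
beta_tensor = tl.tensor(rng.standard_normal((10, 4, 3)))       # illustrative b=10, F=4, K=3
weights, factors = parafac(beta_tensor, rank=2, normalize_factors=True)
Lam_hat, Xi_hatT, C_fac = factors                              # (b, r), (F, r), (K, r)
alpha_hats = [np.diag(np.asarray(weights) * np.asarray(C_fac)[k]) for k in range(3)]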
2.1.2 MUDRA as an extension to FLDA in a multivariate setting
Model (3) exactly recovers FLDA when \(F_{ij}=|F|=1\) for all \(i=1,2,\dots ,K\) and \(j \in \mathcal {C}_{i}\) (without the additional constraints in (4)). Indeed, in that case, \(C=1\), and the class-i–specific coefficients become \(\alpha _i \xi\).
2.1.3 Selection of hyperparameters in MUDRA
The selection of the two hyperparameters, b (number of spline functions) and r (parameter rank), has a large impact on the quality of the approximation. We assess the model’s performance on different values of r in the experimental study. For the selection of b, there is a tradeoff between the richness of the available space of functionals and the computation cost of the inference.
2.1.4 Theoretical time complexity of MUDRA
MUDRA mainly focuses on reliability in the face of missing data for time–series classification. However, ensuring a tractable algorithm for parameter inference is crucial. In each ECM iteration k, there are (1) \(\sum _{i \le K} m_i\) calls to the Bartels-Stewart algorithm (Line 8), (2) \(K\) calls to the GMRES algorithm (Line 9), (3) one call to the alternating least squares (ALS) minimization procedure for the PARAFAC tensor decomposition at Line 11, and finally, (4) an iterative optimization step at Lines \(14-16\).
1. One call to the Bartels-Stewart algorithm on sample \(j \in \mathcal {C}_{i}\) has an overall computational cost in \(\mathcal {O}(T_{ij}^3+F_{ij}^3+T_{ij}F_{ij}^2+T_{ij}^2F_{ij})\), by computing the Schur decompositions with the algorithm in Golub et al. (1979).
2. GMRES becomes increasingly expensive at each iteration: running k steps of the GMRES method has a time complexity in \(\mathcal {O}(b Fk^2)\). However, in theory, GMRES converges at step \(k \le b \times F\), so in the worst case, any call to GMRES has a time complexity in \(\mathcal {O}(b^2 F^2)\).
3. The PARAFAC decomposition is computed by alternately minimizing over w, \(\hat{\Lambda }\), \(\hat{\xi }\) and \(\varvec{c}\). The number of minimization steps across all parameters is limited to 100 in this function. Each single–parameter minimization is linear in that parameter, so the time complexity is mainly driven by the matrix inversion and product operations.
4. Theoretically, the iterative optimization step at Lines \(14-16\) converges, but no convergence rate is provided. In practice, we stop either after a maximum number of iterations (50 in our experiments) or once the value of \(\hat{\sigma }\) has converged, i.e. at step q of the iteration, \(|\hat{\sigma }_{q-1}^{-1}(\hat{\sigma }_{q-1}-\hat{\sigma }_q)| < 10^{-5}\). The remaining cost of this step then mainly stems from the matrix products.
We also compare the empirical runtimes of MUDRA and one algorithm from the state–of–the–art, ROCKET (Dempster et al., 2020) in the experimental study in Section 3.
2.1.5 Convergence of MUDRA
The convergence of the MUDRA algorithm can be shown as a special case of the convergence for ECM algorithms (Meng and Rubin (1993)), as detailed in Appendix C.
2.1.6 Implementation of MUDRA
In our implementation of MUDRA, we rely on the Bartels-Stewart and GMRES implementations provided by the Python package scipy (Virtanen et al., 2020). Moreover, we use the canonical polyadic decomposition (as opposed to the Tucker decomposition (Tucker, 1966)) to decompose the tensors, computed with the PARAFAC decomposition function from the Python package Tensorly (Kossaifi et al., 2016).
2.2 Classification and dimension reduction with MUDRA
Once the parameter values in (3) have been inferred by running Algorithm 1, an LDA–like approach to classification (1) can be applied to a new sample \(Y \in \mathbb {R}^{T' \times F'}\). Our model allows us to define the density function \(f_i\) of class \(i=1,2,\dots ,K\) and the prior probability \(\pi _i\). In our experiments, we assume a uniform prior, that is, \(\pi _i = 1/K\) for all \(i=1,2,\dots ,K\), but other types of prior might also be considered. Then, the class i(Y) predicted for sample Y is
Let us derive the expression of \(f_i\) from (5). Let us denote \(S_Y\) the spline matrix in (2) associated with time points in Y, \(C_Y\) the corresponding matrix of reordered feature columns, \(M_Y \triangleq \sigma ^2I+(C_Y^\intercal \Psi C_Y)\otimes (S_Y\Sigma S_Y^\intercal )\) and
Then, ignoring constant factors in i and leveraging the computations in Appendix B, \(\log f_i(Y)\) in (6) can be replaced with
Equation (B8) in Appendix B shows that the covariance of \(\hat{\alpha }_Y\) satisfies \(\mathbb {V}[vec(\hat{\alpha }_Y)]=I\). \(\hat{\alpha }_Y\) can be reshaped into an \(r\times r\) matrix, so we obtain an \(r^2\)–dimensional representation of Y.
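As an illustration, this projection can be written as a generalized-least-squares computation; the sketch below follows the expressions of Appendix B (so that the returned \(vec(\hat{\alpha }_Y)\) has identity covariance, consistently with (B8)), but it forms \(M_Y\) explicitly and should be read as an approximation of the idea, not as the efficient implementation used in MUDRA:

import numpy as np
from scipy.linalg import sqrtm

def project_new_sample(Y, S_Y, C_Y, lambda0, Lam, xi, Sigma, Psi, sigma2, r):
    # Low-dimensional representation of a new (possibly fragmented) sample Y,
    # following the generalized-least-squares expressions of Appendix B.
    x = (Y - S_Y @ lambda0 @ C_Y).ravel(order="F")     # vec(Y - S_Y lambda_0 C_Y)
    A = np.kron(C_Y.T @ xi.T, S_Y @ Lam)               # (C_Y^T xi^T) ⊗ (S_Y Lambda)
    M = sigma2 * np.eye(len(x)) + np.kron(C_Y.T @ Psi @ C_Y, S_Y @ Sigma @ S_Y.T)
    G = A.T @ np.linalg.solve(M, A)                    # A^T M^-1 A
    G_half = np.real_if_close(sqrtm(G))
    vec_alpha = np.linalg.pinv(G_half) @ A.T @ np.linalg.solve(M, x)
    return vec_alpha.reshape(r, r, order="F")          # r x r matrix; vec() has identity covariance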
3 Experimental study
We describe in this section the empirical results of MUDRA regarding predictive power, interpretability, and time efficiency. The overall experimental pipeline involves the extraction of features using an algorithm such as MUDRA or ROCKET (Dempster et al., 2020), followed by the classification of the samples from the extracted features with a ridge regression–based classifier with balanced class weights, as shown in Figure 2. In our experiments, we required that the time series be aligned based on the time elapsed since the first observation. However, if there is an experimental timeline, we can align the data points along that timeline and apply MUDRA straightforwardly. For example, if we know when our experiment starts, we can align our sample timelines even if the first observation is missing.
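The following sketch outlines this pipeline, assuming scikit-learn for the classifier (an assumption on our part; the text only specifies a ridge regression-based classifier with balanced weights) and a placeholder extract_features function standing in for MUDRA or ROCKET:

import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def run_pipeline(extract_features, X_train_raw, y_train, X_test_raw):
    # extract_features maps one raw (possibly fragmented) sample to a feature vector;
    # it is a placeholder, not part of either library's API.
    Z_train = np.vstack([extract_features(x) for x in X_train_raw])
    Z_test = np.vstack([extract_features(x) for x in X_test_raw])
    clf = make_pipeline(
        StandardScaler(),
        RidgeClassifierCV(alphas=np.logspace(-3, 3, 10), class_weight="balanced"),
    )
    clf.fit(Z_train, y_train)
    return clf.predict(Z_test)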
We selected the ROCKET algorithm (Dempster et al., 2020) as the baseline for MUDRA. A recent review by Ruiz et al. (2021) showed that ROCKET is currently the best-performing model, both in terms of efficiency and accuracy, for feature transformation in short–time–series classification. However, when data is missing, ROCKET cannot be applied directly. Inspired by Bier et al. (2022), we applied three forms of padding and imputation to the data set prior to running ROCKET (a sketch of these strategies follows the list):
1. “Imputation”–type padding: the recorded data points are aligned according to their sampling times; missing time points and missing features are denoted with 0.
2. “End”–type padding: all missing features are imputed with 0, all measured time points are concatenated, and 0s are appended at the end to obtain time series of the same length.
3. Last Observation Carried Forward (LOCF) imputation: the last observed non-missing value is used to fill in the missing values at later time points.
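A minimal numpy sketch of these three strategies is given below; the data representation (a dictionary mapping (time, feature) pairs to values, or an aligned array with NaNs) is an illustrative assumption of ours, not the format used in the released code:

import numpy as np

def impute_type_padding(sample, all_times, all_features):
    # "Imputation"-type padding: align recorded values on the global time grid,
    # write 0 for every missing time point and missing feature.
    out = np.zeros((len(all_times), len(all_features)))
    for (t, f), v in sample.items():                  # sample: {(time, feature): value}
        out[all_times.index(t), all_features.index(f)] = v
    return out

def end_type_padding(sample, observed_times, all_features, max_len):
    # "End"-type padding: impute missing features with 0, concatenate the measured
    # time points, then append zero rows up to a common length.
    rows = [[sample.get((t, f), 0.0) for f in all_features] for t in observed_times]
    rows += [[0.0] * len(all_features)] * (max_len - len(rows))
    return np.array(rows)

def locf_imputation(aligned):
    # LOCF on an aligned (time x feature) array with NaNs marking missing entries;
    # entries before the first observation are set to 0 (an illustrative convention).
    out = aligned.copy()
    for j in range(out.shape[1]):
        last = 0.0
        for i in range(out.shape[0]):
            if np.isnan(out[i, j]):
                out[i, j] = last
            else:
                last = out[i, j]
    return out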
The experiments along with the environment configuration and package versions used for ROCKET and MUDRA are available in the GitHub repository (https://github.com/rbordoloi/MUDRA). The reported experiments were run on a CPU cluster (configuration: processor AMD EPYC 7763 64-Core, 20 cores @2.40GHz, RAM 8GB). None of the algorithms used a GPU.
3.1 Time–series classification on a synthetic data set
We built a data set with two features, 12 time points, two classes, and 300 samples, as described in Appendix D. For this experiment, we set \(b=10\) spline basis functions as a tradeoff between performance and computational cost. To assess the predictive power of our model, we collected the \(\mathcal {L}^2\)-norm of the functional \(R^2\) values (Ramsay and Silverman, 2005) for each class as a goodness-of-fit measure, and the \(F_1\) scores for the corresponding classification task, for multiple values of r. \(R^2\) values and \(F_1\) scores are reported in Figure 3a and Figure 3b, respectively.
We observe that the approximate \(\mathcal {L}^2\)-norms of the \(R^2\) values show a general increasing trend: an increase in the number of class-specific parameters results in a corresponding increase in the goodness-of-fit. Similarly, Figure 3b shows an increase in the classification accuracy in terms of \(F_1\) score as the output dimensionality increases. For low dimensionalities (\(r\le 4\)), we observe that MUDRA outperforms ROCKET (with Last Observation Carried Forward imputation) combined with a ridge regression-based classifier. Figure 4 shows that MUDRA is able to retrieve the true functionals of the synthetic data set.
3.2 Time–series dimension reduction on real-life data
We also applied the MUDRA algorithm to the real-world data set “Articulary Word Recognition” from Dau et al. (2019). This data set compiles measurements of the movements of the tongue and lips of native English speakers pronouncing 25 words. The time-varying tongue and lip movements were measured by an electromagnetic articulograph with 12 sensors returning 3D coordinates at a sampling rate of 200 Hz during the pronunciation of each word. Each word represents one class, and the goal is to identify the time-dependent movements associated with each word (Wang et al., 2013). The training set contains 275 series/samples and the testing set 300. In this use case, we restrict ourselves to 9 dimensions corresponding to the outputs from three sensors: one on the tongue tip, one on the upper lip, and the last on the lower lip (Shokoohi-Yekta et al., 2017).
Further details about the simulation of missing data are provided in Appendix D. We applied MUDRA or ROCKET to perform a dimension reduction on the data set. The resulting features were fed to a ridge regression-based classifier. Such a procedure allows us to test how informative the resulting representations are, which is a first step towards interpretability. We also classified the data directly through the Bayes–optimal classifier described in Section 2.2. Since ROCKET always reduces to an even number of dimensions whereas MUDRA reduces to \(r^2\) dimensions, we set the ROCKET output dimension to the smallest even number larger than \(r^2\) to ensure a fair comparison. The two plots in Figure 5 report the \(F_1\) scores obtained for the classification tasks for multiple values of the output dimension r, on complete data (i.e. without any missing data) and on missing data, where the procedure in Appendix D has been applied to the data set. Figure 5a shows that MUDRA competes with ROCKET for very low–dimensional representations, but ROCKET overtakes MUDRA as the output dimension increases.
Figure 5b shows that MUDRA clearly outperforms ROCKET in all of these cases.
Fig. 5: ROCKET versus MUDRA with a ridge regression–based classifier on the “Articulary Word Recognition” data set with \(b=9\) and for \(r \in [1,9]\). Left: MUDRA versus ROCKET on complete data. The blue curve corresponds to the Bayes–optimal rule in MUDRA, the orange one to the dimension reduction in MUDRA combined with a ridge regression classifier, and the green curve to ROCKET–transformed features fed to the classifier. Right: MUDRA versus ROCKET on missing data. The blue curve corresponds to the Bayes–optimal rule in MUDRA, the orange one to the dimension reduction in MUDRA combined with a ridge regression classifier, the green (resp., red) curve to ROCKET–transformed features with “imputation” (resp., “end”)–type padding fed to the classifier, and the purple curve to ROCKET-transformed features with LOCF imputation (Color figure online)
3.3 Empirical runtime of the MUDRA algorithm
To assess the empirical time complexity of MUDRA, we collected the runtimes over \(N=1,000\) iterations of both MUDRA and ROCKET combined with a ridge classifier on the real–life data set (without missing data, \(r=7\), \(b=9\)). Resulting runtimes are reported in Figure 6.
As evidenced by Figure 6, the current implementation of MUDRA is approximately 12 times slower on average than ROCKET. Since ROCKET is a simple algorithm that shifts most of the burden of inference to the subsequent classification algorithm, it is extremely fast. However, as previously mentioned, it lacks some of the predictive power of MUDRA.
4 Conclusion
This work presents a novel approach to dealing with missing data in multivariate functional data, which has historically proven to be a challenging problem for short–time–series classification. Contemporary algorithms that achieve state-of-the-art performance typically resort to padding; however, this strategy proves suboptimal for very short time series with a large proportion of missing data. Our proposed algorithm, MUDRA, uniquely enables model training on selective fragments of a high–dimensional surface without necessitating regularization to reconstruct the entire surface for each sample. This eliminates the need for padding and imputation. Our algorithm outperforms the current state-of-the-art models in cases where a very low–dimensional representation is required and/or a large proportion of the data is missing. Such cases frequently arise in real-life data, especially in biomedical applications. Our approach MUDRA is also interpretable in several ways, in comparison to the state-of-the-art. Firstly, the estimated class functionals can be recovered, which is impossible with other methods for this problem (see Equation (3)). Additionally, since the model is based on a statistical functional regression approach, the estimated parameters are maximum likelihood estimators. This means that one could, in principle, construct confidence intervals and parametric hypothesis tests based on these estimates (see Appendix A). Finally, since the underlying parametric assumptions involve Gaussian distributions, one can also interpret the embeddings generated by the model as clusters and look at new data points based on their closeness to the various cluster centroids (see Section 3.2 on the experiments on the real-life data set “Articulary Word Recognition”).
While our approach demonstrates considerable flexibility and efficiency, it has some shortcomings. Empirically, we show that the MUDRA algorithm sacrifices some time efficiency in favor of interpretability. Improving on the implementation of Algorithm 1 by resorting to faster routines (Nguyen et al., 2016; Song et al., 2022), switching the programming language to C or Fortran, or approximating the solutions of the equations (Wang et al., 2022) could tackle this issue.
On the theoretical side, one issue we could not solve is that of irregularly sampled features within one subject. Consider data being recorded for a sample \(Y_{ij}\) at time points \(t^{i j}_{1}\) and \(t^{i j}_{2}\). If the features recorded for \(Y_{ij}\) at \(t^{i j}_{1}\) are not the same as at \(t^{i j}_{2}\), our model cannot train on that data point without resorting to the imputation or deletion of some of the features. Another significant limitation is the assumption of common covariance matrices \(\Sigma\) and \(\Psi\) for the autoregressive noise in each class, which stems from the LDA standpoint. This is sometimes not representative of real data, for example clinical data, as certain conditions can have a major impact on the variability of the levels of certain metabolites. A quadratic discriminant analysis (QDA) approach might mitigate these issues. However, class-specific covariance matrices raise a number of computational problems in the ECM algorithm and do not lead to the relatively simple expression for the dimension reduction of a new sample. Tackling these shortcomings would improve the application of short–time–series classification algorithms to real-life use cases.
Code availability
The source code for MUDRA (Algorithm 1), along with documentation, is available at the following GitHub repository https://github.com/rbordoloi/MUDRA under a GPL-3 license. The main algorithm is available as a plug-and-play estimator for the sktime (Löning et al., 2019) Python module and can be installed using the Package Installer for Python (pip).
References
Bartels, R. H., & Stewart, G. W. (1972). Algorithm 432 [C2]: Solution of the matrix equation AX + XB = C [F4]. Communications of the ACM, 15(9), 820–826. https://doi.org/10.1145/361573.361582
Bier, A., Jastrzebska, A., & Olszewski, P. (2022). Variable-length multivariate time series classification using ROCKET: A case study of incident detection. IEEE Access, 10, 95701–95715. https://doi.org/10.1109/ACCESS.2022.3203523
Bro, R. (1997). PARAFAC. Tutorial and applications. Chemometrics and Intelligent Laboratory Systems, 38(2), 149–171. https://doi.org/10.1016/S0169-7439(97)00032-4
Dau, H. A., Bagnall, A., Kamgar, K., Yeh, C.-C.M., Zhu, Y., Gharghabi, S., Ratanamahatana, C. A., & Keogh, E. (2019). The UCR time series archive. IEEE/CAA Journal of Automatica Sinica, 6(6), 1293–1305. https://doi.org/10.1109/JAS.2019.1911747
Dempster, A., Petitjean, F., & Webb, G. I. (2020). ROCKET: Exceptionally fast and accurate time series classification using random convolutional kernels. Data Mining and Knowledge Discovery, 34(5), 1454–1495. https://doi.org/10.1007/s10618-020-00701-z
Eilers, P. H. C., & Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11(2), 89–102.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179–188.
Fukuda, T., Matsui, H., Takada, H., Misumi, T., & Konishi, S. (2023). Multivariate functional subspace classification for high-dimensional longitudinal data. Japanese Journal of Statistics and Data Science. https://doi.org/10.1007/s42081-023-00226-x
Gardner-Lubbe, S. (2021). Linear discriminant analysis for multiple functional data analysis. Journal of Applied Statistics, 48(11), 1917–1933. https://doi.org/10.1080/02664763.2020.1780569
Glanz, H., & Carvalho, L. (2013). An expectation-maximization algorithm for the matrix normal distribution. arXiv preprint arXiv:1309.6609. https://doi.org/10.48550/arXiv.1309.6609
Golub, G., Nash, S., & Van Loan, C. (1979). A Hessenberg-Schur method for the problem AX+ XB= C. IEEE Transactions on Automatic Control, 24(6), 909–913.
Graf, R., Zeldovich, M., & Friedrich, S. (2024). Comparing linear discriminant analysis and supervised learning algorithms for binary classification-A method comparison study. Biometrical Journal, 66(1), 2200098.
Izenman, A. J. (2008). Multivariate regression. In A. J. Izenman (Ed.), Modern multivariate statistical techniques: regression, classification, and manifold learning (pp. 159–194). New York, NY: Springer.
James, G. M., & Hastie, T. J. (2001). Functional linear discriminant analysis for irregularly sampled curves. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 63(3), 533–550.
Jebb, A. T., Tay, L., Wang, W., & Huang, Q. (2015). Time series analysis for psychological research: Examining and forecasting change. Frontiers in Psychology, 6, 727.
Kossaifi, J., Panagakis, Y., Anandkumar, A., & Pantic, M. (2016). TensorLy: Tensor learning in Python. arXiv preprint arXiv:1610.09555. https://doi.org/10.48550/arXiv.1610.09555
Lee, D. Y., & Harring, J. R. (2023). Handling missing data in growth mixture models. Journal of Educational and Behavioral Statistics, 48(3), 320–348. https://doi.org/10.3102/10769986221149140
Lines, J., Taylor, S., & Bagnall, A. (2016). HIVE-COTE: The hierarchical vote collective of transformation-based ensembles for time series classification. In 2016 IEEE 16th International Conference on Data Mining (ICDM) (pp. 1041–1046). https://doi.org/10.1109/ICDM.2016.0133
Löning, M., Bagnall, A., Ganesh, S., Kazakov, V., Lines, J., & Király, F. J. (2019). sktime: A unified interface for machine learning with time series. arXiv preprint arXiv:1909.07872.
McLaughlin, C., & Su, L. (2023). FedLDA: Personalized federated learning through collaborative linear discriminant analysis. In International Workshop on Federated Learning in the Age of Foundation Models in Conjunction with NeurIPS 2023. https://openreview.net/forum?id=1ww9tjEQVL
Meng, X.-L., & Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80(2), 267–278. https://doi.org/10.2307/2337198
Middlehurst, M., Large, J., Flynn, M., Lines, J., Bostrom, A., & Bagnall, A. (2021). HIVE-COTE 2.0: A new meta ensemble for time series classification. Machine Learning, 110(11–12), 3211–3243.
Nagar, D. K., & Gupta, A. K. (1999). Matrix variate distributions. New York: Chapman and Hall/CRC.
Nguyen, V.-D., Abed-Meraim, K., & Linh-Trung, N. (2016). Fast adaptive PARAFAC decomposition algorithm with linear complexity. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6235–6239). IEEE.
Ram, N., & Grimm, K. J. (2009). Growth mixture modeling: A method for identifying differences in longitudinal change among unobserved groups. International Journal of Behavioral Development, 33(6), 565–576. https://doi.org/10.1177/0165025409343765
Ramsay, J. O., & Silverman, B. W. (2005). Functional linear models (pp. 217–222). New York, NY: Springer.
Rao, C. R. (1948). The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society. Series B (Methodological), 10(2), 159–203.
Ruiz, A. P., Flynn, M., Large, J., Middlehurst, M., & Bagnall, A. (2021). The great multivariate time series classification bake off: A review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, 35(2), 401–449. https://doi.org/10.1007/s10618-020-00727-3
Saad, Y., & Schultz, M. H. (1986). GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 7(3), 856–869. https://doi.org/10.1137/0907058
Shokoohi-Yekta, M., Hu, B., Jin, H., Wang, J., & Keogh, E. (2017). Generalizing dtw to the multi-dimensional case requires an adaptive approach. Data Mining and Knowledge Discovery, 31, 1–31.
Song, Y., Sebe, N., & Wang, W. (2022). Fast differentiable matrix square root. arXiv preprint arXiv:2201.08663. https://doi.org/10.48550/arXiv.2201.08663
Spliid, H. (2016). Multivariate time series estimation using marima. In Symposium i Anvendt Statistik 2016 (pp. 108–123).
Tan, C. W., Dempster, A., Bergmeir, C., & Webb, G. I. (2022). MultiRocket: Multiple pooling operators and transformations for fast and effective time series classification. Data Mining and Knowledge Discovery, 36(5), 1623–1646.
Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3), 279–311. https://doi.org/10.1007/BF02289464
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., & Bright, J. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17(3), 261–272.
Wang, W. K., Chen, I., Hershkovich, L., Yang, J., Shetty, A., Singh, G., Jiang, Y., Kotla, A., Shang, J. Z., Yerrabelli, R., Roghanizad, A. R., Shandhi, M. M. H., & Dunn, J. (2022). A systematic review of time series classification techniques used in biomedical applications. Sensors, 22(20), 8016. https://doi.org/10.3390/s22208016
Wang, T., Layton, S. K., & Barba, L. A. (2022). Inexact GMRES iterations and relaxation strategies with fast-multipole boundary element method. Advances in Computational Mathematics, 48(3), 32.
Wang, J., Balasubramanian, A., Vega, L.G.M., Green, J.R., Samal, A., & Prabhakaran, B. (2013) Word recognition from continuous articulatory movement time-series data using symbolic representations. In: Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies, pp. 119–127
Yao, F., Müller, H.-G., & Wang, J.-L. (2005). Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association, 100(470), 577–590.
Yoon, J., Zame, W. R., & Schaar, M. (2019). Estimating missing data in temporal data streams using multi-directional recurrent neural networks. IEEE Transactions on Biomedical Engineering, 66(5), 1477–1490. https://doi.org/10.1109/TBME.2018.2874712
Zhu, F., Gao, J., Yang, J., & Ye, N. (2022). Neighborhood linear discriminant analysis. Pattern Recognition, 123, 108422.
Funding
Open Access funding enabled and organized by Projekt DEAL. C.R. has received funding from the European Union’s HORIZON 2020 Programme under grant agreement no. 101102016 (RECeSS, HORIZON MSCA Postdoctoral Fellowships–European Fellowships). O.W. acknowledges support from the German Research Foundation (DFG) FK515800538 (learning convex data spaces).
Author information
Contributions
R.B. designed and wrote the algorithm. O.T. and C.R. helped with calculations and cross-checking for the ECM algorithm. R.B. and C.R. wrote the manuscript and available code. S.B. and O.W. supervised the study and helped writing the manuscript.
Ethics declarations
Conflict of interest
No conflict of interest is declared.
Additional information
Editor: Annalisa Appice.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A The MUDRA algorithm: ECM
To guarantee convergence, we assume that \(S_{ij}\Sigma S_{ij}^\intercal\) and \(C_{ij}^\intercal \Psi C_{ij}\) are invertible for each i and j. Our objective is to maximize
1.1 A.1 The E–step
To simplify calculations, we will drop all indices and look at only one data point from one class, as the aggregate can be calculated by doing similar calculations for all data points and computing the arithmetic mean at the end. Thus the simplified model can be written as
where S and C are \(T\times b\) and \(F\times n\) matrices respectively. The idea is to compute the expectation of \(S\gamma C=\gamma '\), instead of that of \(\gamma\) itself as in James and Hastie (2001), and to write Q in terms of \(\gamma '\). This will allow us to simplify and use the Bartels and Stewart (1972) algorithm to find the expectation without computing any tensor products. The log-likelihoods, ignoring proportionality constants, are as follows.
This is because \(\gamma \sim \mathcal {N}_{b\times F}(0,\Sigma ,\Psi )\implies \gamma '=S\gamma C\sim \mathcal {N}_{T\times n}(0,S\Sigma S^\intercal , C^\intercal \Psi C)\) by Theorem 2.3.10 in Nagar and Gupta (1999). Let us set \(\Sigma '=S\Sigma S^\intercal\) and \(\Psi '= C^\intercal \Psi C\). We know that \(\mathbb {E}[\gamma '|Y=y]=\int \gamma '\frac{f_{Y,\gamma '}(y,\gamma ')}{f_Y(y)}\,d\gamma '\), for all \(y: f_Y(y)>0\). Let us ignore all terms not depending on \(\gamma '\). We will work with the terms in the exponential depending on \(\gamma '\) in log space to avoid numerical instability.
Let \(M=(I+\sigma ^2(\Psi '\otimes \Sigma ')^{-1})\). Clearly, M is symmetric. Then by completing the square inside the term multiplied by \(-\frac{1}{\sigma ^2}\) we have
Clearly, \(vec(\gamma ')|Y\sim N(M^{-1}vec(Y-S\beta C), \sigma ^2M^{-1})\). So
Now, we can find \((\Psi '\otimes I+I\otimes \sigma ^2\Sigma '^{-1})^{-1}vec((Y-S\beta )\Psi '^\intercal )\) by solving the Sylvester equation \(\sigma ^2(\Sigma '^{-1})X+X\Psi '^\intercal =(Y-S\beta )\Psi '^\intercal\) using the Bartels and Stewart (1972) algorithm. Let this value be called \(\mu _{\gamma '}\). Also, let \(\mu _\gamma =\mathbb {E}[\gamma |Y]\) and \(Y'=Y-S\beta C\). From Nagar and Gupta (1999), we know that if \(X\sim \mathcal {N}_{m\times n}(\mu ,\Sigma ,\Psi )\), then \(\mathbb {E}[X^\intercal AX]=\text {tr}\{\Sigma A^\intercal \}\Psi + \mu ^\intercal A\mu\)
With some abuse of notation, we will write \(\gamma '_{ij}\) for \(\hat{\mu }_{{\gamma '}_{ij}}\).
1.2 A.2 The CM–step
Since many of the parameters are interdependent, we will use conditional maximization. For a parameter p, \(\hat{p}\) denotes its estimate at the next step. For convenience, we will write \(\gamma _{ij}'\) for \(\mu _{\gamma _{ij}'}\). Our objective is to maximize
where \(||\cdot ||_F\) denotes the Frobenius norm.
1.2.1 A.2.1 Estimating the linear parameters
Equations of the form \(\sum _{i}A_iXB_i=C\) for X can be solved by the iterative GMRES algorithm (Saad and Schultz, 1986) without actually computing the inverse of \(\sum _{i}(B_i^\intercal \otimes A_i)\). So our task will be to reduce all estimates to this form.
Let \(Y_{ij}'=Y_{ij}-\gamma _{ij}'\) and \(\beta _i=\lambda _0+\Lambda \alpha _i\). In order to estimate each \(\beta _i\), we differentiate \(\sum _j||Y_{ij}'-S_{ij}\beta _i C_{ij}||_F^2\) and set to 0.
This is of the form above, so we solve it to estimate each \(\beta _i\), say \(\hat{\beta }_i\). Then to satisfy (4) we have,
Now, let \(\beta _i'=\hat{\beta }_i-\hat{\lambda }_0\). We know that \(\beta _i'=\hat{\Lambda }\hat{\alpha }_i\hat{\xi }\). Let \(\varvec{\beta }\) denote the 3-way \(b\times F\times K\) tensor formed by stacking the \(\beta '_i\)’s. We apply the PARAFAC algorithm (Bro, 1997) with r factors to \(\varvec{\beta }\) to get the rank-r canonical decomposition, denoted by \(\sum _{i=1}^rw_i(\varvec{\lambda }_i\otimes \varvec{\xi }_i\otimes \varvec{c}_i)\), where \(\varvec{\lambda }_i\), \(\varvec{\xi }_i\) and \(\varvec{c}_i\) are normalized. We easily get the estimates \(\hat{\Lambda }\) and \(\hat{\xi }^\intercal\) subject to (4) by stacking the \(\varvec{\lambda }_i\)’s and \(\varvec{\xi }_i\)’s respectively. Since the \(\hat{\alpha }_i\)’s are diagonal, they can be obtained by elementwise multiplication of the \(w_i\)’s with the \(\varvec{c}_i\)’s, i.e.,
1.2.2 A.2.2 Estimating the variance parameters
\(\Sigma\) and \(\Psi\) will be estimated by the flip-flop procedure, as suggested by Glanz and Carvalho (2013). Firstly, we need an estimate of \(\gamma _{ij}\). We have
We then compute the derivative with respect to \(\Sigma\) and set it to 0.
We perform a similar calculation for the derivative of Q with respect to \(\Psi\). Directly solving these equations will be computationally intensive, so instead, we ignore the quadratic terms and approximate \(\Sigma\) and \(\Psi\) by
Finally, \(\sigma ^2\) can be estimated as
After initially estimating the \(\hat{\gamma }_{ij}\) values, we proceed to iteratively estimate \(\hat{\Sigma }\), \(\hat{\Psi }\), and \(\hat{\sigma }^2\) until they reach convergence.
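The following skeleton (ours) shows only the structure of this flip-flop iteration together with the stopping rule used in the experiments (at most 50 iterations, or a relative change of \(\hat{\sigma }\) below \(10^{-5}\)); the three update callables are placeholders for the closed-form CM updates above, not library functions:

def flip_flop(update_Sigma, update_Psi, update_sigma, Sigma, Psi, sigma,
              max_iter=50, rel_tol=1e-5):
    # update_Sigma / update_Psi / update_sigma are placeholders for the closed-form
    # CM updates derived in this subsection; only the stopping rule is concrete.
    for _ in range(max_iter):
        Sigma = update_Sigma(Psi, sigma)
        Psi = update_Psi(Sigma, sigma)
        new_sigma = update_sigma(Sigma, Psi)
        converged = abs((sigma - new_sigma) / sigma) < rel_tol
        sigma = new_sigma
        if converged:
            break
    return Sigma, Psi, sigma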
Appendix B Calculations for projection
Fix a Y, and let \(\varvec{z} = ((\xi C_Y\otimes \Lambda ^\intercal S_Y^\intercal )M_Y^-(C_Y^\intercal \xi ^\intercal \otimes S_Y\Lambda ))^{-\frac{1}{2}}vec(\hat{\alpha }_Y)=((\xi C_Y\otimes \Lambda ^\intercal S_Y^\intercal )M_Y^{-1}(C_Y^\intercal \xi ^\intercal \otimes S_Y\Lambda ))^-(\xi C_Y\otimes \Lambda ^\intercal S_Y^\intercal )M_Y^{-1}vec(Y-S_Y\lambda _0C_Y)\), \(A=(C^\intercal _Y\xi ^\intercal \otimes S_Y\Lambda )\) and \(\varvec{x}=vec(Y-S_Y\lambda _0C_Y)\). So \(\varvec{z}=(A^\intercal M_Y^{-1}A)^{-1}A^\intercal M_Y^{-1}\varvec{x}\). Then,
as \(M_Y\) is symmetric positive definite. Consider only the inner product term. We have,
Now consider the second norm term. We have,
Therefore,
We will also compute the covariance matrix of \(vec(\hat{\alpha }_Y)\)
Thus,
Appendix C Proof of convergence
Let us partition the parameter vector \(\varvec{\theta }\in \Theta\) as \(\varvec{\theta }^\intercal =[\varvec{\theta }_1^\intercal ,\varvec{\theta }_2^\intercal ,\varvec{\theta }_3^\intercal ,\varvec{\theta }_4^\intercal ,\theta _5]\) where
correspond to the \(S=5\) conditional-maximization (CM) steps in MUDRA. We will denote the space of values that the parameter \(\theta _s\) can take by \(\Theta _s\). Also, let \(\mathbb {R}^\Theta\) denote the Euclidean space in which the manifold \(\Theta\) is embedded. \(\mathbb {R}^{\Theta _s}\) denotes the subspaces in which the submanifolds \(\Theta _s\) are embedded, so \(\mathbb {R}^\Theta =\bigcup _{s=1}^5\mathbb {R}^{\Theta _s}\) and \(\mathbb {R}^{\Theta _i}\cap \mathbb {R}^{\Theta _j}=\{0\}\) if \(i\ne j\). For simplicity, we will assume that the maximum number of iterations used for estimating \(\Sigma\), \(\Psi\) and \(\sigma ^2\) until they converge is 1, but the same proof also works for multiple iterations.
By Theorem 3 in Meng and Rubin (1993), to show that our ECM algorithm converges, it is enough to show the following:
- The constraint functions \(g_s\) associated with the ECM algorithm are differentiable.
- The Jacobian of the constraint functions with respect to \(\varvec{\theta }\), \(\nabla g_s(\varvec{\theta })\), is full rank.
- The column spaces of the Jacobians \(J_s(\varvec{\theta }) \triangleq \{\nabla g_s(\varvec{\theta }) \lambda \mid \lambda \in \mathbb {R}^{\Theta _s} \}\) only intersect at the origin, i.e. $$\begin{aligned} \bigcap _{s=1}^5 J_s(\varvec{\theta })=\{0\}\;. \end{aligned}$$
- All conditional maximizations are unique.
Here, the constraints refer to the parameters that are fixed at each CM step. For instance, in the CM step where we estimate \(\varvec{\theta }_2\) in Algorithm 1 (Lines 13-14), \(\lambda _0\) is fixed, so \(\varvec{\theta }_1=vec(\lambda _0)\) will be one component of the constraint function \(g_2(\varvec{\theta })\). Notice that in our algorithm, at each CM step, the parameter being estimated is never used as a constraint. Thus, \(g_s(\varvec{\theta })=[\varvec{\theta }_i]_{i\ne s}\). Since these are mere projection functions from \(\Theta\) to a lower-dimensional Euclidean space, they are differentiable.
Now consider the Jacobian of one of these constraint functions. The Jacobian of \(g_s\) with respect to \(\varvec{\theta }\) will be a matrix of dimension \(dim(\varvec{\theta })\times dim(g_s(\varvec{\theta }))\). But by our definition of \(g_s\), \(dim(g_s(\varvec{\theta }))=dim(\varvec{\theta })-dim(\varvec{\theta }_s)\). The Jacobian will be a block matrix of the following form
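(The display referred to here is not reproduced above; as an illustration, for \(s=2\), so that \(g_2(\varvec{\theta })\) stacks \(\varvec{\theta }_1, \varvec{\theta }_3, \varvec{\theta }_4\) and \(\theta _5\), the block structure described in the text would read
$$\begin{aligned} \nabla g_2(\varvec{\theta })=\begin{pmatrix} I & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & I & 0 & 0\\ 0 & 0 & I & 0\\ 0 & 0 & 0 & 1 \end{pmatrix}, \end{aligned}$$
with row blocks indexed by \(\varvec{\theta }_1,\dots ,\theta _5\) and column blocks by the components of \(g_2(\varvec{\theta })\).)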
The last element is 1 because \(\theta _5=\sigma ^2\) is a scalar. Clearly, every column block has exactly one identity matrix, so this matrix is full rank.
Consider the decomposition of \(\mathbb {R}^\Theta\) as \(\bigcup _{s=1}^5\mathbb {R}^{\Theta _s}\). By our construction of \(\nabla g_s(\varvec{\theta })\), we observe that its column space is the subspace of \(\mathbb {R}^\Theta\) that is outside the span of \(\mathbb {R}^{\Theta _s}\), i.e. \(J_s(\varvec{\theta })=\text {span}(\mathbb {R}^\Theta \setminus \mathbb {R}^{\Theta _s})\). This is because the \(s^{th}\) row block only has 0 matrices, which means that any element of \(\mathbb {R}^{\Theta }\) which, in its basis representation, has non-zero values for the basis elements of \(\mathbb {R}^{\Theta _s}\) does not lie in the span of the columns of \(\nabla g_s(\varvec{\theta })\). Thus,
This satisfies all of the above conditions, hence our ECM algorithm converges.
Appendix D Experimental details
1.1 D.1 Synthetic data set
To generate simulated data for our study, we selected \(F=2\) features, \(T=12\) time points, \(K=3\) classes, and \(m_i=100\) samples per class. A library of six functions with range [0, 1] was created, and each feature-class pair was randomly assigned one function. The \(\Sigma\) (of dimensions \({T\times T}\)) and \(\Psi\) (of dimensions \(F\times F\)) matrices were randomly generated, and they were employed to produce autoregressive noise following the matrix-variate normal distribution with a mean matrix of 0, row covariance \(\Sigma\), and column covariance \(\Psi\). Additionally, measurement errors were generated by sampling from a matrix-variate normal distribution with a mean matrix of 0 and identity covariance matrix. Samples were drawn from these matrix-variate normal distributions and added to each selected function K times to generate each sample.
To simulate missing data, for each sample, an integer \(T_{ij}\) was randomly sampled from the range of 1 to 11, representing the selected number of time points to retain. Similarly, an integer \(F_{ij}\) was randomly chosen from the range of 1 to 2, indicating the chosen number of features to retain. The \(T_{ij}\) time points and \(F_{ij}\) features were randomly selected for each sample, and the rest of the data points were deleted.
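A numpy sketch of this simulation is given below; the six library functions and the noise scales are illustrative stand-ins of ours, since only their range is specified above:

import numpy as np

rng = np.random.default_rng(0)
F, T, K, m_i = 2, 12, 3, 100
times = np.linspace(0.0, 1.0, T)

# library of six [0, 1]-valued functions; each (class, feature) pair gets one of them
library = [np.sin, np.cos, lambda t: t, lambda t: t ** 2, lambda t: np.sqrt(t),
           lambda t: np.exp(t - 1.0)]
assignment = rng.integers(0, len(library), size=(K, F))

# random row/column covariances for the matrix-variate autoregressive noise
A = rng.standard_normal((T, T)); Sigma = A @ A.T / T + 0.1 * np.eye(T)
B = rng.standard_normal((F, F)); Psi = B @ B.T / F + 0.1 * np.eye(F)
L_S, L_P = np.linalg.cholesky(Sigma), np.linalg.cholesky(Psi)

def one_sample(k):
    mean = np.column_stack([library[assignment[k, f]](times) for f in range(F)])  # (T, F)
    noise = L_S @ rng.standard_normal((T, F)) @ L_P.T      # ~ N_{T x F}(0, Sigma, Psi)
    err = rng.standard_normal((T, F))                      # measurement error, identity covariances
    full = mean + noise + err
    # simulate missingness: keep T_ij random time points and F_ij random features
    T_ij = rng.integers(1, T)                              # 1 .. 11
    F_ij = rng.integers(1, F + 1)                          # 1 .. 2
    kept_t = np.sort(rng.choice(T, size=T_ij, replace=False))
    kept_f = np.sort(rng.choice(F, size=F_ij, replace=False))
    return full[np.ix_(kept_t, kept_f)], kept_t, kept_f

data = [(k, one_sample(k)) for k in range(K) for _ in range(m_i)]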
1.2 D.2 The “Articulary Word Recognition” data set
This data set (Ruiz et al., 2021) comprises rather long time series, with 144 time points. We only retained every \(12^{th}\) time point to shorten the series while preserving the overall trend. Then, we randomly deleted some time points and some features to simulate missing data. The proportion of missing time points is capped at \(50\%\), whereas it is capped at \(55\%\) for missing features.
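A possible reading of this procedure in numpy is sketched below; the exact way the deletion caps are enforced is our assumption:

import numpy as np

rng = np.random.default_rng(0)

def subsample_and_corrupt(series, step=12, max_missing_t=0.5, max_missing_f=0.55):
    # Keep every `step`-th time point of one (T x F) sample, then randomly delete
    # time points (up to 50%) and features (up to 55%) to simulate missing data.
    short = series[::step]                                 # 144 -> 12 time points
    T, F = short.shape
    n_drop_t = rng.integers(0, int(max_missing_t * T) + 1)
    n_drop_f = rng.integers(0, int(max_missing_f * F) + 1)
    kept_t = np.sort(rng.choice(T, size=T - n_drop_t, replace=False))
    kept_f = np.sort(rng.choice(F, size=F - n_drop_f, replace=False))
    return short[np.ix_(kept_t, kept_f)], kept_t, kept_f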
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Bordoloi, R., Réda, C., Trautmann, O. et al. Multivariate functional linear discriminant analysis for partially-observed time series. Mach Learn 114, 80 (2025). https://doi.org/10.1007/s10994-025-06741-0