1 Introduction

Linear discriminant analysis (LDA) is a popular strategy still widely used (Graf et al., 2024; McLaughlin and Su, 2023; Zhu et al., 2022) for classification and dimensionality reduction. The origins trace back to Fisher’s discriminant (Fisher, 1936) and its multiclass extension (Rao, 1948). The main assumption of LDA is that samples of class \(i=1,2,\dots ,K\) follow a multivariate normal distribution of mean \(\mu _{i} \in \mathbb {R}^d\) and shared covariance matrix \(\Sigma \in \mathbb {R}^{d\times d}\) of full rank, denoted \(\mathcal {N}_{d}(\mu _{i}, \Sigma )\). Then the Bayes optimal rule for classification is applied to predict the class \(i(\varvec{x})\) of a new sample \(\varvec{x} \in \mathbb {R}^d\)

$$\begin{aligned} i(\varvec{x})&\triangleq \mathop {\mathrm {arg\,max}}\limits _{i \in \{1,2,\dots ,K\}} \mathbb {P}(C=i | X=\varvec{x})\;. \end{aligned}$$

By Bayes' theorem, \(\mathbb {P}(C=i | X=\varvec{x}) \propto \mathcal {N}_{d}(\varvec{x}; \mu _{i}, \Sigma )\pi _i\), where \(\pi _i\) is the prior probability of class i. Taking logarithms, reordering, and ignoring terms that are constant across classes, one obtains a decision function which is linear in \(\varvec{x}\)

$$\begin{aligned} i(\varvec{x})&= \mathop {\mathrm {arg\,max}}\limits _{i \in \{1,2,\dots ,K\}} (\mu _{i}^\top \Sigma ^{-1})\varvec{x}-\frac{1}{2}\mu _{i}^\top \Sigma ^{-1}\mu _{i}+\log \pi _i\;. \end{aligned}$$
(1)

Parameters \((\mu _{i})_{i \le K}\), \(\Sigma\), \((\pi _i)_{i \le K}\) can be inferred by maximizing the joint likelihood on a set of training samples.
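For illustration, a minimal sketch of the decision rule (1), assuming the parameters have already been estimated from training data (the helper lda_predict is hypothetical and not part of any library):

```python
# Minimal sketch of the LDA decision rule in Eq. (1), with estimated parameters.
import numpy as np

def lda_predict(x, means, Sigma, priors):
    """x: d-dimensional sample; means: list of K class means mu_i;
    Sigma: shared d x d covariance; priors: list of K prior probabilities pi_i."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = [
        mu @ Sigma_inv @ x - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)
        for mu, pi in zip(means, priors)
    ]
    return int(np.argmax(scores))  # predicted class index (0-based)
```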

The advent of functional linear discriminant analysis (FLDA) (James and Hastie, 2001) makes it possible to handle cases where \(\varvec{x}\) is not simply a \(d\)–dimensional vector, but the output of a time–dependent multivariate function f associated with sample s: \(\varvec{x}_s=f_s(t) \in \mathbb {R}^{d}\). Note that FLDA is restricted to the univariate case, i.e.  when \(d=1\). In practice, one has of course no access to the true function \(f_s: \mathbb {N}^* \rightarrow \mathbb {R}^{d}\), but only to some of its values \(f_s(t^s_1), f_s(t^s_2), \dots , f_s(t^s_{n_s})\) at a small number \(n_s\) of time points \(t^s_1\), \(t^s_2\), \(\dots\), \(t^s_{n_s}\). Such short time series appear in many practical scenarios.

Fig. 1 Example of a short multivariate time series with irregular sampling intervals and missing data. There are 3 classes with 1 sample in each class. Each colour corresponds to a class, and each plot represents a single feature. Some features or time points might be missing across samples. For instance, Feature 2 is not measured for the sample in Class 2, and Feature 1 and Feature 2 are not measured at the same time points for the samples in Class 1 and Class 2

For instance, think of medical check-ups, clinical trials, or psychological studies, where patients (samples) from different conditions (classes: diseased, treated with a drug, healthy controls, ...) are surveyed for several variables (features: electrocardiogram measurements, concentration of a specific protein, ...) sparsely over weeks or months. Ignoring that those values come from the same function and simply concatenating values across all time points leads to high–dimensional data that is strongly correlated across time points. This, in turn, causes numerical issues that prevent the estimation of the covariance matrix \(\Sigma\).

Another challenge is that sampling is irregular and sparse across samples (i.e.  the number of time points \(n_s\) is small, and both \(n_s\) and the sampling times \((t^s_i)_i\) differ across samples s). Moreover, different sets of features might be measured in two distinct samples, meaning that even the value of \(d\) can vary across samples. These challenges are exemplified in Figure 1. Notably, in this example, there is substantial missing data, such as Feature 2 in Class 1 and Class 2, making traditional padding and imputation approaches impractical (Bier et al., 2022).

1.1 Related work

A first line of related research is functional data analysis (Yao et al., 2005). Time–series observations from sample s are considered as noisy samples from the underlying function \(f_s\). This approach is particularly promising when one is interested in the overall shape of the data. For example, for medical longitudinal data (Wang et al., 2022) or data arising from psychological studies (Jebb et al., 2015), one is more interested in trends or the overall pattern of the entire cohort, rather than in individual fluctuations and noise sources. In many practical scenarios, the training data is collected from a large number of subjects at a small number of time points (Yoon et al., 2019). The grouping of subjects is based on the similarity of the curves or trends they display. The learned model can then be used as a classifier to predict class membership for data from new subjects not in the training data set. In such cases, conventional LDA as shown in (1) cannot be directly applied, and time–interval discretization is required. James and Hastie (2001) addressed this issue by proposing a model which approximates observations as a noisy linear combination of spline curves, and applies an LDA–like inference procedure to the linear combination coefficients. However, this model only holds for univariate functional data. Applying FLDA individually to each curve is suboptimal due to the high dimension of data points and the fact that FLDA is not designed to estimate inter–feature correlations. Statistical models tailored to multivariate functional data might overcome this issue while preserving interpretability. Similarly to our work, Gardner-Lubbe (2021) presented an extension of FLDA to multivariate data. However, this approach completely ignores the issue of missing features. An alternative method proposed by Fukuda et al. (2023) involves the regularization of the underlying time–dependent functions, under the assumptions of regular sampling times and non–missing features. It thus proves impractical in our use case.

In the last decade, machine learning models have also been developed to tackle multivariate time–series classification (TSC). One instance is the autoregressive integrated moving average model applied to multivariate data (marima) (Spliid, 2016). However, marima is only suitable for long time series. The best–performing approaches for short TSC are today ensemble–of–classifiers methods, such as HIVE–COTE 1.0 (Lines et al., 2016) and 2.0 (Middlehurst et al., 2021), and kernel methods like ROCKET (Dempster et al., 2020) and one of its most recent extensions, MultiRocket (Tan et al., 2022). HIVE–COTE algorithms combine the class probabilities returned by four different types of classifiers through a weighting step, which outputs the consensus probabilities, whereas ROCKET–like approaches apply a large number of random convolution–based transformations to create a massive set of features. A linear classifier is then trained on those transformed features. Nonetheless, the ensemble architecture of HIVE–COTE and the random transformations of ROCKET might hinder the interpretability of the class probability outputs. Moreover, ROCKET–like methods require padding or imputation for handling missing data, which worsens the issue of high–dimensional data points and further impedes interpretability.

Finally, another line of research focuses on the growth mixture model (Ram and Grimm, 2009), which describes unobserved subpopulations as a noisy time–dependent linear combination of two latent variables. This approach is indeed adept at recognizing trends in longitudinal data, but the original model is ill–equipped to handle missing data. The literature on the topic only partially solves this problem: the issue of irregular sampling time points has been tackled in the prior works reviewed by Lee and Harring (2023), but even so, the proposed advancements cannot handle multiple, highly correlated, and possibly missing features.

1.2 Contributions

We introduce a multivariate functional linear discriminant analysis algorithm called MUDRA to tackle these issues. Our approach readily extends the scope of functional linear discriminant analysis described by James and Hastie (2001) to multivariate and missing data.

MUDRA only requires that, for each time–series sample in the data set, if the sample has any measurement for a given feature F, then F is also measured at every time point at which the sample has measurements. A negative example would be a data set containing a sample s with two features, Feature 1 and Feature 2, where Feature 1 is measured at time points \(t^s_1\) and \(t^s_3\), and Feature 2 is only measured at time points \(t^s_2\) and \(t^s_3\). To make this example comply with our requirement, either Feature 1 should be measured at \(t^s_2\) and \(t^s_3\) only, or Feature 2 should be measured at \(t^s_1\) and \(t^s_3\) only.

By approaching the problem from the standpoint of reduced–rank regression, we were able to incorporate both time– and feature–based irregularity without imputation. MUDRA generates a low–dimensional representation, based on the PARAFAC (Bro, 1997) tensor decomposition algorithm, that is particularly efficient for data sets with missing data and yields interpretable patterns. It can also reconstruct the mean time–dependent function across all samples from a class. We experimentally show the superior predictive power of MUDRA compared to baselines for time–series classification on a synthetic data set and on the real–life “Articulary Word Recognition” data set, in the presence of a large amount of missing data and when a low dimension suffices to capture the information in the data. When there is a large amount of missing data, a common occurrence in real–life applications, MUDRA consistently outperforms ROCKET for both low and high dimensionality.

1.3 Notation

If X is an \(m\times n\) matrix, we will denote the rows and columns of X as \(\textbf{x}_{1,:}, \textbf{x}_{2,:},\ldots ,\textbf{x}_{m,:}\) and \(\textbf{x}_{:,1}, \textbf{x}_{:,2},\ldots ,\textbf{x}_{:,n}\), respectively. The \(ij^{th}\) element of X will be denoted by \(x_{ij}\). If the \(\varvec{x}_t\) are m–dimensional vectors, the \(i^{th}\) component of \(\varvec{x}_t\) is denoted by \(x_{t,i}\). A general pseudo–inverse of X is denoted by \(X^-\); recall that if X is square and non–singular, the general pseudo–inverse is equal to the inverse: \(X^-=X^{-1}\). We denote the identity matrix by I regardless of its dimension. The dimensionality of a vector \(\varvec{x}\) will be denoted by \(\dim (\varvec{x})\).

For an \(m\times n\) matrix X, vec(X) denotes the vectorization of X as an mn dimensional vector, formed by stacking all of the columns of X sequentially, i.e. 

$$\begin{aligned} vec(X) \triangleq \begin{bmatrix} \textbf{x}_{:,1}^\intercal & \textbf{x}_{:,2}^\intercal & \cdots & \textbf{x}_{:,n}^\intercal \end{bmatrix}^\intercal \;. \end{aligned}$$

The Kronecker product of two matrices \(A_{m\times n}\) and \(B_{p\times q}\) is an \({mp\times nq}\) matrix, defined by

$$\begin{aligned} A\otimes B \triangleq \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B\\ a_{21}B & a_{22}B & \cdots & a_{2n}B\\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B\\ \end{bmatrix}\;. \end{aligned}$$

We will use Kronecker products extensively to turn tensor computations into simpler matrix computations. A random \(m\times n\) matrix X is said to follow a matrix–variate normal distribution with mean matrix \(M_{m\times n}\), row covariance matrix \(\Sigma _{m\times m}\) and column covariance matrix \(\Psi _{n\times n}\), if \(vec(X)\sim \mathcal {N}_{mn}(vec(M), \Psi \otimes \Sigma )\) (Gupta and Nagar, 1999). It is denoted by \(X\sim \mathcal {N}_{m\times n}(M, \Sigma , \Psi )\).
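As an illustration of how Kronecker products turn tensor computations into matrix computations, the following minimal numpy sketch checks the identity \(vec(AXB)=(B^\intercal \otimes A)\,vec(X)\) and draws one sample from a matrix–variate normal distribution; all matrices here are synthetic.

```python
# Sketch of the vec/Kronecker identity and of sampling X ~ N_{m x n}(M, Sigma, Psi)
# via vec(X) ~ N(vec(M), Psi ⊗ Sigma).
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 2
A, X, B = rng.normal(size=(m, m)), rng.normal(size=(m, n)), rng.normal(size=(n, n))

vec = lambda M: M.reshape(-1, order="F")            # column-stacking vectorization
# vec(A X B) = (B^T ⊗ A) vec(X)
assert np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X))

# Sampling: if Sigma = L L^T and Psi = R R^T, then M + L Z R^T with Z i.i.d.
# standard normal has vec-covariance Psi ⊗ Sigma.
M = np.zeros((m, n))
Sigma = np.eye(m) + 0.5              # any positive definite row covariance
Psi = 2.0 * np.eye(n)                # any positive definite column covariance
L, R = np.linalg.cholesky(Sigma), np.linalg.cholesky(Psi)
Z = rng.normal(size=(m, n))
X_sample = M + L @ Z @ R.T           # one draw from N_{m x n}(M, Sigma, Psi)
```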

We will denote the variance-covariance matrix of a vector \(\varvec{y}\) as \(\mathbb {V}(\varvec{y})\). For a matrix \(\Sigma\), \(||\varvec{y}||_\Sigma\) will denote the Mahalanobis norm of \(\varvec{y}\) with covariance \(\Sigma\), defined as \(||\varvec{y}||^2_\Sigma \triangleq \varvec{y}^\intercal \Sigma ^-\varvec{y}\). Similarly, for another vector \(\varvec{x}\), \(\langle \varvec{x},\varvec{y}\rangle _\Sigma\) will denote the Mahalanobis inner product with covariance \(\Sigma\), defined as \(\langle \varvec{x},\varvec{y}\rangle _\Sigma \triangleq \varvec{x}^\intercal \Sigma ^-\varvec{y}\). The Mahalanobis inner product thus considers the correlation and variance between two vectors, which will be useful in finding linearly independent discriminants.

Formally, we introduce the available short multivariate time-series training data as follows. We consider a class \(\mathcal {C}_{i}\) of samples in which, for any sample j, \(F_{ij}\) features at \(T_{ij}\) time points have been measured: features \(f^{i j}_{1}\), \(f^{i j}_{2}\), \(\dots\), \(f^{i j}_{F_{ij}} \in \mathbb {N}^*\) at times \(t^{i j}_{1}\), \(t^{i j}_{2}\), \(\dots\), \(t^{i j}_{T_{ij}} \in \mathbb {N}^*\), for \(i=1,2,\dots ,K\) and \(j \in \mathcal {C}_{i}\). The matrix of measurements of dimension \(T_{ij} \times F_{ij}\) for sample j in class i is denoted \(Y_{ij}\). The set of distinct features measured across all samples is denoted \(F\), and the set of all sampling time points is T, that is

$$\begin{aligned} F\triangleq \bigcup _{i \le K} \bigcup _{j \in \mathcal {C}_{i}} \bigcup _{k \le F_{ij}} \{ f^{i j}_{k} \} \quad \text { and } \quad T\triangleq \bigcup _{i \le K} \bigcup _{j \in \mathcal {C}_{i}} \bigcup _{k \le T_{ij}} \{ t^{i j}_{k} \} \;. \end{aligned}$$

Note that, since data is sampled irregularly, samples are generally not observed at common time points or with common features. Thus, the matrices \(Y_{ij}\) will possibly have different dimensions. Finally, the number of samples in class i is \(|\mathcal {C}_{i}| = m_{i}\).

2 The MUDRA algorithm

In this section, we describe a model that extends FLDA (James and Hastie, 2001) to multivariate functional data, that is, a model of the observations as a noisy linear combination of spline functions which, for the first time, incorporates multiple features. We then propose an efficient parameter inference algorithm for this model and illustrate how to use it for classification and dimension reduction.

2.1 A novel multivariate model for time-dependent functional observations

2.1.1 Description of MUDRA

Similarly to the FLDA model introduced by James and Hastie (2001), we consider \(b>0\) B–spline functions of order 3 (Eilers and Marx, 1996), denoted \(s_1, s_2, \dots , s_b\), to generate a flexible space of functionals available to our model. Then, for the multivariate observations associated with sample j in class \(\mathcal {C}_{i}\), the spline matrix \(S_{ij}\) of dimensions \(T_{ij}\times b\) is constructed by evaluating the B–spline functions at each time point \(t^{i j}_{k}\)

$$\begin{aligned} S_{ij} \triangleq \begin{bmatrix} s_1(t^{i j}_{1}) & s_2(t^{i j}_{1}) & \cdots & s_b(t^{i j}_{1})\\ s_1(t^{i j}_{2}) & s_2(t^{i j}_{2}) & \cdots & s_b(t^{i j}_{2})\\ \vdots & \vdots & \ddots & \vdots \\ s_1(t^{i j}_{T_{ij}}) & s_2(t^{i j}_{T_{ij}}) & \cdots & s_b(t^{i j}_{T_{ij}})\\ \end{bmatrix}\;. \end{aligned}$$
(2)
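For illustration, a minimal sketch of how a spline matrix such as \(S_{ij}\) in (2) can be built with scipy, assuming cubic B–splines on a uniform open knot vector (the knot placement and spline degree here are assumptions for the sketch, not necessarily those of our implementation):

```python
# Evaluate b B-spline basis functions at given time points, mirroring Eq. (2).
import numpy as np
from scipy.interpolate import BSpline

def spline_design_matrix(time_points, n_basis=10, degree=3, t_min=0.0, t_max=1.0):
    # Open uniform knot vector: `degree` repeated boundary knots on each side
    inner = np.linspace(t_min, t_max, n_basis - degree + 1)
    knots = np.concatenate(([t_min] * degree, inner, [t_max] * degree))
    S = np.empty((len(time_points), n_basis))
    for k in range(n_basis):
        coeffs = np.zeros(n_basis)
        coeffs[k] = 1.0                  # isolate the k-th basis function s_k
        S[:, k] = BSpline(knots, coeffs, degree)(time_points)
    return S

S_ij = spline_design_matrix(np.array([0.1, 0.35, 0.8]), n_basis=6)  # 3 x 6 matrix
```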

We now describe how our model deviates from FLDA and integrates multiple features. Let C be the \(F\times F\) identity matrix. Then, we truncate and reorder the columns of C depending on the features actually observed for each sample. For sample \(j \in \mathcal {C}_{i}\), define \(C_{ij}\) as the \(F\times F_{ij}\) matrix formed by concatenating the \((f^{i j}_{1})^\text {th}\), \((f^{i j}_{2})^\text {th}\), \(\dots\), \((f^{i j}_{F_{ij}})^\text {th}\) columns of C. We also introduce the spline coefficient matrix \(\eta _{ij}\) of dimensions \(b\times F\). Similarly to the standard LDA model shown in (1),

$$\begin{aligned} \eta _{ij} = \mu _{i} + \gamma _{ij}\;, \end{aligned}$$

where \(\mu _{i} \in \mathbb {R}^{b \times F}\), \(\gamma _{ij}\sim \mathcal {N}_{b\times F}(0,\Sigma ,\Psi )\) is the autoregressive noise, and \(\Sigma\) and \(\Psi\) are both positive definite matrices. Finally, let \(\epsilon _{ij}\) be the random matrix of measurement errors on sample \(j \in \mathcal {C}_{i}\). We assume that it follows a matrix–variate normal distribution, i.e. \(\epsilon _{ij}\sim \mathcal {N}_{T_{ij}\times F_{ij}}(0,\sigma ^2I,I)\) for some \(\sigma >0\). We then propose the following model of the multivariate time-dependent observations from sample j in class i

$$\begin{aligned} Y_{ij} = S_{ij}(\mu _{i} + \gamma _{ij})C_{ij} + \epsilon _{ij}\;. \end{aligned}$$

We could stop here and infer the values of parameters \(\mu _{i}\) for \(i=1,2,\dots ,K\), \(\sigma\), \(\Sigma\), and \(\Psi\). However, incorporating reduced–rank regression into the model, through a supplementary rank constraint on the model parameters, leads to more interpretable models and easier dimension reduction (Izenman, 2008). Motivated by this reduced–rank assumption, we fix the rank r of our parameter space. Each class–specific mean parameter \(\mu _{i}\) satisfies

$$\begin{aligned} \mu _{i} = \lambda _0+\Lambda \alpha _i\xi \;, \end{aligned}$$

where \(\lambda _0\) is the \(b\times F\) matrix of spline coefficients for the population mean functional, the \((\alpha _i)_{i \le K}\) are the class–specific reduced–rank spline coefficient matrices of dimensions \(r \times r\), and \(\Lambda \in \mathbb {R}^{b \times r}\) and \(\xi \in \mathbb {R}^{r\times F}\) allow the full–rank spline coefficients of each class to be reconstructed from the \(\alpha _i\)’s. Thus, the final multivariate time–dependent model for \(i=1,2,\dots ,K\) and \(j=1,2,\dots ,m_i\) is

$$\begin{aligned} Y_{ij}=S_{ij}(\lambda _0+\Lambda \alpha _i\xi +\gamma _{ij})C_{ij}+\epsilon _{ij}\;, \end{aligned}$$
(3)

where \(\epsilon _{ij}\sim \mathcal {N}_{T_{ij}\times F_{ij}}(0,\sigma ^2I,I)\), \(\gamma _{ij}\sim \mathcal {N}_{b\times F}(0,\Sigma ,\Psi )\), with parameters \(\lambda _0\), \(\Lambda\), \((\alpha _i)_{i \le K}\), \(\xi\), \(\sigma\), \(\Sigma\), \(\Psi\) and hyperparameters b and r, where \(r \le b\). Model (3) handles fragmented curves and misaligned observation times, and accommodates measurement errors without imputation. The resulting decision function enables curve classification based on a representation of low dimension r. Class–specific characteristics are captured by the coefficients \((\alpha _i)_{i \le K}\).
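For illustration, the following minimal sketch simulates one observation \(Y_{ij}\) from model (3) with synthetic, hypothetical parameter values (the spline matrix is replaced by a random stand-in for brevity, and the centering constraint on the \(\alpha _i\)'s is not enforced):

```python
# Simulating one Y_ij from model (3) with hypothetical parameters.
import numpy as np

rng = np.random.default_rng(1)
b, F, r = 6, 4, 2                        # spline basis size, total features, rank
T_ij, observed = 3, [0, 2, 3]            # time points and observed feature indices
F_ij = len(observed)

lambda0 = rng.normal(size=(b, F))        # population mean spline coefficients
Lambda = np.linalg.qr(rng.normal(size=(b, r)))[0]     # Lambda^T Lambda = I
xi = np.linalg.qr(rng.normal(size=(F, r)))[0].T       # xi xi^T = I
alpha_i = rng.normal(size=(r, r))                     # class-specific coefficients
Sigma, Psi, sigma = np.eye(b), np.eye(F), 0.1

S_ij = rng.normal(size=(T_ij, b))        # stands in for the spline matrix of Eq. (2)
C_ij = np.eye(F)[:, observed]            # F x F_ij column-selection matrix

gamma_ij = np.linalg.cholesky(Sigma) @ rng.normal(size=(b, F)) @ np.linalg.cholesky(Psi).T
eps_ij = sigma * rng.normal(size=(T_ij, F_ij))
Y_ij = S_ij @ (lambda0 + Lambda @ alpha_i @ xi + gamma_ij) @ C_ij + eps_ij
```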

However, in this model, \(\lambda _0\), \(\Lambda\), the \(\alpha _i\)'s and \(\xi\) are interdependent, which makes parameter estimation harder. To resolve this issue, we additionally enforce that \(\Lambda\) and \(\xi ^\intercal\) have orthonormal columns and that the \((\alpha _i)_i\) are centered:

$$\begin{aligned} \Lambda ^\intercal \Lambda =I, \quad \xi \xi ^\intercal =I, \quad \text {and} \quad \sum _{i=1}^K m_i\alpha _i=0\;. \end{aligned}$$
(4)

As previously mentioned, fitting the model described in (3) boils down to estimating parameters \(\lambda _0,\Lambda , \alpha _i, \xi , \Sigma , \Psi\) and \(\sigma\). We describe in Algorithm 1 an expectation/conditional–maximization (ECM) algorithm (Meng and Rubin, 1993) for parameter inference. The ECM algorithm is an extension of the well–known expectation/maximization (EM) algorithm for the iterative maximization of the joint log–likelihood over the model parameters, which handles missing values in the E–step. This algorithm decouples the inference of most of the parameters across classes \(i=1,2,\dots ,K\) and samples \(j \in \mathcal {C}_{i}\). If \(\gamma _{ij}\sim \mathcal {N}_{b\times F}(0,\Sigma ,\Psi )\), then thanks to the properties of matrix–variate normal distributions, \(S_{ij}\gamma _{ij}C_{ij}\sim \mathcal {N}_{T_{ij}\times F_{ij}}(0,S_{ij}\Sigma S_{ij}^\intercal ,C_{ij}^\intercal \Psi C_{ij})\). Let us then denote \(\Sigma _{ij}' \triangleq S_{ij}\Sigma S_{ij}^\intercal\) and \(\Psi _{ij}' \triangleq C_{ij}^\intercal \Psi C_{ij}\), which are assumed to be positive definite matrices. Therefore, from (3),

$$\begin{aligned} \begin{aligned} vec(Y_{ij})&\sim \mathcal {N}_{T_{ij}F_{ij}}\left( vec(\mu ^Y_{i,j}),\Sigma ^Y_{ij}\right) \;, \end{aligned} \end{aligned}$$
(5)

where the mean matrix is \(\mu ^Y_{i,j} \triangleq S_{ij}(\lambda _0+\Lambda \alpha _i\xi )C_{ij}\) and the covariance matrix \(\Sigma ^Y_{ij} \triangleq \sigma ^2I_{T_{ij}F_{ij}}+\Psi _{ij}'\otimes \Sigma _{ij}'\). In this case, the joint log–likelihood \(\ell (\lambda _0, \Lambda , (\alpha _i)_i, \xi , \Sigma , \Psi , \sigma )\) on the parameters can be written as

$$\begin{aligned} & -\frac{1}{2}\sum _{i=1}^K\sum _{j=1}^{m_i} \big (vec(Y_{ij}-\mu ^Y_{i,j})^\intercal (\Sigma ^Y_{ij})^{-1}vec(Y_{ij}-\mu ^Y_{i,j}) +\log \det (\Sigma ^Y_{ij})+T_{ij}F_{ij}\log (2\pi )\big ). \end{aligned}$$
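For illustration, the contribution of a single sample to this joint log–likelihood can be evaluated directly from (5), as in the following minimal sketch (all inputs are assumed to be given; sample_loglik is a hypothetical helper, not part of our implementation):

```python
# Per-sample contribution to the joint log-likelihood, following Eq. (5).
import numpy as np
from scipy.stats import multivariate_normal

def sample_loglik(Y_ij, S_ij, C_ij, lambda0, Lambda, alpha_i, xi, Sigma, Psi, sigma):
    vec = lambda M: M.reshape(-1, order="F")
    mu_Y = S_ij @ (lambda0 + Lambda @ alpha_i @ xi) @ C_ij      # mean matrix
    Sigma_p = S_ij @ Sigma @ S_ij.T                             # Sigma'_ij
    Psi_p = C_ij.T @ Psi @ C_ij                                 # Psi'_ij
    cov_Y = sigma**2 * np.eye(mu_Y.size) + np.kron(Psi_p, Sigma_p)
    return multivariate_normal.logpdf(vec(Y_ij), mean=vec(mu_Y), cov=cov_Y)
```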
Algorithm 1 MUDRA Algorithm: Model Parameter Inference

Algorithm 1 aims at maximizing \(\ell (\lambda _0, \Lambda , (\alpha _i)_i, \xi , \Sigma , \Psi , \sigma )\) without computing any Kronecker–product inverses, so as to remain efficient. It relies on alternately optimizing each parameter. Most prominently, this algorithm features the tensor decomposition algorithm PARAFAC (Bro, 1997) at Line 11. The Sylvester equation at Line 8 of Algorithm 1 is solved using the algorithm described in Bartels and Stewart (1972), and the equation at Line 9 is solved with the generalized minimal residual (GMRES) method (Saad and Schultz, 1986). Further details are given in Appendix A.
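For illustration, the two scipy routines on which the implementation relies can be called as follows on toy systems; the actual systems solved at Lines 8 and 9 are those derived in Appendix A, not the placeholder matrices used here.

```python
# Toy calls to the Bartels-Stewart (Sylvester) solver and to GMRES.
import numpy as np
from scipy.linalg import solve_sylvester
from scipy.sparse.linalg import gmres, LinearOperator

rng = np.random.default_rng(2)
A, B, Q = rng.normal(size=(4, 4)), rng.normal(size=(3, 3)), rng.normal(size=(4, 3))
X = solve_sylvester(A, B, Q)               # Bartels-Stewart: solves A X + X B = Q
assert np.allclose(A @ X + X @ B, Q)

# GMRES solves M z = v without forming Kronecker-product inverses:
# a LinearOperator only needs matrix-vector products.
M = np.kron(np.eye(3), A @ A.T) + 0.1 * np.eye(12)
op = LinearOperator((12, 12), matvec=lambda z: M @ z)
v = rng.normal(size=12)
z, info = gmres(op, v)                     # info == 0 on convergence
```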

2.1.2 MUDRA as an extension to FLDA in a multivariate setting

(3) exactly retrieves FLDA when \(F_{ij}=|F|=1\) for all \(i=1,2,\dots ,K\) and \(j \in \mathcal {C}_{i}\) (without the additional constraints in (4)). Indeed, in that case, \(C=1\), and the class i–specific coefficients become \(\alpha _i \xi\).

2.1.3 Selection of hyperparameters in MUDRA

The selection of the two hyperparameters, b (number of spline functions) and r (parameter rank), has a large impact on the quality of the approximation. We assess the model’s performance on different values of r in the experimental study. For the selection of b, there is a tradeoff between the richness of the available space of functionals and the computation cost of the inference.

2.1.4 Theoretical time complexity of MUDRA

MUDRA mainly focuses on reliability in the face of missing data for time–series classification. However, ensuring a tractable algorithm for parameter inference is crucial. In each ECM iteration, there are (1) \(\sum _{i \le K} m_i\) calls to the Bartels-Stewart algorithm (Line 8), (2) \(K\) calls to the GMRES algorithm (Line 9), (3) one call to the alternating least squares (ALS) minimization procedure for the PARAFAC tensor decomposition (Line 11), and finally, (4) an iterative optimization step (Lines \(14-16\)).

1. One call to the Bartels-Stewart algorithm on sample \(j \in \mathcal {C}_{i}\) has an overall computational cost in \(\mathcal {O}(T_{ij}^3+F_{ij}^3+T_{ij}F_{ij}^2+T_{ij}^2F_{ij})\), by computing the Schur decompositions with the algorithm in Golub et al. (1979).

2. GMRES becomes increasingly expensive at each iteration: running k steps of the GMRES method has a time complexity in \(\mathcal {O}(b Fk^2)\). However, in theory, GMRES converges at step \(k \le b \times F\), so in the worst case, any call to GMRES has a time complexity in \(\mathcal {O}(b^2 F^2)\).

3. The PARAFAC decomposition is computed by alternately minimizing over w, \(\hat{\Lambda }\), \(\hat{\xi }\) and \(\varvec{c}\). The number of minimization steps across all parameters is limited to 100 in our implementation. Each single–parameter minimization problem is linear in that parameter, so the time complexity is mainly driven by the matrix inversion and product operations.

4. Theoretically, the iterative optimization step in Lines \(14-16\) converges, but no convergence rate is provided. In practice, we stop either after a maximum number of iterations (50 in our experiments) or once the value of \(\hat{\sigma }\) has converged, i.e., at step q of the iteration, \(|\hat{\sigma }_{q-1}^{-1}(\hat{\sigma }_{q-1}-\hat{\sigma }_q)| < 10^{-5}\). The remaining cost of this step mainly stems from matrix products.

We also compare the empirical runtimes of MUDRA and of one state–of–the–art algorithm, ROCKET (Dempster et al., 2020), in the experimental study in Section 3.

2.1.5 Convergence of MUDRA

The convergence of the MUDRA algorithm can be shown as a special case of the convergence of ECM algorithms (Meng and Rubin, 1993), as detailed in Appendix C.

2.1.6 Implementation of MUDRA

In our implementation of MUDRA, we resort to the implementations of the Bartels-Stewart algorithm and of GMRES available in the Python package scipy (Virtanen et al., 2020). Moreover, we use the canonical polyadic tensor decomposition (as opposed to the Tucker decomposition (Tucker, 1966)), computed with the PARAFAC decomposition function of the Python package Tensorly (Kossaifi et al., 2016).
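For illustration, a minimal usage sketch of the PARAFAC decomposition in Tensorly on a synthetic tensor (function names follow a recent Tensorly release and may differ slightly in older versions):

```python
# Canonical polyadic (PARAFAC) decomposition of a synthetic 3-way tensor.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(3)
tensor = tl.tensor(rng.normal(size=(6, 4, 5)))
cp = parafac(tensor, rank=2, n_iter_max=100)   # alternating least squares
weights, factors = cp                          # one factor matrix per mode
reconstruction = tl.cp_to_tensor(cp)           # low-rank approximation of `tensor`
```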

2.2 Classification and dimension reduction with MUDRA

Once the parameter values in (3) have been inferred by running Algorithm 1, an LDA–like approach to classification (1) can be applied to a new sample \(Y \in \mathbb {R}^{T' \times F'}\). Our model defines the density function \(f_i\) of class \(i=1,2,\dots ,K\) and the prior probability \(\pi _i\). In our experiments, we assume a uniform prior, that is, \(\pi _i = 1/K\) for all \(i=1,2,\dots ,K\), but other types of prior might also be considered. Then, the class i(Y) predicted for sample Y is

$$\begin{aligned} \mathop {\mathrm {arg\,max}}\limits _{i \le K}\; \log \mathbb {P}(C=i | X=Y) = \mathop {\mathrm {arg\,max}}\limits _{i \le K}\; \log f_i(Y) + \log \pi _i \;. \end{aligned}$$
(6)

Let us derive the expression of \(f_i\) from (5). Denote by \(S_Y\) the spline matrix in (2) associated with the time points of Y, by \(C_Y\) the corresponding matrix of reordered feature columns, and define \(M_Y \triangleq \sigma ^2I+(C_Y^\intercal \Psi C_Y)\otimes (S_Y\Sigma S_Y^\intercal )\) and

$$\begin{aligned} \begin{aligned} vec(\hat{\alpha }_Y) \triangleq&\big ((\xi C_Y\otimes \Lambda ^\intercal S_Y^\intercal )M_Y^-(C_Y^\intercal \xi ^\intercal \otimes S_Y\Lambda )\big )^{-\frac{1}{2}} (\xi C_Y\otimes \Lambda ^\intercal S_Y^\intercal )M_Y^{-1}vec(Y-S_Y\lambda _0C_Y)\;. \end{aligned} \end{aligned}$$
(7)

Then, ignoring terms that do not depend on i and leveraging the computations in Appendix B, \(\log f_i(Y)\) in (6) can be replaced with

$$\begin{aligned} \log f_i(Y)= & -\frac{1}{2}\big (||vec(Y - S_Y(\lambda _0+\Lambda \alpha _i\xi )C_Y)||^2_{M_Y} + \log \det (M_Y)+T_YF_Y\log (2\pi )\big )\\ \propto & -\frac{1}{2}||vec(Y - S_Y(\lambda _0+\Lambda \alpha _i\xi )C_Y)||^2_{M_Y}\\ \propto & -\frac{1}{2}||vec(\hat{\alpha }_Y)-\big ((\xi C_Y\otimes \Lambda ^\intercal S_Y^\intercal ) M_Y^-(C_Y^\intercal \xi ^\intercal \otimes S_Y\Lambda )\big )^{\frac{1}{2}}vec(\alpha _i)||^2\;. \end{aligned}$$

(B8) in Appendix B shows that the covariance of \(\hat{\alpha }_Y\) satisfies \(\mathbb {V}[vec(\hat{\alpha }_Y)]=I\). \(vec(\hat{\alpha }_Y)\) can be reshaped into an \(r\times r\) matrix, so we obtain an \(r^2\)–dimensional representation of Y.
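For illustration, a minimal sketch of computing this \(r^2\)–dimensional representation from (7), assuming all parameters, \(S_Y\) and \(C_Y\) are available (reduce_sample is a hypothetical helper; general pseudo-inverses are used throughout, as in (7)):

```python
# Low-dimensional representation of a new sample Y via Eq. (7).
import numpy as np
from scipy.linalg import fractional_matrix_power

def reduce_sample(Y, S_Y, C_Y, lambda0, Lambda, xi, Sigma, Psi, sigma, r):
    vec = lambda M: M.reshape(-1, order="F")
    M_Y = sigma**2 * np.eye(Y.size) + np.kron(C_Y.T @ Psi @ C_Y, S_Y @ Sigma @ S_Y.T)
    M_inv = np.linalg.pinv(M_Y)                         # general pseudo-inverse M_Y^-
    A = np.kron(xi @ C_Y, Lambda.T @ S_Y.T)             # (xi C_Y) ⊗ (Lambda^T S_Y^T)
    G = A @ M_inv @ A.T                                 # assumed invertible here
    vec_alpha = fractional_matrix_power(G, -0.5) @ A @ M_inv @ vec(Y - S_Y @ lambda0 @ C_Y)
    return np.real(vec_alpha).reshape((r, r), order="F")   # r x r representation
```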

3 Experimental study

In this section, we report the empirical results of MUDRA regarding predictive power, interpretability, and time efficiency. The overall experimental pipeline involves the extraction of features using an algorithm such as MUDRA or ROCKET (Dempster et al., 2020), followed by the classification of the samples based on the extracted features via a ridge regression–based classifier with balanced class weights, as shown in Figure 2. In our experiments, we require that the time series are aligned based on time elapsed since the first observation. However, if there is an experimental timeline, we can align the data points along that timeline and apply MUDRA straightforwardly. For example, if we know when our experiment starts, we can align our sample timelines even if the first observation is missing.
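For illustration, the classification stage of this pipeline can be sketched as follows, with random arrays standing in for MUDRA- or ROCKET-extracted features (the exact classifier settings used in our experiments may differ):

```python
# Ridge-regression-based classification of extracted features, with balanced weights.
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

rng = np.random.default_rng(4)
# Stand-ins for extracted features: (n_samples, n_features) arrays and class labels
X_train, y_train = rng.normal(size=(100, 9)), rng.integers(0, 3, size=100)
X_test = rng.normal(size=(40, 9))

clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10), class_weight="balanced")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```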

Fig. 2 Pipeline for the experimental study in Section 3. The red box (“MUDRA”) can be replaced with any other corresponding algorithm, for instance ROCKET (Dempster et al., 2020)

We selected the ROCKET algorithm (Dempster et al., 2020) as the baseline for MUDRA. A recent review by Ruiz et al. (2021) showed that ROCKET is currently the best-performing model, both in terms of efficiency and accuracy, for feature transformation in short–time–series classification. However, when data is missing, ROCKET cannot be applied directly. Inspired by Bier et al. (2022), we applied three forms of padding and imputation to the data set prior to running ROCKET (a minimal sketch of each scheme follows the list below):

1. “Imputation”–type padding: the recorded data points are aligned according to their sampling times; missing time points and missing features are set to 0.

2. “End”–type padding: all missing features are imputed with 0, all measured time points are concatenated, and zeros are appended at the end to obtain time series of the same length.

3. Last Observation Carried Forward (LOCF) imputation: the last observed non-missing value is used to fill in missing values at later time points.
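For illustration, minimal sketches of these three schemes, assuming each sample is stored as a (time × feature) array with NaN marking missing entries (the exact preprocessing code used in our experiments may differ in details):

```python
# Three padding/imputation schemes applied to a single sample before ROCKET.
import numpy as np

def imputation_padding(values, time_idx, n_time, n_feat):
    """'Imputation'-type: place observations on the full time grid, zeros elsewhere.
    values: (n_obs, n_feat) with NaN for unmeasured features; time_idx: grid indices."""
    out = np.zeros((n_time, n_feat))
    out[time_idx] = np.nan_to_num(values)
    return out

def end_padding(values, n_time):
    """'End'-type: zero-fill missing features, keep measured time points contiguous,
    then append zero rows up to the common length n_time."""
    filled = np.nan_to_num(values)
    pad = np.zeros((n_time - filled.shape[0], filled.shape[1]))
    return np.vstack([filled, pad])

def locf(values):
    """Last Observation Carried Forward along the time axis."""
    out = np.array(values, dtype=float)
    for t in range(1, out.shape[0]):
        missing = np.isnan(out[t])
        out[t, missing] = out[t - 1, missing]
    return np.nan_to_num(out)   # leading NaNs (no earlier observation) set to 0
```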

The experiments, along with the environment configuration and package versions used for ROCKET and MUDRA, are available in the GitHub repository (https://github.com/rbordoloi/MUDRA). The reported experiments were run on a CPU cluster (configuration: AMD EPYC 7763 64-core processor, 20 cores @2.40GHz, 8GB RAM). None of the algorithms used a GPU.

3.1 Time–series classification on a synthetic data set

We built a data set with two features, 12 time points, two classes, and 300 samples, as described in Appendix D. For this experiment, we set \(b=10\) spline basis functions as a tradeoff between performance and computational cost. To assess the predictive power of our model, we collected the \(\mathcal {L}^2\)-norm of the functional \(R^2\) values (Ramsay and Silverman, 2005) for each class as a goodness-of-fit measure, and the \(F_1\) scores for the corresponding classification task for multiple values of r. \(R^2\) values and \(F_1\) scores are reported in Figure 3a and Figure 3b, respectively.

Fig. 3 Synthetic experiments with \(b=10\). Left: Approximate \(\mathcal {L}^2\)-norm of the functional \(R^2\) values for \(r \in \{1,\ldots ,10\}\). Right: \(F_1\)–scores of ROCKET and MUDRA on the classification task on the synthetic data set for \(r \in \{1,\ldots ,10\}\)

We observe that the approximate \(\mathcal {L}^2\)-norms of the \(R^2\) values show a general increasing trend: an increase in the number of class-specific parameters results in a corresponding increase in the goodness-of-fit. Similarly, Figure 3b shows an increase in classification accuracy in terms of \(F_1\)–score as the output dimensionality increases. For low dimensionalities (\(r\le 4\)), MUDRA outperforms ROCKET (with Last Observation Carried Forward imputation) combined with a ridge regression–based classifier. Figure 4 shows that MUDRA is able to retrieve the true functionals in the synthetic data set.

Fig. 4 Top: curves of Feature 1 (left) and Feature 2 (right) across the three classes. Bottom: corresponding curves estimated by MUDRA with \(r=3\), \(b=7\)

3.2 Time–series dimension reduction on real-life data

We also applied the MUDRA algorithm to the real-world data set “Articulary Word Recognition” from Dau et al. (2019). This data set compiles measurements of the movements of the tongue and lips during speech in native English speakers pronouncing 25 words. The time-varying tongue and lip movements were measured by an electromagnetic articulograph with 12 sensors that return 3D coordinates at a sampling rate of 200 Hz during the pronunciation of each word. Each word represents one class, and the goal is to identify the time-dependent movements associated with each word (Wang et al., 2013). The training set contains 275 series/samples, and the testing set 300. In this use case, we restrict ourselves to 9 dimensions corresponding to the outputs from three sensors: one on the tongue tip, one on the upper lip, and the last on the lower lip (Shokoohi-Yekta et al., 2017).

Further details about the simulation of missing data are provided in Appendix D. We applied MUDRA or ROCKET to perform a dimension reduction on the data set, and the resulting features were fed to a ridge regression–based classifier. This procedure tests how informative the resulting representations are, which is a first step towards interpretability. We also classified the data directly through the Bayes–optimal classifier described in Section 2.2. Since ROCKET always reduces to an even number of dimensions, and MUDRA to \(r^2\) dimensions, to ensure fairness when comparing both algorithms, we set the output dimension of ROCKET to the smallest even number larger than \(r^2\). The two plots in Figure 5 report the \(F_1\) scores obtained for the classification tasks for multiple values of the output dimension r, on complete data (i.e.  without any missing data) and on missing data, where the procedure in Appendix D has been applied to the data set. Figure 5a shows that MUDRA competes with ROCKET for very low–dimensional representations, but ROCKET overtakes MUDRA as we increase the output dimension.

Figure 5b shows that MUDRA clearly outperforms ROCKET in all of these cases.

Fig. 5 ROCKET versus MUDRA with a ridge regression–based classifier on the “Articulary Word Recognition” data set with \(b=9\) and \(r \in [1,9]\). Left: MUDRA versus ROCKET on complete data. The blue curve corresponds to the Bayes–optimal rule in MUDRA, the orange one to the dimension reduction in MUDRA combined with a ridge regression classifier, and the green curve to ROCKET–transformed features fed to the classifier. Right: MUDRA versus ROCKET on missing data. The blue curve corresponds to the Bayes–optimal rule in MUDRA, the orange one to the dimension reduction in MUDRA combined with a ridge regression classifier, the green (resp., red) curve to ROCKET–transformed features with “imputation” (resp., “end”)–type padding fed to the classifier, and the purple curve to ROCKET–transformed features with LOCF imputation (Color figure online)

3.3 Empirical runtime of the MUDRA algorithm

To assess the empirical time complexity of MUDRA, we collected the runtimes over \(N=1,000\) iterations of both MUDRA and ROCKET combined with a ridge classifier on the real–life data set (without missing data, \(r=7\), \(b=9\)). Resulting runtimes are reported in Figure 6.

Fig. 6 Runtimes for ROCKET and MUDRA on the complete real–life data set for \(r=7\) and \(b=9\) (\(N=1{,}000\) iterations per algorithm)

As evidenced by Figure 6, the current implementation of MUDRA is approximately 12 times slower on average than ROCKET. Since ROCKET is a simple algorithm that shifts most of the burden of inference to the subsequent classification algorithm, it is extremely fast. However, as previously mentioned, it lacks some of the predictive power of MUDRA.

4 Conclusion

This work presents a novel approach to dealing with missing data in multivariate functional data, which has historically proven to be a challenging problem for short–time–series classification. Contemporary algorithms that achieve state-of-the-art performance typically resort to padding; however, this strategy proves suboptimal for very short time series with a large proportion of missing data. Our proposed algorithm, MUDRA, uniquely enables model training on selected fragments of a high–dimensional surface without necessitating regularization to reconstruct the entire surface for each sample. This eliminates the need for padding and imputation. Our algorithm outperforms the current state-of-the-art models in cases where a very low–dimensional representation is required and/or a large proportion of the data is missing. Such cases frequently arise in real-life data, especially in biomedical applications. MUDRA is also more interpretable than the state of the art in several ways. Firstly, the estimated class functionals can be recovered (see (3)), which is impossible with other methods for this problem. Additionally, since the model is based on a statistical functional regression approach, the estimated parameters are maximum likelihood estimators. This means that one could, in principle, construct confidence intervals and parametric hypothesis tests based on these estimates (see Appendix A). Finally, since the underlying parametric assumptions involve Gaussian distributions, one can also interpret the embeddings generated by the model as clusters and assess new data points based on their closeness to the various cluster centroids (see the experiments on the real-life “Articulary Word Recognition” data set in Section 3.2).

While our approach demonstrates considerable flexibility and efficiency, it has some shortcomings. Empirically, we showed that the MUDRA algorithm sacrifices some time efficiency in favor of interpretability. Improving the implementation of Algorithm 1 by resorting to faster routines (Nguyen et al., 2016; Song et al., 2022), switching the programming language to C or Fortran, or approximating the solutions of the equations (Wang et al., 2022) could tackle this issue.

On the theoretical side, one issue we could not solve is the irregular sampling of features within a single subject. Consider data being recorded for a sample \(Y_{ij}\) at time points \(t^{i j}_{1}\) and \(t^{i j}_{2}\). If the features recorded for \(Y_{ij}\) at \(t^{i j}_{1}\) are not the same as those at \(t^{i j}_{2}\), our model cannot train on that data point without resorting to the imputation or deletion of some of the features. Another significant limitation is the assumption of common covariance matrices \(\Sigma\) and \(\Psi\) for the autoregressive noise in each class, which stems from the LDA standpoint. This is sometimes not representative of real data, for example clinical data, as certain conditions can have a major impact on the variability of the levels of certain metabolites. A quadratic discriminant analysis (QDA) approach might have mitigated these issues. However, class-specific covariance matrices raise a number of computational problems in the ECM algorithm and do not lead to a relatively simple expression for the dimension reduction of a new sample. Tackling these shortcomings would improve the applicability of short–time–series classification algorithms to real-life use cases.