1 Introduction

Feature extraction from high-dimensional data has been an important research area in machine learning [2, 7, 8, 27]. The traditional feature extraction problem focuses on multivariate data, with the goal of reducing the dimensionality from the original observation domain to a feature domain. This problem is typically encountered in two machine learning tasks, clustering and classification, depending on whether data labels are available. Clustering is an unsupervised learning technique that aims at grouping multivariate data into a selected number of clusters, while classification focuses on predicting the group membership of test data by first learning a mathematical model from the training data and then using this model to make predictions. The performance of clustering or classification is often affected by the dimension of the multivariate data, which is why dimension reduction is usually considered first in any given problem. Random signals, such as EEG or financial time series, are a special type of high-dimensional data that often contain no deterministic patterns. Instead, the patterns may appear to be stochastic and, for signals observed over time, often time dependent. Moreover, most real-world signals are non-stationary, which makes the time dependence difficult to measure because of its instability and uncertainty. Lacking a stable pattern that can be modelled by mathematical equations, random signals are challenging to cluster or classify directly in the original time domain. Therefore, feature extraction is usually the most important step for successful classification or clustering of random signals [1, 4, 18, 22, 23], and the development of new methodology for this type of problem is highly desirable.

In real-world applications of biomedical signal classification [5, 9, 13, 14, 17], a low-dimensional and linearly or non-linearly separable feature vector is highly desirable, both for the ease of data visualization in medical devices and for the possibility of using simple classification methods, such as a linear classifier or the k-nearest neighbour (k-NN) method. To meet this goal, much current research focuses on feature extraction in the time domain using sparse representations of signals [28]. The idea of sparse representation is to approximate a given signal using a set of basis functions and to obtain the coefficients of that approximation. These coefficients are used as signal features, and further study of signal characteristics is based on them. However, a set of time-domain basis functions that approximates one signal well may be a poor choice for another, so the extracted features may be extremely volatile. This can lead to poor performance when these features are used in applications. To overcome this, the focus of sparse representation has moved from the time domain to other single or multiple domains. Among the many published approaches of this type, time-frequency decomposition is the most popular; it decomposes the signal into time and frequency domain components. By extending the analysis from a single domain to multiple domains, the separability of signal features is often significantly improved, and classification based on features extracted in both time and frequency domains outperforms single-domain approaches [6, 12].

For sparse representation of signals, either in the time domain or the time-frequency domain, the objective is to achieve an explainable, low-dimensional feature vector, so that classification or clustering of signals becomes easier in the feature domain [24]. When signals contain deterministic patterns and the signal-to-noise ratio is high, they are typically referred to as functional data. Functional data can be in the spatial domain, the temporal domain or both. Examples of functional data include, but are not limited to, images, temperature data, and growth curves [15, 16, 20]. To study functional data, functional data analysis (FDA) is a natural tool, due to its capability of capturing stable, observable deterministic patterns contained in the data. In [25], we applied functional data analysis, namely functional summary statistics, functional probes and functional principal components, to both epileptic EEG signals without seizure and non-epileptic signals. We demonstrated that feature extraction through functional data analysis is able to produce low-dimensional and explainable features, so that signals of different types can be discriminated. In this extended work, we further present results for epileptic EEG signals with seizure onset. We focus on functional principal component analysis and provide more technical details. In particular, we discuss the effect of the number of basis functions retained in functional PCA on clustering. We also demonstrate the application of the proposed method to both epilepsy diagnosis and epileptic seizure detection [10, 11, 21, 26].

This paper is organized as follows. In Sect. 2, we discuss the proposed methods, including Fourier power spectra and functional principal component analysis. In Sect. 3, the analysis of publicly available EEG data and a summary of the main results are presented. Finally, we conclude our findings and provide further remarks in Sect. 4.

2 Methods

2.1 Fourier Power Spectra

To define the Fourier power spectrum of a signal, we first apply the Fourier transform. For a given signal \(X_t\) of length n, sampled at discrete times, the discrete Fourier transform (DFT) is defined as

$$\begin{aligned} d(\omega _j)=n^{-1/2} \sum _{t=1}^n X_t e^{-2\pi i \omega _j t}, \end{aligned}$$
(1)

for \(j=0,1, \ldots , n-1\), where the frequency \(\omega _j\) = j/n. The discrete Fourier transform concentrates the signal power on a small set of dominant frequencies, which means that one is able to focus only on selected frequencies \(\omega _j\) and their transformed values \(d(\omega _j)\). The signal powers corresponding to the selected frequencies provide a first layer of feature extraction, and the focus on selected frequencies achieves feature sparsity. However, the direct use of the Fourier transformed values is inconvenient, because they are complex numbers. The signal periodogram avoids this difficulty. The periodogram at frequency \(\omega _j\) is defined as

$$\begin{aligned} I(\omega _j)&= \mid d(\omega _j)\mid ^2 =\frac{1}{n}\sum _{t=1}^n \sum _{s=1}^n (X_t - \bar{X})(X_s-\bar{X}) e^{-2\pi i \omega _j (t-s)} \\ &= \frac{1}{n} \sum _{h=-(n-1)}^{n-1} \sum _{t=1}^{n-\mid h \mid } (X_{t+\mid h \mid } - \bar{X})(X_t-\bar{X}) e^{-2\pi i \omega _j h} \\ &= \sum _{h=-(n-1)}^{n-1} \hat{\gamma }(h) e^{-2\pi i \omega _j h}, \end{aligned}$$
(2)

where \(\hat{\gamma }(h)\) is the sample auto-covariance function at time lag h. From the statistical point of view, the auto-covariance function captures the stochastic pattern of a given signal, namely its time dependence structure. Therefore, our approach does not rely on extracting deterministic patterns in the original time domain, which makes it significantly different from sparse representation of signals or time-frequency decomposition methods. Instead, we first summarize the key signal features in the auto-covariance function, from which the periodogram is obtained. Notice that the periodogram \(I(\omega _j)\) is just the Fourier transform of the auto-covariance function, and it captures the quadratic covariation of the lagged signals in the spectral domain. \(I(\omega _j)\) is also called the power spectrum of the signal \(X_t\). By definition, the periodogram describes how the covariation of a signal is distributed over the spectral domain: the larger the value of the periodogram, the more dominant the corresponding frequency. The dominant values thus determine the signal power spectra. Often the dominant frequencies are the smaller ones, which implies that, after the signal is transformed into the spectral domain, local patterns are more significant than global ones. For this reason, we will only analyze the power spectrum on a sub-interval, namely the first 200 frequency values of each signal.
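
As a concrete illustration, the periodogram above can be computed with a fast Fourier transform. The following is a minimal Python sketch (variable names are ours, not from [25]) that mean-centres the signal, applies the scaled DFT of Eq. (1), and keeps the first 200 frequency values, as in our analysis:

```python
import numpy as np

def periodogram(x, n_freq=200):
    # Sketch of Eqs. (1)-(2): scaled DFT of the mean-centred signal,
    # followed by the squared modulus; only the first n_freq Fourier
    # frequencies omega_j = j/n are kept, as in the text.
    x = np.asarray(x, dtype=float)
    n = x.size
    d = np.fft.fft(x - x.mean()) / np.sqrt(n)  # d(omega_j) of Eq. (1)
    return np.abs(d[:n_freq]) ** 2             # I(omega_j) = |d(omega_j)|^2
```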

To illustrate the points stated above, we present the power spectra of selected EEG signals of different types in Fig. 1. One can clearly see that the power spectra of the first 200 frequency values behave similarly within each set of data, but their patterns differ across frequency values. In particular, we observe that the signal powers at high frequencies remain high for the signals from sets D and E, but decay significantly for sets B and C. This observation inspires us to consider a functional representation of the signal power spectra, as the functionality of the signals within each set is significantly enhanced compared to the original time series plots. However, as we can see from Fig. 1, the power spectra within each set are still very noisy, and further smoothing may be required to achieve better clustering performance on the extracted features.

Fig. 1. Sample plots of the power spectra for data sets B (Normal: Eyes Open), C (Non-epileptogenic zone), D (Epileptogenic zone) and E (Seizure onset), respectively. Each plot contains three sample power spectra of the first 200 frequency values. This figure appears in the proceedings of ICPRAM 2019 [25].

Fig. 2. The plots of functional power spectra for data sets A (Normal: Eyes Closed) and B (Normal: Eyes Open), respectively, with different choices of the K value. Ten B-spline basis functions are used to smooth the sample power spectra.

Fig. 3. The plots of functional power spectra for data sets C (Non-epileptogenic zone) and D (Epileptogenic zone), respectively, with different choices of the K value. Ten B-spline basis functions are used to smooth the sample power spectra.

Fig. 4. The plots of functional power spectra for data set E (Epileptic Seizure), with different choices of the K value. Ten B-spline basis functions are used to smooth the sample power spectra.

Fig. 5. The plots of extracted probe values using the mean function of the power spectrum of set D as the functional probe, for data sets A (Normal: Eyes Closed, Black), B (Normal: Eyes Open, Red), C (Non-epileptogenic zone, Green) and D (Epileptogenic zone, Blue), respectively. (Color figure online)

2.2 Functional Representation of Signal Power Spectrum

Our discussion of the signal power spectra in the previous section shows that they are noisy, and further processing is required before we can extract signal features for clustering. Otherwise, the extracted features may be distorted by noise, and the performance of subsequent clustering may not be ideal. Also, a signal is sampled at discrete times, and further processing is required to make it functional. Equipped with functional data analysis, we can model the power spectrum of a discretely observed signal by a linear combination of a set of continuous basis functions. This makes feature extraction of signals more mathematically tractable. For the ith signal, we can expand the power spectrum \(I_i (\omega )\) as

$$\begin{aligned} I_i (\omega )=\sum _{k=1}^K \alpha _{ik} \phi _k (\omega ), \end{aligned}$$
(3)

where \(\omega \) is the frequency value, \(\alpha _{ik}\) is the coefficient of the kth basis function and K is the total number of basis functions. Since our objective is not to fully represent a power spectrum using a set of basis functions, K is assumed to be finite and relatively small. That is, we approximate the power spectrum by a linear combination of K basis functions, but for ease of explanation we use an equal sign in Eq. (3). From the computational perspective, K is always finite, as the number of basis functions needed to approximate the power spectrum is smaller than the number of frequency values under consideration. Notice that, within this section, we do not isolate the mean function from the representation of the signal. In the later discussion, we will separate the mean function from the signal expansion because of the need to study functional variation.

When considering a sample of N signals, that is, \(i = 1,2,\ldots , N\) in Eq. (3) above, the matrix notation for these power spectra becomes

$$\begin{aligned} \mathbf{I} (\omega )=\mathbf{A} \varvec{\phi }(\omega ), \end{aligned}$$
(4)

where \(\mathbf{I} (\omega )\) = \([I_1(\omega ),I_2(\omega ),\ldots ,I_N(\omega )]^{\top }\) is a column vector of length N and \(\varvec{\phi }(\omega )\) = \([\phi _1(\omega )\), \(\phi _2(\omega )\), \(\ldots \), \(\phi _K(\omega )]^{\top }\) is a column vector of length K containing the basis functions. \(\mathbf{A}\) is the coefficient matrix of size \(N \times K\), i.e.

$$\begin{aligned} \mathbf{A}= \begin{bmatrix} \alpha _{11} &{} \alpha _{12} &{} \alpha _{13} &{} \dots &{} \alpha _{1K} \\ \alpha _{21} &{} \alpha _{22} &{} \alpha _{23} &{} \dots &{} \alpha _{2K} \\ \vdots &{} \vdots &{} \vdots &{} \ddots &{} \vdots \\ \alpha _{N1} &{} \alpha _{N2} &{} \alpha _{N3} &{} \dots &{} \alpha _{NK} \end{bmatrix}. \end{aligned}$$
(5)

Notice that the set of basis functions \( \varvec{\phi }(\omega )\) could differ across groups of signals; however, since we consider signals that share many commonalities in the spectral domain, it is reasonable to use the same set of basis functions. This allows us to extract signal features within the same feature space. In this work, we therefore assume that all signals can be represented using the same set of basis functions.
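
For illustration, the coefficient matrix \(\mathbf{A}\) in Eq. (4) can be estimated by ordinary least squares once the basis functions have been evaluated on the common frequency grid. Below is a minimal sketch under assumed, illustrative inputs: `I_mat`, an \(N \times p\) array of sample power spectra, and `Phi_mat`, a \(p \times K\) array with entries \(\phi _k(\omega _m)\).

```python
import numpy as np

def fit_coefficients(I_mat, Phi_mat):
    # Least-squares fit of Eq. (4): each row of I_mat is approximated
    # by a linear combination of the K basis functions in Phi_mat.
    A_t, *_ = np.linalg.lstsq(Phi_mat, I_mat.T, rcond=None)
    return A_t.T  # N x K coefficient matrix A
```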

2.3 Functional Probes

Descriptive methods such as the functional mean, functional variance or functional covariance allow us to see functional central tendency and functional variation patterns. However, they may themselves be functional and high-dimensional. If this is the case, for classification purposes the dimension of these functional descriptive statistics needs to be reduced further. One possible dimension reduction method is the application of functional probes. A probe \(\rho _{\xi }\) is a measure that reveals specific variation through a functional weight \(\xi (\omega )\); it is defined as the inner product of \(\xi (\omega )\) and \( I(\omega )\):

$$\begin{aligned} \rho _{\xi }=\int \xi (\omega ) I(\omega ) d\omega . \end{aligned}$$
(6)

The weight \(\xi (\omega )\) has to be structured so that we can extract specific features or meaningful patterns of the variation in the power spectrum \(I(\omega )\). The probe values for the ith signal power spectrum, using the functional mean and the functional standard deviation of the jth group of power spectra as weights, are defined, respectively, as

$$\begin{aligned} \rho _{\bar{I}_{ij}}&= \int \bar{I}^{(j)} (\omega ) I_i(\omega ) d\omega =\sum _{k_{1}=1}^K \sum _{k_{2}=1}^K\alpha _{ik_1} \bar{\alpha }_{k_2}^{(j)} \int \phi _{k_1}(\omega )\phi _{k_2}(\omega )d \omega , \end{aligned}$$
(7)
$$\begin{aligned} \rho _{S_{ij}}&= \int S_{I}^{(j)}(\omega ) \, I_i(\omega ) d\omega . \end{aligned}$$
(8)

The functional probe values capture the similarity between the weight function and the ith power spectrum. Using the functional mean as a probe, when the basis functions are orthonormal, i.e., \(\int \phi _{k_1} (\omega ) \phi _{k_2} (\omega ) d\omega =0\) for \(k_1\ne k_2\) and \(\int \phi _k ^2(\omega ) d\omega =1\) for \(k=1,2,\ldots \), the probe values become

$$\begin{aligned} \rho _{\bar{I}_{ij}}= \sum _{k=1}^K\alpha _{ik} \bar{\alpha }_{k}^{(j)}, \end{aligned}$$
(9)

and they can be interpreted as a similarity measure, in terms of overall pattern, between two different groups of signals in the spectral domain. Unlike the case of the functional mean, no closed form exists for probe values that use the functional standard deviation. In this work, we continue to investigate how the extracted signal features (i.e., probe values) behave when the functional mean and the functional standard deviation are used as functional probes. However, we focus only on one-dimensional quantities; for two-dimensional patterns, we refer to [25].
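
Numerically, a probe value in Eq. (6) is just an inner product on the discrete frequency grid. A minimal sketch, assuming the weight function and the spectrum have been evaluated on the same grid `omega`:

```python
import numpy as np

def probe_value(xi_vals, I_vals, omega):
    # Trapezoidal approximation of Eq. (6): rho_xi = int xi(w) I(w) dw.
    return np.trapz(xi_vals * I_vals, omega)
```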

So far, we have discussed functional probe values based on the power spectrum \(I(\omega )\). If we replace \(I(\omega )\) by the variance-covariance function \(v(\omega _1, \omega _2)\), then the functional probe value becomes

$$\begin{aligned} \int \xi (\omega _2) v(\omega _1,\omega _2) d \omega _2. \end{aligned}$$
(10)

This is exactly the left-hand side of the eigenequation for solving eigenvalues and eigenfunctions in functional principal component analysis. More discussion of this type of probe will be provided later.

2.4 Classical Principal Component Analysis

In multivariate statistics, principal component analysis (PCA) of a p-variate random vector X = \((X_1, X_2, \ldots , X_p)\) looks for a set of weights, denoted by \(\mathbf \xi _j\) = \((\xi _{1j}\), \(\xi _{2j}\), \(\ldots \), \(\xi _{pj})\), such that, at the jth step, the variance of the linear combination of the variables \(X_i\) is maximized. That is,

$$\begin{aligned} Var\bigg (\sum _{i=1}^p \xi _{ij} X_i^{(j)}\bigg ) \end{aligned}$$
(11)

is maximized, or, in matrix notation, \(Var(X^{(j)} \mathbf \xi _j^{\top })\) is maximized, where

$$\begin{aligned} X^{(j)}=X^{(j-1)}-\big (X^{(j-1)} \mathbf \xi _{j-1}^{\top }\big )\, \mathbf \xi _{j-1},\,\, \text {for}\,\, j=2,\ldots , p, \end{aligned}$$
(12)

where \(X^{(1)}\) is defined as X itself. The maximization is repeated for \(j=1, 2,\ldots , p\), and at each step it is subject to

$$\begin{aligned} \nonumber \sum _{i=1}^p \xi _{ij}^2=1 \,\, \text {and}\,\, \sum _{i=1}^p \xi _{ij}\xi _{il}=0, \text { for} \, j < l, \text { and} \, 1\le l,j \le p. \end{aligned}$$

An alternative way of finding the solution \(\mathbf \xi _j\) is the singular value decomposition (SVD) of the data matrix \(\mathbf{X}\), which contains N realizations of X. The solution \(\mathbf \xi _j\) is the jth right singular vector obtained from the SVD of \(\mathbf{X}\). In this work, the data matrix \(\mathbf{X}\) becomes \(\mathbf{I} (\omega )\), an \(N \times p\) data matrix, where p is the total number of frequency values being considered. Technically, the SVD of \(\mathbf{I} (\omega )\) is a factorization of the form \(\mathbf{U} \varSigma \mathbf{V}^{\top }\), where \(\mathbf{U}\) is an \(N \times N\) unitary matrix, \(\varSigma \) is an \(N \times p\) diagonal matrix consisting of the singular values of \(\mathbf{I} (\omega )\) and \(\mathbf{V}\) is a \(p \times p\) unitary matrix. The columns of \(\mathbf {U}\) and the columns of \(\mathbf {V}\) are called the left and right singular vectors of \(\mathbf{I} (\omega )\), respectively. Each column of \(\mathbf {V}\) is just a weight vector \(\mathbf \xi _j\). The feature extraction of the data matrix \(\mathbf{I} (\omega )\) then becomes a matter of computing \(\mathbf{I} (\omega ) \mathbf \xi _j^{\top }\), for \(j=1,2,\ldots , p\). For example, the first principal component scores are \(\mathbf{I} (\omega ) \mathbf \xi _1^{\top }\), the second principal component scores are \(\mathbf{I} (\omega ) \mathbf \xi _2^{\top }\), and so on.
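
The SVD route can be sketched in a few lines; here `I_mat` is again the assumed \(N \times p\) array of periodogram values, centred column-wise before the decomposition:

```python
import numpy as np

def pca_scores(I_mat, n_comp=2):
    # Column-centre the data, factor it as U S V^T, and project onto
    # the first n_comp right singular vectors (the weight vectors xi_j).
    Ic = I_mat - I_mat.mean(axis=0)
    U, s, Vt = np.linalg.svd(Ic, full_matrices=False)
    return Ic @ Vt[:n_comp].T  # N x n_comp principal component scores
```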

We should also note that the objective function in PCA can be rewritten as \(\mathbf \xi _j \mathbf{X}^{\top } \mathbf{X} \mathbf \xi _j^{\top }\), assuming that the data matrix \(\mathbf{X}\) is centred. In this case, \(\mathbf{X}^{\top } \mathbf{X}\) becomes the variance-covariance matrix (up to the factor \(1/(N-1)\)), and the vector \(\mathbf \xi _j \) is still an eigenvector of that matrix. All of this helps us understand the relationships among the functional probe, PCA and functional PCA, which will be discussed later. Notice that the functional probes discussed above aim at capturing the variation of the data associated with the weight function. If we select the functional weight \(\xi (\omega )\) so that the variance of the functional probe values in (6) is maximized, subject to the constraints \(\int \xi _l(\omega ) \xi _j(\omega )d\omega =0\) for \( l \ne j\) and \(\int \xi _j ^2(\omega ) d\omega =1\), then this procedure becomes functional PCA. In this case, the functional probe values are the principal component scores, and the weight functions become the functional principal component loadings.

2.5 Functional Principal Component Analysis

So far, we have explained how the functional probe and classical principal component analysis work in analyzing the power spectra of random signals. We have also mentioned that combining principal component analysis with a functional probe leads to functional principal component analysis. That is, functional principal component analysis is a statistical procedure that maximizes the variance of the functional probe values, with the variance-covariance function used as the input to the functional probe.

Suppose that the power spectrum of a given signal can be expanded using K basis functions and it is given as follows:

$$\begin{aligned} I_i (\omega ) = \mu (\omega )+\sum _{k=1}^K \beta _{ik} \phi _k (\omega ), \end{aligned}$$
(13)

where \(\mu (\omega )\) is the functional mean of the power spectrum. Here, we consider an approximation of \(I_i (\omega )\) rather than an exact representation using an infinite number of basis functions. Of course, we cannot rule out the possibility that a very large K is needed to represent a signal, but in this work we consider a sparse representation of the power spectrum. Also, the functional mean of the power spectrum must be estimated. Without loss of generality, we can simply take the grand mean over all \(\omega \), denoted by \(\mu _0\), as an estimate, so that Eq. (13) becomes

$$\begin{aligned} I_i (\omega ) = \mu _0+\sum _{k=1}^K \beta _{ik} \phi _k (\omega ), \end{aligned}$$
(14)

or in a matrix notation

$$\begin{aligned} \mathbf{I} - \mathbf {\mu }= \mathbf{C}\mathbf \phi , \end{aligned}$$
(15)

where \( \mathbf \mu \) = \((\mu _0,\mu _0,\ldots ,\mu _0)^{\top }\), \(\mathbf \phi \) = \((\phi _1,\phi _2,\ldots ,\phi _K)^{\top }\), and \(\mathbf{C}\) is the \(N \times K\) coefficient matrix

$$\begin{aligned} \mathbf{C}= \begin{bmatrix} \beta _{11} &{} \beta _{12} &{} \beta _{13} &{} \dots &{} \beta _{1K} \\ \beta _{21} &{} \beta _{22} &{} \beta _{23} &{} \dots &{} \beta _{2K} \\ \vdots &{} \vdots &{} \vdots &{} \ddots &{} \vdots \\ \beta _{N1} &{} \beta _{N2} &{} \beta _{N3} &{} \dots &{} \beta _{NK} \end{bmatrix}. \end{aligned}$$

Now, we will describe how to obtain the functional principal components and their scores. First, let us denote the variance-covariance function of the power spectra \( \mathbf{I} \) by \(v(\omega _1, \omega _2)\). In matrix notation, this function is given by

$$\begin{aligned} v(\omega _1, \omega _2)=\frac{1}{N-1} \mathbf \phi ^{\top }(\omega _1) \mathbf{C}^{\top } \mathbf{C} \mathbf \phi (\omega _2). \end{aligned}$$

To solve for the eigenfunction, first we have to solve the following eigenequation for an appropriate eigenvalue \(\lambda \)

$$\begin{aligned} \int v(\omega _1,\omega _2) \xi (\omega _2) d \omega _2 = \lambda \xi (\omega _1). \end{aligned}$$
(16)

Suppose that the eigenfunction \(\xi (\omega )\) has an expansion

$$\begin{aligned} \xi (\omega )=\sum _{k=1}^K b_{k} \phi _k (\omega ), \end{aligned}$$
(17)

or in matrix notation

$$\begin{aligned} \xi (\omega )= \mathbf \phi ^{\top }(\omega ) \mathbf{b}, \end{aligned}$$
(18)

where \(\mathbf{b}\) = \((b_1,b_2,\ldots ,b_K)^{\top }\). This yields

$$\begin{aligned} \int v(\omega _1,\omega _2) \xi (\omega _2) d \omega _2&= \frac{1}{N-1}\int \mathbf \phi ^{\top }(\omega _1) \mathbf{C}^{\top } \mathbf{C} \mathbf \phi (\omega _2) \mathbf \phi ^{\top }(\omega _2) \mathbf{b} \, d \omega _2 \\ &= \frac{1}{N-1} \mathbf \phi ^{\top }(\omega _1) \mathbf{C}^{\top } \mathbf{C} \mathbf \Phi \mathbf{b}, \end{aligned}$$
(19)

where \( \mathbf \Phi \) = \(\int \mathbf \phi (\omega ) \mathbf \phi ^{\top }(\omega ) d \omega \) is the \(K \times K\) matrix given as follows:

$$\begin{aligned} \mathbf \Phi = \begin{bmatrix} \int \phi _1 \phi _1 d \omega &{} \int \phi _1 \phi _2 d \omega &{} \dots &{} \int \phi _1 \phi _K d \omega \\ \int \phi _2 \phi _1 d \omega &{} \int \phi _2 \phi _2 d \omega &{} \dots &{} \int \phi _2 \phi _K d \omega \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \int \phi _K \phi _1 d \omega &{} \int \phi _K \phi _2 d \omega &{} \dots &{} \int \phi _K \phi _K d \omega \end{bmatrix}. \end{aligned}$$

The eigenequation in (16) becomes

$$\begin{aligned} \frac{1}{N-1} \mathbf \phi ^{\top }(\omega ) \mathbf{C}^{\top } \mathbf{C} \mathbf \Phi \mathbf{b}=\lambda \mathbf \phi ^{\top }(\omega ) \mathbf{b}. \end{aligned}$$
(20)

Since (20) must hold for all \(\omega \), this implies that the following matrix equation must hold

$$\begin{aligned} \frac{1}{N-1} \mathbf{C}^{\top } \mathbf{C} \mathbf \Phi \mathbf{b}=\lambda \mathbf{b}. \end{aligned}$$
(21)

To obtain the required principal components, we define \(\mathbf{u}= \mathbf \Phi ^{1/2} \mathbf{b}\), and Eq. (21) becomes

$$\begin{aligned} \frac{1}{N-1} \mathbf \Phi ^{1/2} \mathbf{C}^{\top } \mathbf{C} \mathbf \Phi ^{1/2} \mathbf{u}=\lambda \mathbf{u}. \end{aligned}$$
(22)

By solving the symmetric eigenvalue problem for \(\mathbf{u}\) in (22) and then computing \(\mathbf{b}=\mathbf \Phi ^{-1/2} \mathbf{u}\), we obtain the eigenfunction

$$\begin{aligned} \xi (\omega )= \mathbf \phi ^{\top }(\omega )\mathbf \Phi ^{-1/2} \mathbf{u}. \end{aligned}$$
(23)

Solving the eigenvalue problem in (22) will produce K different eigenfunctions and their corresponding eigenvalues which we denote by \((\lambda _j, \xi _j (\omega ))\).

If \(\phi _k (\omega )\) are orthonormal, then \(\mathbf \Phi \) becomes the \(K \times K\) identity matrix and the eigenanalysis of the functional PCA problem in (21) reduces to

$$\begin{aligned} \frac{1}{N-1} \mathbf{C}^{\top } \mathbf{C} \mathbf{b}=\lambda \mathbf{b}, \end{aligned}$$

which is a multivariate PCA applied to the coefficient matrix \(\mathbf{C}\) obtained from the functional approximation of the power spectra, in place of the original data. One should realize, however, that this is not a standard multivariate PCA. As discussed above, multivariate PCA conducts an eigenanalysis of a \(p \times p\) covariance matrix. With a functional approximation using K basis functions, the eigenanalysis of functional PCA is applied to the \(K \times K\) matrix \(\frac{1}{N-1}\mathbf{C}^{\top } \mathbf{C}\), whose size depends on the value of K. In the case of a sparse approximation, which gives a small value of K, solving the eigenequation is much more efficient in terms of computational complexity.
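
A compact sketch of the eigenanalysis in Eqs. (21)-(23), assuming the coefficient matrix \(\mathbf{C}\) and the Gram matrix \(\mathbf \Phi \) have already been computed (helper name and interface are ours):

```python
import numpy as np
from scipy.linalg import sqrtm, eigh

def functional_pca(C, Phi):
    # Solve the symmetric eigenproblem of Eq. (22), then map each
    # eigenvector u_j back to the basis coefficients b_j = Phi^{-1/2} u_j
    # of the eigenfunction xi_j (Eq. (23)).
    N = C.shape[0]
    Phi_half = np.real(sqrtm(Phi))               # Phi^{1/2}
    M = Phi_half @ C.T @ C @ Phi_half / (N - 1)  # matrix of Eq. (22)
    lam, U = eigh(M)                             # ascending eigenvalues
    idx = np.argsort(lam)[::-1]                  # reorder, largest first
    lam, U = lam[idx], U[:, idx]
    B = np.linalg.solve(Phi_half, U)             # columns are the b_j
    return lam, B, U, Phi_half
```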

2.6 B-Spline as Functional Basis

In this work, we choose B-splines [3, 19] as the functional basis. B-splines are used very often in numerical analysis to approximate a function or a surface. Assume that there is a non-decreasing sequence of real numbers (i.e., \(t_k \le t_{k+1}\)) such that \(t_0\le t_1 \le \ldots \le t_{N-1}\), where N is the length of a signal. We call the set \(\{t_k \mid k\in \mathbb {Z}\}\) a knot set and each value \(t_k\) is referred to as a knot. Next, we define the augmented knot set \(t_{-v+2}=\ldots \) = \(t_0\le t_1 \le \ldots \le t_{N-1}\) = \(t_{N}\) \(\ldots =t_{N+v-2}\), where v is the order of the B-spline. We have appended knots at the lower and upper ends because the recursive formula of the B-spline requires them. Furthermore, we reset the index, so that the new index in the augmented knot set becomes k = 0, 1, \(\ldots \), \(N+2v-3\). For each augmented knot \(t_k\), \(k=0,1, \ldots , N+2v-3\), we define the B-spline recursively as follows:

$$ B_{k,0}(x) = {\left\{ \begin{array}{ll} 1, &{} \text {if }t_k \le x < t_{k+1} \\ 0, &{} \text {otherwise} \end{array}\right. } $$
$$\begin{aligned} B_{k,j+1}(x) =\gamma _{k,j+1} (x) B_{k,j} (x)+[1-\gamma _{k+1,j+1}(x)] B_{k+1,j}(x), \end{aligned}$$
(24)

for \(j=0,1,\ldots ,v-2\), where

$$ \gamma _{k,j+1} (x) = {\left\{ \begin{array}{ll} \frac{x-t_k}{t_{k+j+1}- t_k}, &{} \text {if }t_{k+j+1} \ne t_{k} \\ 0, &{} \text {otherwise} \end{array}\right. }. $$
Fig. 6. The results of principal component scores and eigenvalues of power spectra for sets C (Non-epileptogenic zone, Green), D (Epileptogenic zone, Blue) and E (Epileptic seizure onset, Light Blue). (Color figure online)

After this recursive process, the maximum number of non-zero basis functions is N; they are denoted by \(B_{1,v-1}\), \(B_{2, v-1}\), \(\ldots \), \(B_{N, v-1}\) and are called B-spline basis functions of degree \(v-1\). If the number of interior points of the knot set, denoted by \(N^*\), is smaller than \(N-2\), then the total number of basis functions is K = \(N^* +v\). When \(v=4\), these functions are called cubic B-spline basis functions. In this work, we assume

$$\begin{aligned} \phi _k(\omega )=B_{k, 3}(\omega ), \,\,\, \text {for} \,\, \, k=1, 2, \ldots ,N^*+4. \end{aligned}$$

So, the maximum number of basis functions used in this work is \(K=N^*+4\).
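
As an illustration of how such a basis can be constructed in practice, the following sketch builds a cubic B-spline basis matrix with `scipy.interpolate.BSpline`; the frequency grid `omega` and the set of interior knots are assumed inputs:

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(omega, interior_knots, degree=3):
    # Clamped (augmented) knot vector: degree + 1 copies of each
    # boundary knot, matching the augmented knot set described above.
    t = np.r_[[omega[0]] * (degree + 1),
              interior_knots,
              [omega[-1]] * (degree + 1)]
    K = len(t) - degree - 1                  # K = N* + v basis functions
    eye = np.eye(K)
    # Column k holds B_{k,3} evaluated on the grid.
    return np.column_stack([BSpline(t, eye[k], degree)(omega)
                            for k in range(K)])
```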

Fig. 7. The evolution of the first two extracted principal component scores of power spectra under different choices of K, for data sets A (Normal: Eyes Closed, Black), B (Normal: Eyes Open, Red), C (Non-epileptogenic zone, Green) and D (Epileptogenic zone, Blue). (Color figure online)

2.7 Feature Extraction by Functional Principal Component Analysis

After the jth eigenfunction \(\xi _j (\omega )\) is obtained, we can extract the principal component scores, denoted by \(P_j\), for the given power spectra \(\mathbf{I} (\omega ) \) as follows

$$\begin{aligned} P_j =\int \mathbf{I}(\omega ) \xi _j (\omega ) d \omega ,\,\,\, j=1,\ldots , K. \end{aligned}$$
(25)

Substituting (15) and (23) into the equation above, we get

$$\begin{aligned} P_j&= \int (\mathbf \mu +\mathbf{C} \mathbf \phi (\omega )) \mathbf \phi ^{\top }(\omega ) \mathbf \Phi ^{-1/2} \mathbf{u}_j \, d \omega \\ &= \mathbf \mu \Big (\int \mathbf \phi ^{\top }(\omega ) \, d \omega \Big ) \mathbf \Phi ^{-1/2} \mathbf{u}_j + \mathbf{C} \Big (\int \mathbf \phi (\omega ) \mathbf \phi ^{\top }(\omega ) \, d \omega \Big ) \mathbf \Phi ^{-1/2} \mathbf{u}_j \\ &= \mathbf \mu \Big (\int \mathbf \phi ^{\top }(\omega ) \, d \omega \Big ) \mathbf \Phi ^{-1/2} \mathbf{u}_j + \mathbf{C} \mathbf \Phi ^{1/2} \mathbf{u}_j. \end{aligned}$$

Thus, \(P_1\) is the first principal component score vector of the N power spectra, \(P_2\) is the second principal component score vector, and so on. One of the strengths of PCA is that only a small number of principal components needs to be retained, so that the dimension of the feature vector is low. Using functional PCA on the Fourier power spectra, we are able to focus on only the first two PCs for clustering the EEG signals.
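
Reusing the `functional_pca` sketch above, the scores in Eq. (25) can be computed in coefficient form; `mu0` denotes the grand mean and `int_phi` the (assumed precomputed) vector of integrals \(\int \phi _k(\omega ) \, d\omega \):

```python
import numpy as np

def fpc_scores(C, mu0, int_phi, Phi_half, U, n_comp=2):
    # Eq. (25) in coefficient form:
    #   P_j = mu (int phi)^T Phi^{-1/2} u_j + C Phi^{1/2} u_j,
    # where Phi_half and U come from the functional_pca sketch above.
    N = C.shape[0]
    mean_term = mu0 * (np.linalg.solve(Phi_half, int_phi) @ U[:, :n_comp])
    return np.ones((N, 1)) * mean_term + C @ Phi_half @ U[:, :n_comp]
```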

3 Results

In this study, we use the same data set as in [25], from the University of Bonn, Germany (http://epileptologie-bonn.de/cms/front_content.php?idcat=193), but we extend our study by including set E. We focus on both the epilepsy diagnosis and the epileptic seizure detection problems. From the results displayed in Figs. 2 and 3, one can see that the number of basis functions has a significant effect on the shape of the power spectra. When \(K=10\) is used to smooth the power spectra, the smoothed power spectra behave similarly for both types of signals: those from healthy people (sets A and B) and those collected from a non-epileptogenic zone of patients (set C). However, there are still some visible differences among the graphs, which may suggest that further classification based on these power spectra is needed to recognize the differences hidden within them. One can also see that the power spectra of signals collected from patients' epileptogenic zone (set D) are more volatile and look different from the signals of healthy people, although they share some commonalities with the signals from set C. These differences become clearer when the number of basis functions is \(K=50\). As K increases, the smoothed power spectra tend to capture more of the local patterns, which makes the difference between the sets of signals more pronounced. Furthermore, the patterns of the power spectra associated with signals from patients with seizure onset (set E) are completely different from the other types of signals. From Fig. 4, one can see that there is no clear functionality in these power spectra, and even as K decreases, no overall pattern among them becomes recognizable. This suggests that this type of signal may be easy to separate from the others. Overall, the graphical display offers some evidence that suitable clustering methods may differentiate these types of signals successfully. In particular, we can aim at clustering between sets A and B, which allows us to see the differences caused by artifacts. We can also combine the signals from healthy people (sets A and B) and the signals from patients without seizure onset (sets C and D), in order to see whether there is a clustering effect between patients' and healthy people's EEGs. However, one question remains: should the clustering be based on features that lead to a significant overall difference, or on features that offer a large difference in local patterns within the power spectra? This question will be answered later in the analysis of the evolution of the extracted features.

To further reduce the dimensionality of the power spectrum and of its functional mean and functional standard deviation, functional probe values are calculated as the inner product of a selected functional mean and a given signal power spectrum. Results using the functional standard deviation as the weight function are also studied. In our study, the best results, in terms of separability of features, are obtained using the functional mean calculated from set D. Using the functional mean as a probe, we extract a one-dimensional feature from each signal power spectrum. Figure 5 clearly displays the resulting pattern, which shows great separability of the extracted features (i.e., functional probe values) after the dimension reduction. We observe that the artifacts (i.e., eyes open or eyes closed) associated with healthy people can be identified when the power spectra of their signals are projected onto the power spectra of epileptic signals (i.e., set D signals). Using only a single feature dimension, the separation between set A and set B can be achieved. Similarly, these one-dimensional features are highly separable in the other cases. This suggests that the functional probe approach has certain merits in automatic clustering of different types of EEG signals.

To demonstrate the application to epileptic seizure detection, we use functional PCA to extract the principal components of the power spectra of the signals from sets C, D and E. From the results, one can see whether dominant signal components exist; additionally, the corresponding scores can be used for clustering. The obtained results are displayed in Fig. 6. From Fig. 6(a), one can see that the extracted features for non-seizure signals (sets C and D) and seizure signals (set E) form clusters that are linearly separable in the first PC. Figure 6(b) shows that the first eigenvalue is very dominant, which explains why the extracted features are highly separable in the first PC. This implies that the extracted first PC scores can be successfully classified using a simple classification method, such as k-NN. The second PC scores do not contribute to the clustering effect, as they overlap completely. We further investigate the effect of the number of basis functions K on the separability of the extracted signal features (i.e., the principal component scores of the power spectra). The proposed method is applied to the signals from sets A, B, C and D only, with varying values of K. The obtained results are displayed in Fig. 7. We observe that the proposed method is highly successful in separating the artifacts (i.e., open/closed eyes), and the results do not depend on the number of basis functions selected. The feature separability increases as K, the number of basis functions, decreases. This suggests that sparsity in the approximation of the signal power spectra plays an important role in the success of functional principal component analysis. When \(K=50\), the extracted features for the epileptic signals overlap significantly. The overlap shrinks as K decreases, and the features become fully separable when K is relatively small, for example around 20. When \(K=5\), the obtained results are optimal in the sense of feature separability. The overall separability between healthy and epileptic signals, however, is not affected by the number of basis functions. This suggests that the proposed method can be a successful tool for the epilepsy diagnosis problem, allowing the separation of healthy and patient signals.

4 Concluding Remarks

Clustering and classification of high-dimensional data are important aspects of both pattern recognition and artificial intelligence. To deal successfully with clustering and classification, dimension reduction of high-dimensional data plays an important role. In this work, we focus on feature extraction as a dimension reduction approach and use EEG signals to illustrate the proposed methodologies. We first transform the EEG signals to the spectral domain to obtain their power spectra. Next, we apply functional principal component analysis, which can be considered a special case of the functional probe, to further investigate the characteristics of the signals and to extract their features for clustering. We have demonstrated that functional principal component analysis in the spectral domain is useful for better understanding different types of EEG signals, and that the extracted features can be used for signal classification. We have also investigated the effect of sparsity on the separability of signal features, observing that the separability of the extracted features improves significantly when the number of components used to approximate the power spectra decreases. This may imply that sparse approximation of signals in the spectral domain is a necessary step for better clustering performance. From an application perspective, the obtained results demonstrate that the proposed method may be useful for both epilepsy diagnosis and epileptic seizure detection. Future work will focus on wavelet spectral domain functional PCA and its application to clustering general random signals, such as financial time series.