1 Introduction

Non-invasive electroencephalography (EEG) based brain computer interfaces (BCIs) are designed as assistive technologies for people with severe speech and muscle impairments providing means for them to communicate with their caretakers and families [2]. Event relate potentials (ERPs) are commonly employed by the EEG-based BCIs to detect the user intend [13, 5, 7]. Donchin and Farewell demonstrated that ERPs can be used to design a letter by letter typing BCI [3]. The matrix based presentation paradigm used in their design is shown to be highly gaze dependent [10]. On the other hand, rapid serial visual presentation (RSVP) paradigm is a gaze-independent alternative for matrix presentation paradigms. In RSVP, the symbols are rapidly presented as a time series on a prefixed location on the screen in a pseudo-random order [1, 5, 7].

RSVP Keyboard™  is a non-invasive EEG-based language-model-assisted BCI for typing which utilizes ERPs for intent detection. Inference module of the RSVP Keyboard™ probabilistically fuses the evidence extracted from the recorded multiple EEG channels with the probabilistic context information provided by a 6-gram language model [57]. This BCI system currently can employ both matrix-based presentation and RSVP paradigms. The EEG evidence is extracted using regularized discriminant analysis (RDA [6, 7]). RDA is a generalization of the quadratic discriminant analysis (QDA) which applies regularization and shrinkage on the maximum likelihood class covariance matrix estimators to remedy rank deficiencies [4]. RSVP Keyboard™  utilizes RDA because the dimensionality of the extracted EEG feature vectors is relatively higher than the number of measurements collected for supervised learning.

Alternative to the RDA method, in this manuscript, we propose a Kronecker product structure for the covariance matrices. We show that modeling multichannel EEG using an auto-regressive moving average (ARMA) model under certain assumptions leads to a covariance matrix with a Kronecker product structure. In this structure the number of parameters is significantly lower than RDA. The maximum likelihood estimation of the proposed parametric model of covariance matrix along with regularization lead to significant improvement in classification performance. Our offline analysis shows that the median of the percentage of improvement for different subjects across different presentation paradigms is 1.111 %.

2 Inference in RSVP Keyboard™

RSVP Keyboard™ utilizes a visual presentation module to detect the user intent. The EEG collected during the visual stimulation is then employed in decision making procedure.

2.1 Visual Presentation

In letter by letter typing task we assume a dictionary set \(\mathcal {D}\) of 26 letters in English alphabet, a space symbol “_” and a backspace symbol “<” as the set of all possible choices. Our system utilizes both matrix-based and rapid serial visual presentation paradigms. The different presentation paradigms are shown in Fig. 1a, b and c. Generally for all matrix-based presentation paradigms the dictionary members are arranged on a matrix shaped layout on the screen in gray color. In row and column presentation (RCP) paradigm the elements on each row or column of the matrix are assumed as a “trial” which are then flashed rapidly and in a pseudo-random order. One sequence for this presentation paradigm contains the presentation of all the rows and columns.

After each sequence the system will attempt to make a decision however if a predefined confidence threshold is not satisfied the system will capture more EEG in response to more sequences to improve the decision confidence. Accordingly the set of sequences which lead to a decision is called an “epoch”. Similarly, a trial in single character presentation (SCP) paradigm consists of only one element of dictionary and a sequence is defined as flashes of a subset of dictionary. For RSVP paradigm there exist no background matrix of characters but all the characters are presented on a prefixed location of the screen in a pseudo-random order and rapidly in time. In this presentation scheme each flashing letter is a trial and in each sequence a subset of dictionary is presented. The definition of epoch is the same among all presentation paradigms.

Fig. 1.
figure 1

Rapid serial visual and matrix-based presentation paradigms

2.2 Decision Making

The decision making process in RSVP Keyboard™  utilizes a maximum a posteriori (MAP) inference mechanism. During this procedure the context information from a language model (LM) is probabilistically fused with EEG evidence to produce a more accurate decision. The inference mechanism at epoch k and after observing sequence l is defined as follows:

$$\begin{aligned} \begin{aligned} \hat{s}_k^*= \arg \max _{\text {s}\in \mathcal {D}}P\left( s_k^*=\text {s}|\mathcal {E}^l;C\right) \end{aligned} \end{aligned}$$
(1)

where \(s^*_k\) is a random variable which represents the user intent in epoch k, \(\hat{s}_k^*\) is the estimated user intent, \(\mathcal {E}^l\) is the EEG evidences for all the observed l sequences in epoch k. Assuming that conditioned on the unknown symbol, the EEG evidence and context information are independent from each other, and again conditioned on the unknown symbol all EEG evidence from different trials are independent, we can simplify Eq. (1) as:

$$\begin{aligned} \widehat{s}_k^*=\arg \max _{\text {s}\in \mathcal {D}}P(s^*_k=\text {s}|C)\prod _{\begin{array}{c} i=1,~\ldots ,~l\\ \{j~|~\text {s}\,\in \,\mathbf {s}^i_j\} \end{array}}{ \frac{p\left( e_j^i|1\right) }{p\left( e_j^i|0\right) }} \end{aligned}$$
(2)

Here in (2), \(\mathbf {s}^i_j\) is the \(j^\mathrm{th}\) trial of the sequence \(i^\mathrm{th}\) of epoch k and \(e^i_j\) represents the EEG evidence associated with \(s^i_j\).

As in (2), one needs to define \(P(s^*_k=\text {s}|C)\) and class conditional distributions \(p\left( e|1\right) \), \(p\left( e|0\right) \) to be able to perform an inference.

Context Information. To define \(P(s^*_k=\text {s}|C)\) we utilize a letter n-gram LM which provides a prior probability mass function (PMF) over the dictionary. We have shown that context information fused with EEG evidence improves system performance effectively [5, 6]. An n-gram LM, mimics a Markov model of order \(n-1\), trough which it estimates the conditional PMF over the dictionary set based on \(n-1\) previously typed letters. Let \(C=\{s^*_m\}_{m\,=\,n-1,~\ldots ,~1}\), where \(s^*_m\) is the \(m^\mathrm{th}\) previously typed character, then

$$\begin{aligned} P(s|C)=P(s|\{s^*_m\}_{m\,=\,n-1,~\ldots ,~1}) = \frac{P(s,~s^*_{n-1},~\ldots ,~s^*_1)}{P(s^*_{n-1},~\ldots ,~s^*_1)} \end{aligned}$$
(3)

In our system, we use a 6-gram letter model, which is trained on the NY Times portion of the English Gigaword corpus [8].

Preprocessing and Feature Extraction. The class conditional distributions \(p\left( e|1\right) \), \(p\left( e|0\right) \) in RSVP Keyboard™  are estimated over the EEG evidences. To extract the EEG evidence from the EEG time signals, we begin with applying a two step dimensionality reduction following a preprocessing of recorded EEG. We use g.USBAmp bio-signal amplifier with the sampling frequency of 256 Hz to acquire the data. A bandpass linear-phase finite impulse response (FIR) filter with bandpass of [1.5, 42] Hz is then applied on the EEG data in order to improve the signal to noise ratio (SNR) and eliminate DC drifts. We down-sample the preprocessed data by order of 2. We concatenate the data from every channel in a time window of [0, 500) ms, time locked to onset of \(i^\mathrm{th}\) trial, to form the feature vector \(\mathbf {x}_i\) for that trial.

The supervised data required for estimating the class conditional distributions is recorded during “calibration” mode of the system [5]. Each calibration task of RSVP Keyboard™  consists of 100 sequences. Before each sequence the user is presented with a target character which she/he is supposed to locate during that sequence. For RSVP and SCP paradigms the number of trials in each sequence is set to 10, and for RCP it is equal to number of all rows and columns in the matrix (for instance, here we are using a \(4\times 7\) matrix which leads to 11 trials in a sequence).

The labels for the feature vectors is assigned as 0 if the trial in a non-target, or 1 if the trial contains the target character. To increase the EEG ERP detection accuracy, we further quadratically project the feature vectors on to a space which maximizes the distance between two classes. In RSVPKeyobard™  we use regularized discriminant analysis (RDA) which is a generalization of quadratic discriminant analysis (QDA) to perform this projection. In our system the dimensionality of feature vectors is relatively higher than the number of observation during a calibration session, hence we mainly utilize RDA to be able to estimate invertible covariance matrices. The maximum likelihood class conditional mean and covariance matrices are computed as follows:

$$\begin{aligned} \begin{aligned}&\varvec{\upmu }_h ={1 \over N_h}\sum _{i=1}^N {\mathbf {x}_i\delta (y_i,h)} \\&\mathbf {\Sigma }_h ={1 \over N_h}\sum _{i=1}^N {(\mathbf {x}_i-\varvec{\upmu }_h) (\mathbf {x}_i-\varvec{\upmu }_h)^T\delta (y_i,h)} \end{aligned} \end{aligned}$$
(4)

where \(y_i\in \{0, 1\}\) is the label of \(\mathbf {x}_i\), \(h\in \{0,1\}\) is the class for which we are performing the estimation, \(N_h\) is the number of observations in class h and \(N=N_0+N_1\). RDA makes the estimated covariance matrices invertible by applying regularization and shrinkage.

$$\begin{aligned} {\begin{matrix} &{}\widehat{\mathbf {\Sigma }}_h(\lambda )=\frac{(1-\lambda ) N_h\mathbf {\Sigma }_h+(\lambda )\sum _{h=0}^1{N_h\mathbf {\Sigma }_h}}{(1-\lambda ) N_h+(\lambda )\sum _{k=0}^1{N_h}} \\ &{}\widehat{\mathbf {\Sigma }}_h(\lambda ,\gamma )= (1-\gamma )\widehat{\mathbf {\Sigma }}_h(\lambda )+(\gamma ) {1\over p}tr[\widehat{\mathbf {\Sigma }}_h(\lambda )]\mathbf {I}_p \end{matrix}} \end{aligned}$$
(5)

Here, \(\lambda ,\gamma \in [0,1]\) are the shrinkage and regularization parameters, \(tr[\cdot ]\) is the trace operator and \(\mathbf {I}_p\) is an identity matrix of size \(p\times p\). RSVP Keyboard™  optimizes the \(\lambda \) and \(\gamma \) for the maximum area under the receiver operating characteristics (ROC) curve (AUC) in a 10-fold cross validation framework. The RDA score for \(e_i\), is then referred to as EEG evidence for trial \(\mathbf {s}_i\).

$$\begin{aligned} e_i=\log \left( \frac{f_\mathcal {N} (\mathbf {x}_i;\varvec{\upmu }_1, \widehat{\mathbf {\Sigma }}_1(\lambda ,\gamma ))}{f_\mathcal {N}(\mathbf {x}_i;\varvec{\upmu }_0, \widehat{\mathbf {\Sigma }}_0(\lambda ,\gamma ))} \right) \end{aligned}$$
(6)

where \(f_\mathcal {N}(\mathbf {x};\varvec{\upmu },\mathbf {\Sigma })\) is the Gaussian probability density function with mean \(\varvec{\upmu }\) and covariance \(\mathbf {\Sigma }\).

Consequently we use these EEG evidences in kernel density estimation (KDE) framework to define class conditional distributions. In our system we use Silverman rule of thumb to define the kernel width for KDE [9].

3 Signal Modeling and Covariance Estimation

Currently in our system we employ RDA to estimate full-ranked class conditional covariance estimates. But for a non-structured maximum likelihood estimation of covariance matrix one needs to estimate many parameters (i.e. elements of covariance matrix). But due to lack of enough observation in a calibration session, this estimation might be prone to errors. We propose to use a Kronecker product structure for the covariance matrices. This structure reduces the number of the covariance parameters to be estimated using the assumption of stationarity in time and space. We show that defining an auto-regressive moving average (ARMA) (p,q) model for the multi-channel EEG recordings leads to Kronecker product structure under certain assumptions.

Define \(\mathbf {v}[n]\) as the spatial feature vector of EEG recorded from \(N_{ch}\) EEG channels at time instant n:

$$\begin{aligned} \mathbf v [n] = \left[ \begin{array}{ccc} v_1[n] \\ v_j[n] \\ \vdots \\ \\ v_{N_{ch}}[n] \end{array} \right] _{N_{ch} \times 1} \end{aligned}$$
(7)

where \(v_i[n]\) is the \(n^\mathrm{th}\) time sample recorded at channel i. Then define the feature vectors as:

$$\begin{aligned} \mathbf {x}[i] =\left[ \begin{array}{ccc} \mathbf v [1] \\ \mathbf v [2] \\ \vdots \\ \\ \mathbf v [N_t] \end{array} \right] _{(N_{ch} N_t)\times 1} \end{aligned}$$
(8)

where \(\mathbf {v}[.]\) and accordingly \(\mathbf {x}\) follows a multivariate Gaussian distribution and \(N_t\) is the number of time samples. We define an ARMA model for EEG signal as follows:

$$\begin{aligned} \mathbf v [n]=\sum _{k=1}^{p}A_k \mathbf v [n-k]+\sum _{j=0}^{q}b_j \mathbf {w}[n-j] \end{aligned}$$
(9)

In (9), \(A_k\) represents the \(N_{ch}\times N_{ch}\) signal weight matrix at lag k, \(b_j\) is an scalar weight for noise at lag j and \(\mathbf {w}[n]\) represents multivariate wide sense stationary Gaussian noise for the \(n^\mathrm{th}\) time sample. Let us assume that the EEG signals among the channels is stationary. Then one can write (9) as:

$$\begin{aligned} \mathbf v [n]=\sum _{k=1}^{p} c_k\cdot \mathbf v [n-k]+\sum _{j=0}^{q}b_j\mathbf {w}[n-j] \end{aligned}$$
(10)

in which \(c_k\) is an scalar weight of time for the signal at time lag k. Now lets further assume \(p=1\) and \(b_j=0 \ \forall j=0 \ldots q\) then we can write:

$$\begin{aligned} \mathbf v _n=c_1\cdot \mathbf v [n-1] \end{aligned}$$
(11)

Then define:

  1. 1.

    Initial state v[0] \(\sim \mathcal {N}_{N_{ch}}(\varvec{\mu }_\mathbf{v [0]}, \varvec{\Sigma }_\mathbf{v }[0,0])\)

  2. 2.

    \(E[\mathbf v [n]]={\varvec{\mu }}_{\mathbf{v }[n]}\)

  3. 3.

    \(\Sigma _\mathbf{v }[m,n]=Cov[\mathbf v [m],{\mathbf{v }[n]]}\)

    \(~~~~~~~~~~~~=E[(\mathbf v [m]-{\varvec{\mu }}_{\mathbf{v }[m]}) (\mathbf v [n]-\varvec{\mu }_{\mathbf{v }[n]})^T]\).

  4. 4.

    \(\mathbf {x}=\left[ \begin{array}{ccc} \mathbf v [1] \\ \mathbf v [2] \\ \vdots \\ \\ \mathbf v [N_t] \end{array} \right] _{(N_{ch} N_t)\times 1}\)

Here note that based on the above definition we have \(\varvec{\Sigma }_\mathbf{v }[m,n] = \varvec{\Sigma }_\mathbf{v }[n,m]\). Also we have:

$$\begin{aligned} \mathbf v [n]=(c_1)^n\cdot \mathbf v [0] \Rightarrow E[\mathbf v [n]]=(c_1)^n\cdot \varvec{\mu }_\mathbf{v [0]} \end{aligned}$$
(12)

We further assume that the EEG signal is stationary in time. Lets assume \(m<n\) hence we have:

$$\begin{aligned} {\varvec{\Sigma }}_\mathbf{v }[m,n]&= {E}\{\mathbf{v }[n]\mathbf{v }[m]^T\}-{E}\{\mathbf{v }[n]\}E\{\mathbf{v }[m]\}^T \nonumber \\&={c}_{1}^{(n-m)}\cdot (E\{\mathbf{v }[m]\mathbf{v }[m]^T\}-\varvec{\mu }_{\mathbf{v }[m]}\varvec{\mu }_{\mathbf{v }[m]}^T) \nonumber \\&={c}_{1}^{(n-m)}\cdot {\varvec{\Sigma }}_\mathbf{v }[m,m] \nonumber \\&={c[|n-m|]} \cdot \varvec{\Sigma }_\mathbf{v }[0,0] \nonumber \\&\,\text {where}\,c[|n-m|]=c_1^{(|n-m|)} \end{aligned}$$
(13)

According to definition of \(\mathbf {x}\) one can define the covariance matrix of \(\mathbf {x}\) as:

$$\begin{aligned} \begin{array}{ll} \varvec{\Sigma }_{\mathbf {x}} &{}=\left[ \begin{array}{ccc} \varvec{\Sigma }_\mathbf v [1,1] &{} \ldots &{} \varvec{\Sigma }_\mathbf v [1,N_t]\\ \varvec{\Sigma }_\mathbf v [2,1]&{} &{} \varvec{\Sigma }_\mathbf v [2,N_t]\\ \vdots &{} &{} \vdots \\ \\ \\ \varvec{\Sigma }_\mathbf v [N_t,1] &{} \ldots &{} \varvec{\Sigma }_\mathbf v [N_t,N_t] \end{array} \right] \\ {} &{}\\ &{}=\left[ \begin{array}{ccc} c[0] &{} \ldots &{} c[N_t-1]\\ c[1]&{} &{} c[N_t-2]\\ \vdots &{} &{} \vdots \\ \\ \\ c[N_t-1] &{} \ldots &{} c[0] \end{array} \right] \otimes \varvec{\Sigma }_\mathbf{v }[0,0] \end{array} \end{aligned}$$
(14)

Finally, we assume the EEG signal is independent in time which means that \(c[|n-m|]=0\) for all \(m\ne n\). Then we have:

$$\begin{aligned} \varvec{\Sigma }_{\mathbf {x}}=c[0]\cdot \mathbf {I}_{N_t} \otimes \varvec{\Sigma }_\mathbf{v }[0,0] \end{aligned}$$
(15)

Here \(\mathbf {I}_{N_t}\) is an \(N_t\times N_t\) identity matrix.

Through a maximum likelihood framework we can estimate the parameter values of the structured covariance matrices. We specifically utilize a flipflop algorithm presented by Karl Werner in [11] for which we fix the time covariance matrix to identity and perform a one time estimation on channel covariance matrix.

4 Results

4.1 Participants

In this manuscript we utilized the calibration data collected from 9 healthy users who had consented to participate in our study according to the IRB-approved protocol (IRB130107) [5]. In our study, each user performed 12 calibration sessions for all possible combinations of 4 inter trial interval (ITI) values (\(\{200; 150; 100; 85\}\) ms) and 3 presentation paradigms (RCP, SCP and RSVP). According to the International 10/20 configuration, data recorded from 16 EEG locations: Fp1, Fp2, F3, F4, Fz, Fc1, Fc2, Cz, P1, P2, C1, C2,Cp3, Cp4, P5 and P6.

4.2 Data Analysis and Results

We calculated the area under the receiver operating characteristics (ROC) curve (AUC) values, for every calibration data using a 10-fold cross validation. The goal of this analysis is to assess the changes in classification accuracy under the proposed signal model.

For each particular ITI and presentation paradigm (PP) combination, we compared the median of AUC values for RDA and the proposed model in Table 1, and also we show the number of participants who demonstrate improvement under the proposed model in Table 2. In Table 1 we can see an improvement for RSVP at ITI = 150 ms which is the optimal speed for this presentation paradigm [5]. Also, the proposed model seems to be most effective in RCP paradigm. However, we cannot observe any significant improvement for SCP at any ITI. As shown in Table 2 most of the population could benefit from the proposed model at every PP and ITI combination. Among all the users at every ITI and PP combinations, half of the AUC values fall bellow .811. We utilized this value to define a threshold for high AUCs and low AUCs. The Table 3 represents the median AUC values for regular RDA and proposed covariance estimation technique. As one can clearly see in this table the participants with low AUCs can benefit more from the new signal modeling scheme.

We also compute the number of participants who demonstrate a classification improvement regardless of particular ITI value and PP. We assumed a participant can benefit from this signal modeling scheme if the median of all 12 ITI and PP combination AUCs improves. The corresponding results are shown in Table 4.

Table 1. The median of changes in AUC for each PP and ITI combination among nine users
Table 2. The number of participant for whom the proposed model improved the classification AUC, for each PP and ITI combination and among nine users
Table 3. Median of AUCs lower than 0.811 for the nine subjects when we use the signal modeling (SM) versus RDA for all PP and ITIs.
Table 4. Improvement in median of AUCs among all 12 ITI and PP combination for each user.

Table 4 shows that most of the population, 5 out of 9, demonstrate an improvement in classification AUC. Besides the amount of improvements is generally higher than \(1\,\%\) while the performance degradation is less than \(0.5\,\%\) for other users.

5 Discussions and Future Work

In this manuscript, we considered the EEG as a structured multivariate Gaussian data, and under certain assumptions, we modeled the covariance matrix of this signal to have a Kronecker product of a channel covariance matrix and an identity time covariance matrix. With this assumption on the covariance matrix, we reduced the number of parameters that are needed to be estimated. Correspondingly this decrease in the number of parameters to be estimated led to an increase in classification performance.

In this study at every presentation paradigm and inter trial interval combination, we compared the classification performances of two methods when the covariance matrix is estimated under the new structure versus the covariance is estimated without a specific structure using typical RDA. Results suggested that considering a structured EEG signal can significantly improve the ERP-detection specially when the RDA AUC is below 80 %. Future work will analyze and optimize additional structures such as Toeplitz or AR(p) structures for the covariance of the multichannel EEG signal.