1 Introduction

Long-term electroencephalogram (Long-Term EEG) refers to EEG recorded over an extended period rather than for a fixed, short duration [1]. It is used primarily for epilepsy monitoring, but also in intensive care units (ICU), operating rooms, and emergency rooms [2,3,4]. Because Long-Term EEG is used to monitor seizures, cerebrovascular diseases, and psychiatric conditions, recordings typically last from hours to days [5,6,7,8,9,10,11]. During recording, the EEG may contain signals from both neuronal and non-neuronal sources; the latter are referred to as artifacts and interfere with the neural signals of interest [12]. Artifacts are usually identified and removed manually before the EEG is analyzed further [13]. However, this manual annotation procedure is both time-consuming and subjective [14], making an efficient and reliable automatic artifact removal tool highly desirable.

Several advanced algorithms have been developed for the automated preprocessing of artifacts. They fall into two categories: identifying or detecting artifacts in EEG, and processing artifacts once they have been detected [15,16,17,18,19]. In recent years, many methods based on deep learning (DL) or machine learning (ML) have been proposed for the former category [20,21,22,23,24,25,26,27,28,29]. For example, in 2022, a convolutional neural network with transformer (CNN-Transformer) was proposed to detect artifacts at the single-channel and segment levels, and was validated on the TUH Artifact (TUH-ART) dataset [30]. For the latter category, in addition to the well-established independent component analysis (ICA), artifact subspace reconstruction (ASR), and signal space projection (SSP) methods that have been applied to the correction and reconstruction of EEG, novel DL-based denoising methods continue to emerge [31,32,33,34,35,36,37,38,39,40].

However, the aforementioned methods have certain limitations. Most researchers who employ DL validate their methods on only one or a few datasets, which limits generalization and scalability [41]. Additionally, algorithms based on signal reconstruction theory struggle to process epochs that have been heavily corrupted by artifacts. In such cases, signal completion techniques can be beneficial; however, in the absence of ground-truth recordings for these completed signals in a clinical setting, repairing the artifacts becomes particularly challenging. Given these circumstances, rejecting such epochs is a viable and often favorable choice.

Most existing tools reject epochs based on peak-to-peak amplitude (PTP), and mainstream EEG analysis software integrates PTP-based methods [42, 43]. However, the choice of PTP threshold is data-specific and requires practitioner expertise, making an automated specification of the threshold preferable. In this regard, the automated artifact rejection for MEG and EEG data (Autoreject) has achieved remarkable success and has been utilized in various types of research [44]. Nevertheless, Autoreject has some drawbacks. Firstly, because it relies on an interpolation algorithm and Bayesian optimization, it runs slowly on large amounts of data. Secondly, its performance is unsatisfactory when the overall quality of the signal is poor. These shortcomings make it unsuitable for Long-Term EEG.
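To make the threshold sensitivity concrete, the following sketch illustrates fixed-threshold PTP rejection on synthetic data; the array shapes, the 150 µV threshold, and the injected spikes are illustrative assumptions, not taken from any cited tool.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 epochs x 4 channels x 500 samples (1 s at 500 Hz).
epochs = rng.normal(0.0, 10e-6, size=(100, 4, 500))
epochs[::10, :, 250] += 300e-6  # every 10th epoch gets a large artifact spike

ptp = epochs.max(axis=-1) - epochs.min(axis=-1)   # per-channel PTP, shape (100, 4)
threshold = 150e-6                                # data-specific manual choice
keep = (ptp < threshold).all(axis=1)              # reject if any channel exceeds it
clean = epochs[keep]
```

Choosing `threshold` is precisely the manual, data-specific step that Autoreject, and the method proposed in this paper, aim to automate.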

In order to address the aforementioned issues, this paper proposes a novel automatic method for the artifact rejection of Long-Term EEG. The proposed method can significantly improve the overall data quality in a relatively short running time, making the method both reliable and fast. The superior performance of the proposed method is mainly due to the ability of Isolation Forest (IF) to accurately partition the feature space, as well as the linear time complexity of IF. The main contribution of this paper is the development of a reliable and fast automatic rejection method for clinical Long-Term EEG. This method avoids the manual selection of peak-to-peak amplitude threshold for artifact rejection by iteratively applying IF, in which a metric is designed to measure the class distance between epochs that should be dropped and retained, enabling the prompt termination of IF iteration. The evaluation of the proposed method includes the comparison with four state-of-the-art methods, comprising two unsupervised methods and two supervised methods, across Long-Term EEG recordings of six retrospectively collected patients.

2 Materials and methods

In this section, the description of patients and a series of basic preprocessing operations are introduced first. Subsequently, the proposed automatic artifact rejection method based on IF is detailed comprehensively.

This method is designed to provide a reliable and efficient tool for swiftly removing corrupted segments from the data when conducting non-continuous, epoch-level EEG analyses in research or clinical settings, thereby avoiding extensive manual rejection of artifacts introduced by subjects during Long-Term EEG recording. In this regard, the PTP metric remains an effective and practical choice. Furthermore, the IF algorithm was originally designed for efficient and precise outlier detection, making it well-suited for this purpose. When the two are appropriately combined, the algorithm can readily detect outlier epochs in a Long-Term EEG recording. However, due to the complexity of clinical environments, subjects can rarely maintain the fully relaxed resting-state that is often desirable for research and medical purposes; instead, they typically exhibit some degree of mental activity or minor movements. As a result, data segments with low noise levels are relatively scarce, and the recording as a whole is characterized by moderate noise, often accompanied by notable artifacts. In this scenario, the proposed method applies the IF algorithm iteratively to identify outliers in the data while retaining epochs with relatively low PTP values in each iteration, until all epochs distorted by artifacts are correctly classified. Following this design philosophy, the flowchart of the proposed method is shown in Fig. 1, and its main procedures are demonstrated in the remainder of this section.

Fig. 1
figure 1

Flowchart of the proposed method

2.1 Data preparation and basic preprocessing

The present study enrolled 6 subjects in total, of which 2 subjects had epilepsy and 4 subjects were healthy. Data from all subjects were retrospectively collected from the database of the Department of Neurology of Nanjing Drum Tower Hospital from 2021 to 2022. All the subjects provided informed consent and underwent Long-Term Video-EEG for up to 20 h. The inclusion criteria for epilepsy were as follows: (1) subject age ≥ 18 years old; (2) EEG recordings of subjects showed obvious epileptic discharge, and subjects had previously experienced seizures. The inclusion criteria for healthy subjects were the following: (1) subject age ≥ 18 years old; (2) the subjects presented for consultation due to heatstroke or syncope; however, no visible abnormalities were observed in the EEG. This study was approved by the Ethics Committee of the Department of Neurology, Nanjing Drum Tower Hospital, Nanjing, China, with the approval number 2021-432-02.

As shown in Fig. 2, a 19-channel montage based on the 10-20 International System was used to collect the EEG signals. To keep the configuration consistent across subjects, the recordings were referenced to the average of the earlobe electrodes A1 and A2. Since the sampling rates of the EEG recordings differ, the data were resampled at 500 Hz for the convenience of subsequent analysis. High-frequency muscle artifacts resulting from movements such as chewing and head motion are common in EEG recordings, and eye movement signals often interfere with subsequent analysis. Therefore, to verify the effectiveness of the proposed method, bandpass filtering and independent component analysis were intentionally omitted as means of reducing or suppressing these artifacts. Instead, only notch filtering was used, to remove global noise associated with the power source and ensure proper operation of the methods. To remove power-line interference, a zero-phase FIR notch filter with a Hamming window was designed using MNE [43] and applied at 50 Hz to each subject's EEG recording.

Fig. 2
figure 2

Channel configuration of selected Long-Term EEG recordings

2.2 Feature extraction

Initially, as shown by the first block in the middle column of Fig. 1, the features of epochs in the original signals are extracted. PTP is widely used in the analysis of EEG; for notational simplicity, a formal definition is established in this subsection. Because EEG research often involves large amounts of data, the raw data are typically divided into epochs of equal time length, so the PTP value for each channel in one epoch is defined by the following equation:

$${E}_{n}^{c}=\underset{k}{\mathrm{max}}\,{x}_{n}^{c}\left(k\right)-\underset{k}{\mathrm{min}}\,{x}_{n}^{c}\left(k\right)$$
(1)

where \({E}_{n}^{c}\) is the PTP value of the \(c\)th channel in the \(n\)th epoch, \({x}_{n}^{c}(k)\) is the discrete-time course of a single channel within an epoch, with \(k\) being the sampling point index, \(0\le k\le K-1\), and \(K\) the number of samples per epoch (equal to the sampling rate for the 1-s epochs used here). By computing the PTP value \({E}_{n}^{c}\) for each channel in each epoch, the raw EEG time series can be reformulated as a new feature matrix:

$${X}_{\mathrm{feature}}=\left[\begin{array}{cccc}{E}_{1}^{1}& {E}_{2}^{1}& \cdots & {E}_{N}^{1}\\ {E}_{1}^{2}& {E}_{2}^{2}& \cdots & {E}_{N}^{2}\\ \vdots & \vdots & \ddots & \vdots \\ {E}_{1}^{C}& {E}_{2}^{C}& \cdots & {E}_{N}^{C}\end{array}\right]$$
(2)

Thus, \({X}_{\mathrm{feature}}\) is a dense matrix of size \(C\times N\).
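Equations (1) and (2) can be sketched directly in NumPy; the function name and the synthetic input below are illustrative.

```python
import numpy as np

def ptp_feature_matrix(epochs: np.ndarray) -> np.ndarray:
    """Eqs. (1)-(2): `epochs` has shape (N, C, K); returns X_feature, shape (C, N)."""
    ptp = epochs.max(axis=-1) - epochs.min(axis=-1)  # E_n^c for every channel/epoch
    return ptp.T                                      # transpose to C x N, as in Eq. (2)

rng = np.random.default_rng(42)
# 3600 one-second epochs of 19-channel data sampled at 500 Hz, as in the experiments.
X_feature = ptp_feature_matrix(rng.normal(size=(3600, 19, 500)))
```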

2.3 Isolation Forest

Following the extraction of epoch-level features, the subsequent step is to identify the corrupted epochs. Since the ratio of corrupted epochs to clean epochs in a given Long-Term EEG recording is indeterminate, it is suitable to define the process of corrupted epoch rejection as a two-class imbalanced classification task. Hence, methods formulated for anomaly detection can be employed to resolve this problem.

This paper proposes to use the IF to distinguish between corrupted and clean epochs. The reason to choose the IF is that it has linear time complexity with a low constant and a low memory requirement [45]. The IF assumes that, when randomly partitioning the feature space of specified samples, samples with distinguishable features are more likely to be separated in early partitioning [46]. Thus, nodes with shorter path lengths in the isolation trees are highly likely to be anomalies.

Overall, anomaly detection using the IF is a two-stage procedure. The first stage involves constructing isolation trees through the application of the specified samples and the partitioning process, while the second stage entails passing the samples through the isolation trees to generate an anomaly score for each sample.

An isolation tree built from a dataset \(X=\{{x}_{1}, {x}_{2}, \dots , {x}_{n}\}\) is a full binary tree, where each node has exactly zero or two children. In the first stage of the IF algorithm, a collection of \(M\) isolation trees is generated by first sub-sampling the provided set \(X\) to size \(\psi\) and then recursively partitioning the resulting set \({X}_{\psi }\). Each partition operation selects a feature dimension \(q\) of the samples and a split value \(p\), such that the comparison \(q < p\) divides the samples into a left subtree and a right subtree. The construction of an isolation tree terminates when any of the following conditions holds:

  1. (1)

    The depth of the tree reaches the limit \(l\), where \(l=\lceil {\mathrm{log}}_{2}\psi \rceil\).

  2. (2)

    \(\left|{X}_{\psi }\right|=1\).

  3. (3)

    All samples in the node share the same value of the selected feature \(q\).

Upon completion of the aforementioned operations for \(M\) times, an Isolation Forest consisting of \(M\) isolation trees is constructed.

In the second stage of the IF algorithm, an anomaly score \(s\) is derived from the expected path length \(E(h(x))\) for each sample presented in \(X\), where \(h(x)\) is calculated by counting the number of edges from the root node to a terminating node as instance \(x\) traverses through an isolation tree and \(E(h(x))\) is the average of \(h(x)\) from a collection of isolation trees. As described in [46], the trees constructed by the IF algorithm have an equivalent structure to a binary search tree (BST). In a BST, an unsuccessful search is defined as the inability to find a specified element within the tree. Therefore, the path length of termination due to an external node in the IF algorithm can be estimated using the theory of BST. The average path length of an unsuccessful search in a binary search tree is defined as follows:

$$c\left(n\right)= 2H\left(n - 1\right)- \left(\frac{2\left(n - 1\right)}{n}\right)$$
(3)

where \(H(i)\) is the harmonic number and it can be estimated by \(\mathrm{ln}(i) + 0.5772156649\) (Euler’s constant). Since \(c(n)\) can be used to normalize \(h(x)\), the anomaly score \(s\) of a sample is formulated as:

$$s= {2}^{-\frac{E\left(h\left(x\right)\right)}{c\left(n\right)}}$$
(4)

The anomaly score \(s\) can be used to identify anomalies within the data. The algorithm arranges the data in descending order according to \(s\), and the first \(m\) instances are the top \(m\) anomalies. As illustrated in Fig. 1, the IF algorithm is utilized repeatedly. Following each iteration, the most probable outliers are identified as potentially corrupted epochs.
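The two-stage procedure maps directly onto scikit-learn's `IsolationForest`, which is also the implementation used later in the experiments; the synthetic PTP features below are an illustrative assumption.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical PTP features: 200 epochs x 19 channels; the first 5 are corrupted.
features = rng.normal(50e-6, 5e-6, size=(200, 19))
features[:5] += 400e-6

iforest = IsolationForest(n_estimators=100, random_state=0).fit(features)
labels = iforest.predict(features)           # +1 for inliers, -1 for outliers
scores = -iforest.score_samples(features)    # anomaly score s of Eq. (4); higher = more anomalous

outlier_idx = np.flatnonzero(labels == -1)   # candidate corrupted epochs
```

Note that `score_samples` returns the negated anomaly score, so its sign is flipped here to match Eq. (4).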

2.4 Boundary adjustment

Since the IF detects outliers on the basis of the anomaly score, it is capable of discovering corrupted epochs in each iteration. According to the definition of PTP, clean epochs are expected to have lower PTP values than heavily corrupted epochs. In clinical Long-Term EEG monitoring, however, patients only need to wear the signal collector and may engage in activities such as eating and drinking within the monitoring area; the proportion of time they spend lying down in a resting-state versus performing daily activities is therefore unknown. As a result, the collected signals couple resting-state EEG with EEG from other daily activities, and the aggregation trend of the data in a clinical long-term recording that needs to be preprocessed is non-deterministic. In the feature space, when epochs with higher PTP values form the dominant cluster, epochs with lower PTP values (representing data recorded while the patient is in a resting-state) lie far from the cluster center and are often identified as outliers by the IF algorithm and eliminated, despite being the epochs of interest.

To address this issue, six methods are proposed to adjust the boundary between outliers and inliers. Since it is unambiguous that clean epochs always have a lower value of PTP, those epochs considered outliers by the IF but with a low value of PTP should be retained as well. After each decision of the IF, the data can be divided into two sets, one as inliers and the other as outliers, which can be expressed as follows:

$${S}_{\mathrm{inliers}}=\left\{{s}^{1},{s}^{2},\dots ,{s}^{p}\right\}$$
(5)

and

$${S}_{\mathrm{outliers}}=\left\{{s}^{1},{s}^{2},\dots ,{s}^{q}\right\}$$
(6)

where \(p\) and \(q\) are the numbers of epochs determined to be inliers and outliers in the current iteration. Owing to the multi-dimensional nature of EEG signals, epoch-level feature extraction using PTP values collapses and reduces the temporal dimension, but the spatial (channel) dimension is still preserved. The main design motivation of the proposed method is to compare and iterate on the most salient features of the data, enabling the progressive identification of epochs heavily contaminated by artifacts. Principal component analysis (PCA) is a classical data analysis technique that linearly combines the original variables into new composite variables while preserving the maximum amount of information. To identify the primary components of variance in the data, PCA is employed in the boundary adjustment stage to reduce dimensionality and eliminate redundant information. To represent the boundary adjustment method mathematically, the epoch-level PTP value after PCA reduction is defined as:

$${E}_{\mathrm{epo}}\left(s\right)=\mathrm{PCA}\left(s, {n}_{\mathrm{components}}=1\right)$$
(7)

where \(s\in {S}_{\mathrm{inliers}}\) or \(s\in {S}_{\mathrm{outliers}}\), and \({n}_{\mathrm{components}}=1\) indicates that the original feature matrix is reduced to one dimension. Thus, the maximum value of the retained epochs is used to adjust the boundary:

$${E}_{\mathrm{max}}^{\mathrm{inliers}}=\mathrm{max}\left({E}_{\mathrm{epo}}\left(s\right)\right)$$
(8)

where \(s\in {S}_{\mathrm{inliers}}\), and \({E}_{\mathrm{max}}^{\mathrm{inliers}}\) represents the maximum PTP value of the retained epochs. If \({E}_{\mathrm{epo}}(s)<{E}_{\mathrm{max}}^{\mathrm{inliers}}\) for \(s\in {S}_{\mathrm{outliers}}\), then this epoch is regarded as an inlier and retained with the other inliers. This boundary method is denoted as the Max Method. In the same way, the minimum value of the retained epochs can be defined as:

$${E}_{\mathrm{min}}^{\mathrm{inliers}}=\mathrm{min}\left({E}_{\mathrm{epo}}\left(s\right)\right)$$
(9)

and it is used as a comparator to retain epochs with \({E}_{\mathrm{epo}}(s)<{E}_{\mathrm{min}}^{\mathrm{inliers}}\) for \(s\in {S}_{\mathrm{outliers}}\). This boundary method is denoted as the Min Method. Similarly, the mean and median values of the retained epochs can be used to adjust the boundary between outliers and inliers. They are defined as:

$${E}_{\mathrm{mean}}^{\mathrm{inliers}}=\mathrm{mean}\left({E}_{\mathrm{epo}}\left(s\right)\right)$$
(10)

and

$${E}_{\mathrm{median}}^{\mathrm{inliers}}=\mathrm{median}\left({E}_{\mathrm{epo}}\left(s\right)\right)$$
(11)

respectively. Then, epochs labeled as outliers by the IF are retained if \({E}_{\mathrm{epo}}(s)<{E}_{\mathrm{mean}}^{\mathrm{inliers}}\) or \({E}_{\mathrm{epo}}(s)<{E}_{\mathrm{median}}^{\mathrm{inliers}}\) for \(s\in {S}_{\mathrm{outliers}}\). These two methods are called the Mean Method and the Median Method, respectively. Moreover, distributional metrics were also taken into consideration: the kurtosis and skewness of the distribution are used to adjust the boundary between outliers and inliers, giving the Kurtosis Method and the Skewness Method. They are defined as:

$${E}_{\mathrm{kurtosis}}^{\mathrm{inliers}}=\mathrm{kurtosis}\left({E}_{\mathrm{epo}}\left(s\right)\right)$$
(12)

and

$${E}_{\mathrm{skewness}}^{\mathrm{inliers}}=\mathrm{skewness}\left({E}_{\mathrm{epo}}\left(s\right)\right)$$
(13)

respectively. These six methods are all effective for adjusting the boundary but have different characteristics, which will be explored in the next section.
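As a concrete sketch, the Min Method of Eqs. (7) and (9) can be written as follows. The function and test data are illustrative; note that the sign of a principal component is arbitrary, so the projection is oriented here such that larger values correspond to larger average PTP, an implementation detail the text leaves implicit.

```python
import numpy as np
from sklearn.decomposition import PCA

def min_method(feat_inliers: np.ndarray, feat_outliers: np.ndarray) -> np.ndarray:
    """Return a mask over `feat_outliers` marking epochs to reclaim as inliers."""
    feats = np.vstack([feat_inliers, feat_outliers])
    proj = PCA(n_components=1).fit_transform(feats).ravel()   # E_epo(s), Eq. (7)
    # Orient the component so that larger projections mean larger average PTP.
    if np.corrcoef(proj, feats.mean(axis=1))[0, 1] < 0:
        proj = -proj
    e_in = proj[: len(feat_inliers)]
    e_out = proj[len(feat_inliers):]
    return e_out < e_in.min()          # Eq. (9): below the inliers' minimum

rng = np.random.default_rng(1)
inliers = rng.normal(100e-6, 5e-6, size=(50, 19))
outliers = np.vstack([
    rng.normal(40e-6, 2e-6, size=(3, 19)),    # resting-state epochs, low PTP
    rng.normal(400e-6, 20e-6, size=(3, 19)),  # genuinely corrupted epochs
])
reclaimed = min_method(inliers, outliers)     # only the low-PTP epochs are reclaimed
```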

2.5 Termination condition of the IF iteration

In order to implement automatic rejection of artifacts, it is necessary to specify a termination condition for the IF iteration. For this reason, two classes should be defined:

  1. (1)

    \({\Omega }_{\mathrm{retain}}\) : all the epochs that are confirmed to be clean by the algorithm.

  2. (2)

    \({\Omega }_{\mathrm{drop}}\): all the epochs that are confirmed to be artifact-corrupted by the algorithm.

After adjusting the boundary of the IF's separation results, the sets \({S}_{\mathrm{inliers}}\) and \({S}_{\mathrm{outliers}}\) are rearranged and merged into \({\Omega }_{\mathrm{retain}}\) and \({\Omega }_{\mathrm{drop}}\). Using the L2 norm, the distance between \({\Omega }_{\mathrm{retain}}\) and \({\Omega }_{\mathrm{drop}}\) is defined as:

$$\mathrm{Dist}= {\left|\mathrm{max}\left({E}_{\mathrm{epo}}\left({s}^{i}\right)\right)-\mathrm{min}\left({E}_{\mathrm{epo}}\left({s}^{j}\right)\right)\right|}_{2}$$
(14)

where \({s}^{i}\in {\Omega }_{\mathrm{retain}}\) and \({s}^{j}\in {\Omega }_{\mathrm{drop}}\). As the number of iterations increases, this distance changes, because the gap in PTP values between clean and artifact-corrupted epochs varies across successive iterations. Finally, once all artifact-corrupted epochs have been correctly classified, the IF algorithm will either detect no outliers or misclassify only a few clean epochs as outliers. In either scenario, the distance between the two classes remains unchanged; the iteration is then terminated and the automatic rejection is complete.
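Putting the pieces together, the iterate-until-stable loop can be sketched as below. This is a simplification of the published pipeline: PCA is fitted once on all epochs rather than per iteration, only the Min Method boundary adjustment is shown, and all names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

def iterative_rejection(features: np.ndarray, max_iter: int = 20):
    """features: (N, C) PTP matrix. Returns (retained, dropped) epoch indices."""
    proj = PCA(n_components=1).fit_transform(features).ravel()
    if np.corrcoef(proj, features.mean(axis=1))[0, 1] < 0:
        proj = -proj                      # orient PC1 with increasing PTP
    retain = np.arange(len(features))
    drop = np.array([], dtype=int)
    prev_dist = None
    for _ in range(max_iter):
        labels = IsolationForest(random_state=0).fit_predict(features[retain])
        out = retain[labels == -1]
        inl = retain[labels == 1]
        if len(out) == 0:                 # IF finds no outliers: stop
            break
        # Min Method: reclaim flagged epochs below the inliers' minimum projection.
        reclaim = out[proj[out] < proj[inl].min()]
        out = np.setdiff1d(out, reclaim)
        retain = np.union1d(inl, reclaim)
        drop = np.union1d(drop, out)
        if len(drop) == 0:                # nothing confirmed dropped yet
            continue
        dist = abs(proj[retain].max() - proj[drop].min())   # Eq. (14)
        if prev_dist is not None and np.isclose(dist, prev_dist):
            break                         # class distance stable: stop
        prev_dist = dist
    return retain, drop

rng = np.random.default_rng(2)
feats = np.vstack([rng.normal(60e-6, 5e-6, size=(100, 19)),
                   rng.normal(400e-6, 20e-6, size=(10, 19))])
retained, dropped = iterative_rejection(feats)
```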

The source code for this research is available on GitHub at the following URL: https://github.com/RunKZhang/Isolation_Forest_Automatic_Rejection.

3 Experiments and results

In this section, the proposed method is verified and compared with the state-of-the-art. For each patient described in the previous section, the first 2 h of the recording were extracted. Each 2-h segment was split at the midpoint into two 1-h datasets, and non-overlapping 1-s epochs were then created for each 1-h dataset. Thus, twelve 1-h data segments with 3600 epochs each were built. The first 2 h were chosen because, upon registration and commencement of Long-Term EEG recording, patients are typically unable to remain still and at rest, so this portion of the recording is rich in artifacts and provides a demanding test case. The segmentation of the EEG recordings was performed using MNE.

Considering that the length of data processed by different algorithms may not be consistent, the evaluation metrics for data quality should be independent of data length. Thus, the metrics used to evaluate the data quality before and after rejection were Overall Data Quality (ODQ) and Overall Data Quality Rating (DQR) introduced by Zhao et al. in [47]. The ODQ value is expressed as:

$$ODQ=\frac{{M}_{\mathrm{good} \mathrm{windows}}}{{M}_{\mathrm{total} \mathrm{windows}}}\times 100\%$$
(15)

where \({M}_{\mathrm{good} \mathrm{windows}}\) is the number of good windows in a data segment and \({M}_{\mathrm{total} \mathrm{windows}}\) is the total number of windows in that segment. A window refers to the time course of a single channel within an epoch, and a good window is one whose time course has an acceptable noise level. Since the quantitative signal quality evaluation method for EEG proposed in [47] is threshold-based, it requires no ground-truth labels for good windows and identifies them automatically from a set of parameters. Because it requires no manual labeling, this metric is particularly suitable for evaluating data quality when the dataset lacks true labels and is very large. The total number of windows \({M}_{\mathrm{total} \mathrm{windows}}\) is given by:

$${M}_{\mathrm{total} \mathrm{windows}}=C\times N$$
(16)

where \(C\) is the number of channels in a data segment and \(N\) is the number of epochs in a data segment. From its definition, ODQ is suitable for evaluating the quality of the same data segment after processing by different methods: the higher the ODQ value, the better the data quality. The authors in [47] manually partition the DQR ratings, which correspond to “perfect,” “good,” “poor,” and “bad.” The rating is determined by the ODQ value: a recording is rated D (bad) when ODQ < 60, C (poor) when 60 ≤ ODQ < 80, B (good) when 80 ≤ ODQ < 90, and A (perfect) when ODQ ≥ 90. The ODQ values and their corresponding DQR ratings are illustrated in Table 1.

Table 1 The ODQ value and corresponding DQR
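The ODQ-to-DQR mapping of Table 1 is straightforward to encode; the helper names below are illustrative, and the criteria that decide whether an individual window is "good" follow the threshold-based rules of [47], which are not reproduced here.

```python
def odq(n_good_windows: int, n_total_windows: int) -> float:
    """Eq. (15): percentage of good windows among the C x N windows."""
    return 100.0 * n_good_windows / n_total_windows

def dqr(odq_value: float) -> str:
    """Table 1: map an ODQ percentage to its DQR rating."""
    if odq_value >= 90:
        return "A"   # perfect
    if odq_value >= 80:
        return "B"   # good
    if odq_value >= 60:
        return "C"   # poor
    return "D"       # bad
```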

The IF is implemented using scikit-learn with the default parameters, which are shown in Table 2. Since the IF is an ensemble method, the number of base estimators is the number of random trees in the forest, which defaults to 100 in scikit-learn. The number of samples denotes the amount of data drawn from the dataset to construct each random tree; setting this parameter to “Auto” means that scikit-learn selects the minimum of the dataset size and 256. The contamination represents the proportion of outliers in the original data; setting it to “Auto” means that the threshold is determined as in the original paper [46, 47]. The number of features denotes the fraction of the feature dimensions drawn from the original data to train the random trees, and 1.0 means that all features are used for training. Bootstrap controls the sampling scheme: “True” means the training data are sampled with replacement, and “False” means sampling without replacement. The ODQ values and DQR ratings are calculated using the WeBrain platform and Python scripts [48].

Table 2 The parameters of the IF in scikit-learn
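For reference, the defaults of Table 2 written out explicitly as a scikit-learn constructor call (these are the library defaults, so listing them changes no behavior):

```python
from sklearn.ensemble import IsolationForest

iforest = IsolationForest(
    n_estimators=100,      # number of random trees in the forest
    max_samples="auto",    # min(256, n_samples) drawn to build each tree
    contamination="auto",  # threshold determined as in the original IF paper
    max_features=1.0,      # every feature dimension is used for training
    bootstrap=False,       # sub-sampling without replacement
)
```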

The data quality prior to artifact rejection is shown in Fig. 3. The ODQ varies widely, from 0 to 88.89, and the DQR ranges from D to B. The average quality value of these segments is 53.24 ± 27.87 and the average rating is D.

Fig. 3
figure 3

Overall data quality and corresponding ratings

3.1 Evaluation of six boundary adjustment methods

The six methods for boundary adjustment were compared in terms of ODQ values in our experiment to illustrate which adjustment method is the most reasonable. The ODQ values after rejection by the six boundary adjustment methods are illustrated in Fig. 4.

Fig. 4
figure 4

Boxplot of the performance of six boundary adjustment methods

From Fig. 4, it can be noted that the lowest ODQ value of the Min Method is 22.22, while the lowest ODQ values of the other five methods are 0, 27.77, 27.77, 19.64, and 19.64, respectively. Meanwhile, the highest ODQ values of the boundary adjustment methods based on simple statistics exceed 90, while those of the two methods based on distribution metrics only exceed 80. This indicates that all six boundary adjustment methods are effective at eliminating artifact-corrupted epochs. However, the box of the Min Method is the smallest of the six, implying that the overall data quality after rejection with the Min Method as the boundary adjustment method is the most consistently high. As the distribution curves on the right side of each box show, the ODQ values of the other five methods are more dispersed, and the average ODQ value of the Min Method is higher than those of the other methods.

The number of iterations required for convergence for each data segment is illustrated in Fig. 5. The distance between two classes is calculated after each iteration, and the iteration stops when the distance values of two adjacent iterations are the same.

Fig. 5
figure 5

Numbers of iterations required for convergence by the six methods that use different statistical indicators as the centroid. a Min Method. b Max Method. c Median Method. d Mean Method. e Kurtosis Method. f Skewness Method

The distance values recorded in Fig. 5 were calculated after each iteration. Before the first iteration, the set \({\Omega }_{\mathrm{drop}}\) is empty, i.e., no epochs have yet been confirmed as artifact-corrupted. Equation (14) requires the minimum value over the set \({\Omega }_{\mathrm{drop}}\), which is undefined for an empty set; hence, Fig. 5 omits the distance values before any iterations and starts recording from the first iteration onwards. After the first iteration, both \({\Omega }_{\mathrm{retain}}\) and \({\Omega }_{\mathrm{drop}}\) contain elements; however, because different data segments were evaluated in the experiments, the elements of the two sets differ, resulting in different calculated distance values. Figure 5 shows a discernible trend: the Max Method requires the fewest iterations to converge, needing only two in most data segments, and the Kurtosis Method likewise converges after a few iterations. In contrast, using the min value as the centroid to adjust the boundary requires more iterations to achieve convergence. The other three methods behave similarly, indicating comparable running times. In general, although the Min Method requires the most iterations to converge among the six boundary adjustment methods, it achieves the best data quality after rejecting artifact-corrupted epochs.

Figure 6 provides an overview of the number of epochs in each data segment after corrupted epochs are rejected by the proposed method. It can be noted from the figure that using the min value as the centroid always results in more iterations and fewer retained epochs.

Fig. 6
figure 6

The number of epochs in each data segment after corrupted epochs are rejected by the proposed method using different statistical indicators as the centroid. a Min Method. b Max Method. c Median Method. d Mean Method. e Kurtosis Method. f Skewness Method

3.2 Comparison with the state-of-the-art

In this subsection, the proposed method is compared with several state-of-the-art automatic artifact removal methods: two unsupervised and two supervised. The first is Autoreject, which adaptively selects thresholds for discriminating artifacts from clean segments using cross-validation and employs Bayesian optimization to search for the optimal threshold [44]. The second, referred to as AUTO in this paper, feeds a series of complex handcrafted features into an autoencoder for unsupervised learning and outlier removal [49]. The third and fourth are EEGdenoiseNet and Interpretable CNN, well-known convolutional neural network–based approaches for artifact handling in EEG signals [22, 35]. In this paper, AUTO is trained for 100 epochs to discriminate artifact-corrupted epochs, with contamination set to 0.1 by default according to [49]. For EEGdenoiseNet, a classifier was appended to the network as stated in [22] and trained for 100 epochs on the EEGdenoiseNet dataset, a semi-synthetic EEG dataset. Following [22], Interpretable CNN was likewise trained for 100 epochs on the EEGdenoiseNet dataset to achieve high accuracy in classifying clean EEG and artifacts. After training, both networks were validated on the dataset collected for this paper, with each epoch fed to the network as one batch. The training and testing of the deep learning models were executed on an NVIDIA GeForce GTX 1080 Ti. The comparison results are shown in Table 3. Note that the numbers of epochs left by these methods differ, as shown in Table 3: each algorithm rejects the epochs it deems heavily contaminated by artifacts, so processing the same data segment with different algorithms cannot be guaranteed to yield data of the same length. This discrepancy in data length after automatic rejection makes it sensible to compare the quality of the artifact-rejected data using ODQ values.
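Since the ODQ underlies the whole comparison, its computation can be sketched in a few lines. This is a minimal illustration that assumes the ODQ is the fraction of fixed-length windows whose peak-to-peak amplitude stays within a threshold; the window length, threshold, and function name are placeholders, not the exact criterion defined in [50].

```python
import numpy as np

def odq(signal, fs, win_sec=1.0, ptp_thresh=100.0):
    """Fraction of 'good' windows: windows whose peak-to-peak (PTP)
    amplitude stays below ptp_thresh. Window length and threshold are
    illustrative placeholders, not the exact criterion of [50]."""
    win = int(win_sec * fs)
    n_win = len(signal) // win
    windows = signal[: n_win * win].reshape(n_win, win)
    ptp = windows.max(axis=1) - windows.min(axis=1)
    return float(np.mean(ptp < ptp_thresh))

# Injecting one high-amplitude spike lowers the ODQ of a clean recording:
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 10.0, 2560)   # 10 s of synthetic EEG at 256 Hz
corrupt = clean.copy()
corrupt[600] += 500.0                 # artifact spike inside one window
print(odq(clean, fs=256), odq(corrupt, fs=256))
```

Because the score is a proportion of good windows rather than an absolute count, it stays comparable across methods that retain different numbers of epochs.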

Table 3 Results from comparison between the proposed method and other methods

The ODQ values before and after artifact removal by the proposed method and the other methods are shown in Fig. 7a. The proposed method yields superior or equivalent results on most data segments. However, not all of these methods improve data quality: after processing by AUTO and EEGdenoiseNet, segment twelve even exhibits a lower ODQ value than the original data, and Autoreject behaves similarly on segment four. As shown in Fig. 3, the bars representing the data segments are sorted by ODQ value, and segments one to six are rated D. The proposed method yields the largest improvement in ODQ value on these segments, indicating a higher proportion of good windows in the artifact-rejected data and implying that the proposed method outperforms the others on poor-quality data segments. Figure 8a is a scatter plot with the ODQ values of the proposed method on the y-axis and those of the other methods on the x-axis. Most data points lie above the dashed line, so the ODQ values of segments after artifact rejection by the proposed method are higher than those of the other methods. Moreover, the scatter plot shows that the proposed method improves data quality to some extent, as few data points are located in the bottom-left corner.

Fig. 7
figure 7

Comparisons of ODQ values and number of epochs between the original data and data after rejection by the proposed method and other methods. a ODQ values of the original data and data after rejection by the proposed method and other methods. b Number of epochs of the original data and data after rejection by the proposed method and other methods

Fig. 8
figure 8

Scatter plots for comparing the performances of the proposed method and other methods in terms of the ODQ value and the number of retained epochs after artifact rejection. a Scatter plot for comparison between the proposed method and other methods in terms of ODQ value. b Scatter plot for comparison between the proposed method and other methods in terms of number of retained epochs

Subsequently, the number of epochs retained by the proposed method was compared with those retained by the other methods; the results are depicted in Fig. 7b. The figure suggests that the proposed method tends to reject more epochs than the other methods on poor-quality data segments, indicating that it is more proficient at detecting artifact-corrupted epochs. Notably, the two supervised learning methods did not reject any epochs on segments 4, 7, and 10, and consequently failed to improve the ODQ value. From the perspective of ML/DL, data quality matters more than quantity. Figure 8b shows a scatter plot of the number of retained epochs after rejection by the proposed method (y-axis) and the other methods (x-axis). More data points lie below the dashed line, demonstrating that fewer epochs remain after rejection by the proposed method than after rejection by the other methods. From Figs. 7 and 8, it can be seen that the proposed method results in a higher proportion of good windows in the artifact-rejected data even when fewer epochs are retained; in other words, it increases the ratio of clean to contaminated epochs in the remaining data.

In addition to the ODQ value, two other indicators used to assess data quality in [50] are adopted to compare the effects of the different methods: the zero-crossing rate (ZCR) and the root-mean-square (RMS). The ZCR is the rate at which the signal changes sign and is defined as:

$$\mathrm{ZCR}=\frac{1}{N}\sum_{n=2}^{N}\left|\operatorname{sign}\left[x\left(n\right)\right]-\operatorname{sign}\left[x\left(n-1\right)\right]\right|$$
(17)

The ZCR is a good indicator of high-frequency artifacts, since it reflects the frequency at which most of the signal energy is concentrated in the spectrum. The RMS is a useful measure of the magnitude of the EEG signal and is calculated by:

$$\mathrm{RMS}=\sqrt{\frac{1}{N}\sum_{n=1}^{N}{\left|x(n)\right|}^{2}}$$
(18)

Theoretically, EEG recordings containing high-frequency or high-amplitude artifacts have higher ZCR and RMS values than EEG recordings within the normal range. Since the different methods retain data of different lengths, the ZCR and RMS are normalized by dividing by the number of retained epochs. The ZCR and RMS of the EEG recordings after processing by each method are shown in Table 4.
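Under the definitions in Eqs. (17) and (18), the two indicators and their per-epoch normalization can be sketched in a few lines of NumPy. The function names and the epoch layout (one row per epoch) are assumptions for illustration:

```python
import numpy as np

def zcr(x):
    # Eq. (17): rate of sign changes along the signal
    s = np.sign(x)
    return np.sum(np.abs(s[1:] - s[:-1])) / len(x)

def rms(x):
    # Eq. (18): root-mean-square amplitude
    return np.sqrt(np.mean(np.abs(x) ** 2))

def normalized_metrics(epochs):
    """epochs: array of shape (n_epochs, n_samples). The summed ZCR and
    RMS are divided by the number of retained epochs, as in the text,
    so that methods retaining different data lengths stay comparable."""
    n = len(epochs)
    total_zcr = sum(zcr(e) for e in epochs)
    total_rms = sum(rms(e) for e in epochs)
    return total_zcr / n, total_rms / n
```

Note that each zero crossing contributes a term of magnitude 2 in Eq. (17), so a fully alternating signal attains the maximum possible value rather than 1.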

Table 4 Comparison of ZCR and RMS between the original data and data processed by each method

From Table 4, it can be observed that the data processed by the proposed method exhibit lower ZCR values than the data processed by the other methods. Furthermore, since the proposed method detects outliers in the PTP feature space, it also yields a lower RMS value than the other techniques. To compare the performance of the various methods on each individual data segment, the differences in ZCR and RMS between the original data and each method are illustrated in Fig. 9.

Fig. 9
figure 9

Plots for comparing the performances of the proposed method and the other methods in terms of the differenced ZCR value and the differenced RMS value. a Differenced ZCR value. b Differenced RMS value

As depicted in Fig. 9a, the proposed method achieves the largest quality improvement on more than 50% of the data segments: the differences between the ZCR values of the original and processed data are the most pronounced. Figure 9b further reveals that even where other methods fail to reduce the RMS value of the data, the proposed method still does so effectively. Segment 4 exemplifies this: the proposed method lowers both the ZCR and RMS values, whereas Autoreject increases the RMS value. All methods elevate the RMS value for segment 10, though the proposed method yields the smallest increase.

Furthermore, the relationship between running time and data segment length was compared across the aforementioned rejection methods. To this end, data segments of 1 h, 3 h, and 5 h were extracted from the patients' recordings, the five methods were executed on them, and the corresponding running times were recorded. Figure 10 plots, as square markers, the logarithmic average running time across segments against the corresponding segment length; the straight lines through these points are linear fits of running time against segment length, with the fitted parameters presented in Table 5. As demonstrated in Fig. 10 and Table 5, the linear fit for the proposed method has a lower intercept than those of the other methods, implying that the running times of the methods are comparable when the data segment is short. Moreover, although the slopes of the two supervised learning methods are slightly higher than that of the proposed method, the time cost of pre-training them should also be taken into account, implying a potentially higher time complexity.
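The linear fit underlying Fig. 10 and Table 5 can be reproduced with `numpy.polyfit`. The running times below are hypothetical placeholders, not the values measured in this study:

```python
import numpy as np

# Hypothetical running times (in seconds) for 1 h, 3 h, and 5 h segments;
# the actual measurements are those reported in Table 5.
seg_hours = np.array([1.0, 3.0, 5.0])
runtime_s = np.array([2.1, 5.8, 9.7])

# Fit log10(runtime) = slope * length + intercept, matching the
# logarithmic running-time axis of Fig. 10.
slope, intercept = np.polyfit(seg_hours, np.log10(runtime_s), deg=1)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Comparing methods then reduces to comparing the fitted intercepts (cost on short segments) and slopes (growth with segment length).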

Fig. 10
figure 10

Relationships between running time and length of the data segment of the proposed method and other methods

Table 5 Results of linear fitting between the proposed method and other methods

4 Discussion

In this study, a novel automatic rejection method for enhancing the data quality of clinical Long-Term EEG recordings at the earliest stage of preprocessing is proposed, and its performance is evaluated by comparison with four state-of-the-art methods. This section discusses the results presented in the previous section and elucidates the distinctions and similarities between the proposed method and the other methods in specific contexts.

In the previous section, it was first demonstrated that using the minimum value of the data as the centroid to adjust the boundary is more effective than using other statistical indicators. The rationale is that boundary adjustment based on the minimum value consistently targets the minimum of the data and increases the number of iterations, allowing more outliers to be discovered. While the median, mean, maximum, kurtosis, and skewness can also improve data quality, they risk terminating the iteration prematurely, preventing the method from reaching the optimal result. In theory, skewness and kurtosis should outperform simpler statistics because they consider the overall shape of the distribution; in practice, however, they perform worse than the Min Method. This can be attributed to the relatively small variation of skewness and kurtosis across the data, which makes them likely to remain unchanged after a portion of the data is removed, thereby ending the removal process early. Nevertheless, using the minimum value as the centroid retains fewer epochs than the other five approaches, since it requires more iterations; the trade-off is that the epochs that are kept are of high data quality.
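The iteration logic described above can be made concrete with a small sketch. All names here are hypothetical, and the MAD-based margin rule is a stand-in for the paper's actual boundary-adjustment step; only the re-anchoring of the boundary at the minimum PTP each round mirrors the Min Method:

```python
import numpy as np

def reject_min_centroid(ptp, margin=3.0):
    """Illustrative sketch only: each round, re-anchor the boundary at
    the minimum PTP of the retained epochs (the 'Min Method' centroid)
    and reject epochs beyond min + margin * MAD. The MAD-based margin
    is a stand-in for the paper's boundary adjustment, not its rule."""
    keep = np.ones(len(ptp), dtype=bool)
    while True:
        vals = ptp[keep]
        centroid = vals.min()                        # always the minimum
        mad = np.median(np.abs(vals - np.median(vals)))
        new_keep = keep & (ptp <= centroid + margin * mad)
        if new_keep.sum() == keep.sum():             # nothing rejected: stop
            return keep
        keep = new_keep
```

Because the centroid is pinned to the minimum, the boundary keeps tightening until no further epoch is rejected, which matches the observation that the Min Method runs more iterations and retains fewer, cleaner epochs.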

The results from the previous sections demonstrate that the proposed method offers certain advantages in reliability over the state-of-the-art methods. From a principle-based perspective, Autoreject employs a cross-validation framework and uses the Frobenius norm of the mean of good trials in the training set, together with the median in the validation set. It therefore assumes that the mean and median of the PTP values of the dataset suffice to discriminate artifacts from clean data. However, as stated in Section 2, the proportion of time a patient spends in a resting state versus performing activities is unknown, and artifact-corrupted data may even outnumber clean data. In that case, the mean and median used by Autoreject may fail to separate artifact-contaminated epochs from clean ones, since they implicitly assume that epochs with PTP above the mean or median are likely to contain artifacts while epochs below are likely to be clean; this assumption does not always hold in clinical settings. The superiority of the proposed method over Autoreject lies in its divergence from this approach: during the iterative process, it uses the minimum PTP value of the retained epochs as the criterion for distinguishing clean from contaminated epochs. Although this may reduce the number of retained epochs, anchoring on the minimum consistently ensures an improvement in data quality. Regarding the comparison with the remaining methods, it is important to acknowledge the significant contributions DL-based methods have made to the preprocessing stage of EEG. In terms of feature extraction, AUTO manipulates multiple features and metrics extracted from the EEG, compressing them into a more compact space.
Subsequently, a decision boundary is carved out to separate the outliers from the inliers. Compared to the proposed method, the major drawback of AUTO lies in its assumption of a fixed proportion of outliers in the data. Although this proportion can be treated as a hyperparameter of the model, the proposed method instead continuously discovers contaminated epochs through an iterative process. EEGdenoiseNet and Interpretable CNN employ distinct convolution kernels to extract features from the raw EEG signals and classify clean EEG against artifacts based on these features, and they require pre-training on a dataset before being transferred to other scenarios. However, the training source data often lack diversity, leading to suboptimal transfer performance. The proposed approach, in contrast, is an unsupervised method that achieves satisfactory results without pre-training.
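The fixed-contamination assumption discussed above can be illustrated with a short sketch: regardless of how many epochs are actually corrupted, an AUTO-style detector always discards the same fraction. The score ranking below is a deliberate simplification of AUTO's autoencoder pipeline, and all names are hypothetical:

```python
import numpy as np

def reject_fixed_contamination(scores, contamination=0.1):
    """Drop the top `contamination` fraction of epochs ranked by outlier
    score (0.1 by default, following [49]). If fewer or more epochs are
    truly corrupted, the cut is wrong in either direction."""
    n_drop = int(np.ceil(contamination * len(scores)))
    cutoff = np.sort(scores)[-n_drop]
    return scores < cutoff   # True = keep

# Exactly 10% of epochs are dropped, whatever the data look like:
mask = reject_fixed_contamination(np.arange(20.0))
print(mask.sum())   # 18 epochs kept
```

An iterative scheme, by contrast, lets the data determine how many epochs to reject, which is why it copes better when contamination varies across recordings.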

On the other hand, the proposed method requires the shortest execution time of all the methods compared. From a theoretical standpoint, it constructs isolation trees by partitioning the feature space into sub-trees, and the feature space shrinks with each iteration step. In contrast, Autoreject relies on Bayesian optimization to select the optimal threshold from a set of candidates. Such optimization-based designs often yield satisfactory results within acceptable runtimes on small datasets, but on larger datasets the time required to find the global optimum is considerably longer than directly partitioning the feature space. The DL-based methods, EEGdenoiseNet and Interpretable CNN, exhibit short execution times on large-scale datasets with GPU support, but the time required for pre-training should also be considered: although their pre-training time in this study was on the order of thousands of seconds, it may grow considerably as the source dataset grows. The low runtime efficiency of AUTO can be attributed to its extraction of numerous features from the EEG as network inputs, which consumes more time than using PTP alone.

However, this study still has some limitations. First, only a limited number of patients were used to validate the proposed method; given the inter-subject variability of EEG, validation with a larger number of participants would be reasonable. Second, this paper addresses only the removal of contaminated data without considering methods for data restoration, which is a potential direction for future research.

5 Conclusion

In summary, a novel, reliable, and fast method for the automatic rejection of clinical EEG recordings is proposed in this paper. The results in Section 3 indicate that the proposed method, using the minimum value as the centroid, greatly improves data quality, implying that it is reliable for automatic artifact rejection in Long-Term EEG recordings. The proposed method was also compared with current state-of-the-art methods for preprocessing clinical EEG data; the comparison suggests that it is competitive in most circumstances and performs better than the other methods especially when data quality is poor. Meanwhile, the running-time comparison indicates that the proposed method has lower time complexity and is considerably faster than the other methods. This results from the repeated application of a data-driven outlier detection algorithm with an appropriately chosen centroid, which ultimately satisfies the termination condition. By providing a tool that cleans data automatically, researchers can reduce the time spent inspecting data and focus on scientific questions rather than parameter tuning for preprocessing.