Multichannel biomedical time series clustering via hierarchical probabilistic latent semantic analysis
Introduction
With the development of modern recording technology and the reduction of hardware costs, more and more biomedical time series, such as electrocardiography (ECG) signals, are recorded to monitor human physiological condition. Managing and analysing such a large volume of physiological signals effectively and efficiently is a major challenge. Traditionally, these signals are managed and analysed manually by medical experts. However, manual management and inspection are time-consuming and labour-intensive. Even worse, false hit rates by operators may increase considerably during long-term inspection and management, since it is difficult for humans to maintain a high level of concentration for a long time. Therefore, automatic methods that help medical experts manage and inspect large volumes of physiological time series are very valuable.
One automatic method for biomedical time series management and inspection is time series clustering [1], [2], [3], [4], [5], which groups a collection of time series with no prior label information according to their internal structural similarity. Time series clustering makes management tasks such as bio-signal archiving and retrieval much easier. For biomedical time series clustering, it is important to extract discriminative features that characterize the time series. Some state-of-the-art works extract features from the time domain [6], [7], [8], while others transform the time series into the frequency domain [9], [10], [11]. However, most of these methods are limited in their ability to extract internal structural similarity information. To address this limitation, Wang et al. [1], [12] and Lin and Li [13] proposed a bag-of-words/patterns representation, originally developed for text document analysis, to effectively capture the structural similarity information of time series. In the bag-of-words/patterns representation [1], [12], [13], time series are treated as documents and local segments are extracted from the time series as words. A time series is then represented as a histogram that counts how often each codeword occurs in it.
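To make the representation concrete, the following minimal Python sketch builds a codeword histogram for a single-channel series. The function names, the window length, the step size and the hand-written toy codebook are illustrative assumptions. This excerpt does not state how the codebook is learned, and nearest-codeword assignment by Euclidean distance is assumed here. The ℓ2 normalization of each segment follows the description given later for the bag-of-words representation.

```python
import math
from collections import Counter


def sliding_segments(series, window, step=1):
    """Extract local segments by sliding a fixed-length window along the series."""
    return [series[i:i + window] for i in range(0, len(series) - window + 1, step)]


def l2_normalize(segment):
    """Scale a segment to unit Euclidean norm (all-zero segments are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in segment))
    return [x / norm for x in segment] if norm > 0 else list(segment)


def nearest_codeword(segment, codebook):
    """Index of the codeword with the smallest squared Euclidean distance (assumed rule)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(segment, codebook[k]))


def bag_of_words(series, codebook, window, step=1):
    """Histogram of codeword occurrences: the 'document' is the series, the 'words' are segments."""
    counts = Counter(
        nearest_codeword(l2_normalize(seg), codebook)
        for seg in sliding_segments(series, window, step)
    )
    return [counts.get(k, 0) for k in range(len(codebook))]


# Toy example with a hypothetical four-word codebook of length-2 patterns.
toy_series = [0, 1, 0, -1, 0, 1, 0, -1]
toy_codebook = [[0, 1], [1, 0], [0, -1], [-1, 0]]
histogram = bag_of_words(toy_series, toy_codebook, window=2)
```

For the toy series, the seven length-2 segments are assigned to the four codewords with counts [2, 2, 2, 1]. In practice the codebook would be learned from training segments rather than written by hand.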
Based on the bag-of-words/patterns representation, probabilistic topic models such as probabilistic Latent Semantic Analysis (pLSA) [14] and Latent Dirichlet Allocation (LDA) [15] have been extended to cluster sets of unlabelled biomedical time series according to their structural similarity [1], [16]. These studies demonstrated that a probabilistic topic model can naturally model the generative process of the words/patterns in time series, and that it provides very promising clustering performance [1], [16].
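For reference, the standard pLSA model and its EM updates can be written as below. The notation is an assumption made for this sketch and may differ from the notation used in the full paper: d denotes a time series (document), w a codeword, z a latent topic, and n(d, w) the number of times codeword w occurs in series d.

```latex
% Mixture decomposition: each word in a series is generated through a latent topic.
P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)

% E-step: posterior over topics for each (series, codeword) pair.
P(z \mid d, w) = \frac{P(w \mid z)\, P(z \mid d)}{\sum_{z'} P(w \mid z')\, P(z' \mid d)}

% M-step: re-estimate the topic-word and series-topic distributions.
P(w \mid z) = \frac{\sum_{d} n(d, w)\, P(z \mid d, w)}{\sum_{d} \sum_{w'} n(d, w')\, P(z \mid d, w')},
\qquad
P(z \mid d) = \frac{\sum_{w} n(d, w)\, P(z \mid d, w)}{\sum_{w} n(d, w)}
```

When pLSA is used for clustering, a series is commonly assigned to the topic z that maximizes P(z | d).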
However, the clustering framework proposed in [1], [16] was developed for single-channel time series analysis. In real clinical applications, many biomedical signals are recorded in multiple channels. For instance, ECG signals are typically recorded in more than one channel to provide more comprehensive clinical information. In this paper, we extend the bag-of-words representation and the probabilistic topic models to multichannel time series analysis. As in the bag-of-words representation for single-channel time series analysis [12], we treat a multichannel time series as a document and extract local segments from each channel of the time series as words. Based on this representation, we extend the topic models to analyse multichannel biomedical time series in an unsupervised manner. Specifically, a hierarchical pLSA (H-pLSA) [17], originally developed for visual motion analysis, is extended to cluster multichannel biomedical time series.
The hierarchical pLSA developed in [17] models visual motion using a two-layer pLSA. Local motion behaviours are modelled in the first layer, and global motion behaviours are discovered by the global pLSA. In this paper, we extend the hierarchical pLSA to automatically discover categories of multichannel biomedical time series. In the first layer, we model each channel of the time series using a local pLSA model. The local topics extracted from each channel of the time series are then treated as words in the second-layer pLSA, i.e., global pLSA. The categories of the multichannel time series are automatically discovered by the global pLSA.
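The two-layer idea described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the random initialization, the fixed number of EM iterations and the toy counts are hypothetical. The way local topics become "words" for the global layer (expected local-topic counts, i.e. the number of words in a channel times P(z|d)) is also an assumed choice, since this excerpt does not specify it.

```python
import random


def normalize(values):
    """Scale non-negative values to sum to one (uniform if they sum to zero)."""
    total = sum(values)
    if total <= 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM on a document-by-word count matrix given as a list of lists.

    Returns (p_z_d, p_w_z): P(z|d) for every document and P(w|z) for every topic.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)]) for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)]) for _ in range(n_topics)]
    for _ in range(n_iter):
        new_zd = [[0.0] * n_topics for _ in range(n_docs)]
        new_wz = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior P(z | d, w), weighted by the count n(d, w).
                post = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(post) or 1.0
                for z in range(n_topics):
                    r = n * post[z] / total
                    new_zd[d][z] += r
                    new_wz[z][w] += r
        # M-step: renormalize the accumulated expected counts.
        p_z_d = [normalize(row) for row in new_zd]
        p_w_z = [normalize(row) for row in new_wz]
    return p_z_d, p_w_z


def hierarchical_plsa(channel_counts, n_local, n_global, n_iter=50, seed=0):
    """Two-layer pLSA: one local model per channel, then one global model.

    channel_counts[c][d][w] is the count of codeword w in channel c of series d.
    Returns one cluster label per series (the most probable global topic).
    """
    n_docs = len(channel_counts[0])
    global_counts = [[] for _ in range(n_docs)]
    for c, counts in enumerate(channel_counts):
        p_z_d, _ = plsa(counts, n_local, n_iter, seed + c)
        for d in range(n_docs):
            n_d = sum(counts[d])
            # Assumption: local topics act as global "words" with expected counts.
            global_counts[d].extend(n_d * p for p in p_z_d[d])
    p_g_d, _ = plsa(global_counts, n_global, n_iter, seed)
    return [row.index(max(row)) for row in p_g_d]


# Toy data: 2 channels, 4 series, 4 codewords per channel (hypothetical counts).
channels = [
    [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 5, 4], [0, 1, 4, 5]],
    [[6, 3, 1, 0], [5, 4, 0, 0], [0, 0, 3, 6], [1, 0, 4, 5]],
]
labels = hierarchical_plsa(channels, n_local=2, n_global=2)
```

Because EM starts from a random initialization, the label indices themselves are arbitrary; only the grouping of series is meaningful, and several restarts may be needed on real data.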
The main contributions of this paper are threefold: (i) the bag-of-words model is extended to represent multichannel time series; (ii) based on the bag-of-words representation, a hierarchical pLSA (H-pLSA) is developed for multichannel time series clustering; (iii) a series of experiments is conducted to investigate the effectiveness and robustness of the H-pLSA for multichannel time series clustering.
The rest of this paper is organized as follows. In Section 2, we introduce how to construct a bag-of-words representation for multichannel time series. The details of the hierarchical pLSA are illustrated in Section 3. The experimental results are given in Section 4. Finally, Section 5 concludes this paper.
Section snippets
Bag-of-words representation
The method in [12] slides a window of pre-defined length continuously along a single-channel time series to extract local segments, and constructs a bag-of-words representation for single-channel time series analysis. Similarly, for multichannel time series, we slide a window of pre-defined length continuously along each channel of a time series to extract a group of segments. Each segment is then ℓ2 normalized to be a feature vector, i.e., each of the feature vectors is normalized to be a ℓ2
Hierarchical pLSA
In single-channel time series clustering, the works in [1], [16] treat a time series as a document, and extract local segments from the time series as words. The probabilistic Latent Semantic Analysis (pLSA) is extended to naturally model the generative process of the local segments (words) in the single-channel time series. The pLSA model introduces a latent topic for each local word in a time series and assumes that the observed local words are conditionally independent of the time series,
Experimental dataset and setup
In order to evaluate the effectiveness of the proposed method, we constructed a 15-channel ECG dataset from the PTB database, which is extensively used for biomedical time series analysis [18]. The PTB database consists of 549 records from 290 subjects, whose age ranges from 17 to 87. Each ECG record contains 15 simultaneously measured signals (i.e., 15 channels) with a sampling rate of 1000 Hz. The records are down-sampled to 500 Hz to reduce computation in the experiment. We randomly selected
Conclusion
This paper extended the probabilistic topic model to cluster multichannel biomedical time series. In particular, the hierarchical pLSA (H-pLSA) is proposed to learn the topic distribution in multichannel time series based on the bag-of-words representation. In the H-pLSA, we model each channel of the time series separately using a local pLSA in the first layer, and treat the topics learned in the first layer as the bag-of-words representation for the global pLSA in the second layer. The topics
Conflict of interest
The authors confirm that there are no known conflicts of interest associated with this manuscript and that there has been no financial support for this work that could have influenced its outcome.
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China (No. 61271008).
References (26)
- et al. Biomedical time series clustering based on non-negative sparse coding and probabilistic topic model. Comput. Methods Programs Biomed. (2013)
- et al. Unsupervised feature relevance analysis applied to improve ECG heartbeat clustering. Comput. Methods Programs Biomed. (2012)
- et al. ECG beat classification using a cost sensitive classifier. Comput. Methods Programs Biomed. (2013)
- et al. Detection of heartbeat and respiration from optical interferometric signal by using wavelet transform. Comput. Methods Programs Biomed. (2013)
- et al. Classifying depression patients and normal subjects using machine learning techniques and nonlinear features from EEG signal. Comput. Methods Programs Biomed. (2013)
- et al. ECG beat classification using PCA, LDA, ICA and discrete wavelet transform. Biomed. Signal Process. Control (2013)
- et al. Bag-of-words representation for biomedical time series classification. Biomed. Signal Process. Control (2013)
- et al. Unsupervised mining of long time series based on latent topic model. Neurocomputing (2013)
- et al. ECG beat classifier designed by combined neural network model. Pattern Recogn. (2005)
- et al. Multistage approach for clustering and classification of ECG data. Comput. Methods Programs Biomed. (2013)
- Clustering of time series data – a survey. Pattern Recogn.
- A review on time series data mining. Eng. Appl. Artif. Intell.
- Emotion recognition based on physiological changes in music listening. IEEE Trans. Pattern Anal. Mach. Intell.