An online Bayesian approach to change-point detection for categorical data
Introduction
Change-point detection (CPD) is becoming increasingly popular in several fields, including quality control [1], biomedical research [2], economics [3], [4], text mining [5], signal segmentation [6] and business analysis [7]. Change-points are abrupt changes in the underlying distribution of data over time, and they represent transitions between states. Anomaly detection [8] and concept drift [9] are closely related concepts that appear in the literature alongside CPD. CPD has two aims, which are generally achieved simultaneously: detecting whether a change-point exists and, if it does, estimating when it occurs.
Categorical data is widespread in practice, for example in health-care-related environments [10], recommendation systems [11], text [12], [13] and images [14], [15]. Some nonparametric and distribution-free methods can be applied to CPD for categorical data, but they do not make full use of the categorical information. Existing methods for CPD in categorical data usually perform poorly in the presence of “rare events” (events that occur with low frequency). The classical Pearson’s chi-squared test, for example, requires a minimum count in each category to give a convincing result. In this paper, we focus on CPD for categorical data. Specifically, we assume that each data point is a high-dimensional vector of counts whose length is the number of categories. Each coordinate of the vector is the count for one category, and the coordinates sum to the total number of trials. We then propose an online Bayesian approach. Unlike existing methods, we assume that categorical data comes from Dirichlet-multinomial mixtures [12], [14], [16], where the Dirichlet prior is introduced to reduce sensitivity to “rare events”. Under this assumption, we formulate the hypothesis testing problem for CPD and define the Bayes factor as the test statistic.
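To make the role of the Dirichlet prior concrete, the following is a minimal sketch of a Bayes factor for count vectors. It is not the Bayes factor defined later in the paper: it assumes a single symmetric Dirichlet prior rather than a Dirichlet-multinomial mixture, and it compares "the two windows share one probability vector" against "each window has its own". The function names `log_beta` and `log_bayes_factor` and the prior `alpha` are hypothetical. Under this simplified model the multinomial coefficients cancel, so only the pooled counts of each window are needed.

```python
from math import lgamma


def log_beta(a):
    """Log of the multivariate beta function B(a)."""
    return sum(lgamma(v) for v in a) - lgamma(sum(a))


def log_bayes_factor(window_a, window_b, alpha):
    """Log Bayes factor of 'change between windows' versus 'no change'.

    window_a, window_b: lists of count vectors (one vector per time step).
    alpha: Dirichlet prior parameters (an assumption of this sketch).
    """
    # Pool the counts of each window category by category.
    counts_a = [sum(col) for col in zip(*window_a)]
    counts_b = [sum(col) for col in zip(*window_b)]
    post_a = [a + c for a, c in zip(alpha, counts_a)]
    post_b = [a + c for a, c in zip(alpha, counts_b)]
    pooled = [a + ca + cb for a, ca, cb in zip(alpha, counts_a, counts_b)]
    # Marginal likelihood with two separate probability vectors, divided by
    # the marginal likelihood with one shared vector (on the log scale).
    return (log_beta(post_a) + log_beta(post_b)
            - log_beta(pooled) - log_beta(alpha))


# The third category is a "rare event" with zero counts; the prior keeps
# every term finite, so no minimum count per category is required.
prior = [1.0, 1.0, 1.0]
stable = log_bayes_factor([[9, 1, 0]] * 5, [[9, 1, 0]] * 5, prior)
shifted = log_bayes_factor([[9, 1, 0]] * 5, [[1, 9, 0]] * 5, prior)
```

In this toy example `stable` is negative (evidence against a change) and `shifted` is positive (evidence for a change), even though one category is never observed.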
In the modern big data era, data often arrives in the form of streams. Large volumes of streaming data cannot be held in a machine’s main memory, so an online procedure for CPD is desirable. To deal with data streams in which data arrives continuously, we apply an incremental expectation–maximization (EM) algorithm to estimate the parameters of our Bayesian model and design an online strategy for CPD. Power analyses and Monte Carlo simulations are presented to show the effectiveness of our method. Moreover, the proposed model is compared with several competing methods and shows clear improvements. Lastly, we apply it to various real-world scenarios, including biomedical research, document analysis, a health news case study and location monitoring.
The remainder of this paper is organized as follows. In Section 2, we review related work on CPD. In Section 3, a Bayesian approach for CPD in categorical data is presented. In Section 4, an online estimation procedure and an online detection strategy are designed for data streams. Sections 5 and 6 present simulations and applications, respectively. Section 7 concludes.
Section snippets
Change-point detection
Existing CPD methods can be divided into two classes: supervised and unsupervised. Supervised CPD can be regarded as a binary classification problem, for which machine learning classifiers such as nearest neighbor [17], decision trees [18] and Naive Bayes [19] are widely used.
Unsupervised learning, in which training samples are unlabeled, is more commonly considered in CPD than supervised learning. One of the popular ideas is to transform CPD into a hypothesis
Change-point detection problem
Denote the data stream as , where , are independent high-dimensional variables with length . Our goal is to detect the occurrence of a change-point in . Before proceeding, we first give the definition of a change-point, which is the same as that in most of the literature [9], [11], [13], [32], [37].
Definition 1 (Change-point). A change-point represents a transition between different distributions that generate the data over time.
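One common way to state Definition 1 formally is as a two-hypothesis problem. The notation below is assumed here purely for illustration and may differ from the symbols used in the rest of the paper: $X_1, \dots, X_T$ are the observations, $\tau$ is the change-point, and $F_0$ and $F_1$ are the distributions before and after the change.

```latex
\begin{align*}
H_0 &: X_1, \dots, X_T \sim F_0 \ \text{independently}, \\
H_1 &: \exists\, \tau \in \{1, \dots, T-1\} \ \text{such that}\
      X_1, \dots, X_\tau \sim F_0
      \ \text{and}\
      X_{\tau+1}, \dots, X_T \sim F_1 \ \text{independently},
      \quad F_1 \neq F_0 .
\end{align*}
```

Detecting a change-point then amounts to rejecting $H_0$, and estimating its location amounts to estimating $\tau$.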
Denote the subsequence of starting at time with length as
Online parameter estimation
Estimating the value of parameter and in the Bayes factor (5) is a tricky problem. The optimization formula can be expressed as follows. The difficulty in solving the optimization problem (9)–(11) is the existence of latent variables . The expectation–maximization (EM) algorithm, widely used for missing data, censored observations and mixture distributions [45], [46], is considered for our model. The EM
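As a rough illustration of how EM can be run one observation at a time, the sketch below implements a stepwise (online) EM update. It is not the procedure of this paper, whose update equations are not reproduced in this excerpt: it swaps in a plain mixture of multinomials in place of the Dirichlet-multinomial mixture, and the class name, the step-size exponent 0.6 and the `smoothing` constant are all assumptions made for the example.

```python
import math
import random


class StepwiseMultinomialMixtureEM:
    """Stepwise EM for a mixture of multinomials (illustrative stand-in)."""

    def __init__(self, n_components, n_categories, seed=0, smoothing=1e-2):
        rng = random.Random(seed)
        self.K, self.V, self.smoothing = n_components, n_categories, smoothing
        self.t = 0
        # Running averages of the expected sufficient statistics.
        self.resp = [1.0 / self.K] * self.K
        self.counts = [[rng.random() + 0.5 for _ in range(self.V)]
                       for _ in range(self.K)]
        self._m_step()

    def _m_step(self):
        """Recompute weights and category probabilities from the statistics."""
        tiny = 1e-12  # keeps every weight strictly positive
        total = sum(self.resp) + self.K * tiny
        self.weights = [(r + tiny) / total for r in self.resp]
        self.theta = []
        for row in self.counts:
            denom = sum(row) + self.V * self.smoothing
            self.theta.append([(c + self.smoothing) / denom for c in row])

    def responsibilities(self, x):
        """E-step for one count vector, computed on the log scale."""
        logp = [math.log(w) + sum(c * math.log(p) for c, p in zip(x, th))
                for w, th in zip(self.weights, self.theta)]
        m = max(logp)
        unnorm = [math.exp(v - m) for v in logp]
        z = sum(unnorm)
        return [u / z for u in unnorm]

    def update(self, x):
        """Process one new count vector; only O(K * V) state is kept."""
        self.t += 1
        gamma = (self.t + 1) ** -0.6  # decaying step size
        r = self.responsibilities(x)
        for k in range(self.K):
            self.resp[k] = (1 - gamma) * self.resp[k] + gamma * r[k]
            for v in range(self.V):
                self.counts[k][v] = ((1 - gamma) * self.counts[k][v]
                                     + gamma * r[k] * x[v])
        self._m_step()


# Feed a toy stream that alternates between two count patterns.
model = StepwiseMultinomialMixtureEM(n_components=2, n_categories=3)
for i in range(200):
    model.update([24, 3, 3] if i % 2 == 0 else [3, 3, 24])
```

Each call to `update` touches only the running statistics, so the stream itself never needs to be stored, which is the property that makes this family of methods suitable for data streams.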
Simulations
In this section, we first use Monte Carlo simulations to analyze the power of our proposed Bayes factor and then compare it with other existing detection methods. In the following, we set as the threshold value according to Table 1.
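A Monte Carlo power analysis of this kind can be sketched as follows. The code is only an illustration under stated assumptions: it uses a simplified Bayes factor with a single symmetric Dirichlet prior rather than the statistic of this paper, and the threshold of 0 on the log scale, the probability vectors `p0` and `p1`, and all function names are hypothetical choices, not the settings used in the experiments.

```python
import random
from math import lgamma


def log_beta(a):
    """Log of the multivariate beta function B(a)."""
    return sum(lgamma(v) for v in a) - lgamma(sum(a))


def log_bayes_factor(counts_a, counts_b, alpha):
    """Log Bayes factor of 'two distributions' versus 'one' for pooled counts."""
    post_a = [a + c for a, c in zip(alpha, counts_a)]
    post_b = [a + c for a, c in zip(alpha, counts_b)]
    pooled = [a + ca + cb for a, ca, cb in zip(alpha, counts_a, counts_b)]
    return (log_beta(post_a) + log_beta(post_b)
            - log_beta(pooled) - log_beta(alpha))


def sample_counts(rng, probs, n_trials):
    """Draw one multinomial count vector with n_trials trials."""
    counts = [0] * len(probs)
    for k in rng.choices(range(len(probs)), weights=probs, k=n_trials):
        counts[k] += 1
    return counts


def rejection_rate(p_before, p_after, n_trials, threshold, n_rep=500, seed=0):
    """Fraction of simulated window pairs whose log Bayes factor exceeds threshold."""
    rng = random.Random(seed)
    alpha = [1.0] * len(p_before)
    hits = 0
    for _ in range(n_rep):
        before = sample_counts(rng, p_before, n_trials)
        after = sample_counts(rng, p_after, n_trials)
        if log_bayes_factor(before, after, alpha) > threshold:
            hits += 1
    return hits / n_rep


p0 = [0.70, 0.25, 0.05]  # distribution before the change (third category is rare)
p1 = [0.25, 0.70, 0.05]  # distribution after the change
false_alarm = rejection_rate(p0, p0, n_trials=100, threshold=0.0)
power = rejection_rate(p0, p1, n_trials=100, threshold=0.0)
```

Here `false_alarm` estimates how often a change is declared when none exists, and `power` estimates how often a true change is detected; repeating the calculation over a grid of thresholds or effect sizes gives a power curve.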
Applications
In this section, we apply our model to four datasets from the fields of biomedical research, document analysis, a health news case study and location monitoring. The first two have been considered extensively in the literature [36], [37], [38], [50], the third is publicly available from the University of California at Irvine, and the fourth was collected by the authors. In the following, we set .
Conclusions
In this paper, we propose an online Bayesian approach to change-point detection for categorical data. Firstly, we formulate change-point detection as a hypothesis testing problem. Secondly, we introduce the Dirichlet distribution as prior information and design the Bayes factor as the test statistic. Thirdly, an online parameter estimation procedure and an online detection strategy are developed to adapt to data streams. The proposed method is robust when “rare events” exist. Simulations and
CRediT authorship contribution statement
Yiwei Fan: Methodology, Formal analysis, Software, Writing - original draft. Xiaoling Lu: Conceptualization, Methodology, Data curation, Supervision, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The research was supported by the Ministry of Education Focus on Humanities and Social Science Research Base (Major Research Plan 17JJD910001) (China) and the fund for building world-class universities (disciplines) of Renmin University of China.
References (55)
- et al. Detecting and predicting the topic change of knowledge-based systems: A topic-based bibliometric analysis from 1991 to 2016. Knowl.-Based Syst. (2017)
- et al. Change-point detection in time-series data by relative density-ratio estimation. Neural Netw. (2013)
- et al. Change-point detection in multinomial data using phi-divergence test statistics. J. Multivariate Anal. (2013)
- et al. Bayesian image segmentation fusion. Knowl.-Based Syst. (2014)
- et al. Evolution properties of online user preference diversity. Physica A (2017)
- et al. Quality monitoring based on dynamic resistance and principal component analysis in small scale resistance spot welding process. Int. J. Adv. Manuf. Technol. (2016)
- et al. Inference for multiple change points in time series via likelihood ratio scan statistics. J. R. Stat. Soc. Ser. B Stat. Methodol. (2016)
- Dynamic detection of change points in long time series. Ann. Inst. Statist. Math. (2007)
- et al. Continuous monitoring for changepoints in data streams using adaptive estimation. Stat. Comput. (2017)
- et al. ST segment change detection by means of wavelets.
- Evaluating Google, Twitter, and Wikipedia as tools for influenza surveillance using Bayesian change point analysis: a comparative analysis. JMIR Public Health Surv.
- Anomaly detection: A survey. ACM Comput. Surv.
- A survey on concept drift adaptation. ACM Comput. Surv.
- Review of multinomial and multiattribute quality control charts. Qual. Reliab. Eng. Int.
- Change-point detection in multinomial data with a large number of categories. Ann. Statist.
- Latent Dirichlet allocation. J. Mach. Learn. Res.
- Clustering of count data using generalized Dirichlet multinomial distributions. IEEE Trans. Knowl. Data Eng.
- General models for resource use or other compositional count data using the Dirichlet multinomial distribution. Ecology
- Understanding transportation modes based on GPS data for web applications. ACM Trans. Web
- Using mobile phones to determine transportation modes. ACM Trans. Sensor Netw.
- Sequential tests of statistical hypotheses. Ann. Math. Stat.
- Sequential change detection on data streams.
- Adaptive concept drift detection. Stat. Anal. Data Min.
- EVE: a framework for event detection. Evol. Syst.
- Learning recurring concepts from data streams with a context-aware ensemble.