Interval clustering algorithm for fast event detection in stream monitoring applications
Introduction
Recently, applications that monitor streams of data items such as sensor readings, network measurements, auction bids, stock exchanges, and web page visits have attracted considerable interest. In stream monitoring applications, it is important to identify abnormal events over bursty data arrivals in a timely manner (Babcock et al., 2002, Golab and Ozsu, 2003). For example, a patient’s body status can be monitored using biosensors in a medical center. In this application, abnormal events (e.g., life-threatening events) should be detected on time and reported to the medical staff immediately. Delayed detection of critical events may not be acceptable in this case. Similar examples can be found in network intrusion detection, plant monitoring, and so on.
In many applications, the conditions used for event detection can be represented as conjunctions of intervals. For example, BodyMedia’s Armband (Teller, 2004) predefines the possible body statuses, such as sleeping, exercising, reading, etc. The application then uses a range of biosensor data to identify the user’s body status. A body status can be represented as a conjunction of intervals, each of which describes a range of normal sensing values. For example, the status “sleep” can be represented as a conjunction of the intervals body temperature [36.0, 37.0] (°C), heart rate [80, 120], gait [0, 20], and others. An event that does not belong to any given status can be considered abnormal.
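The interval-conjunction representation can be illustrated with a minimal Python sketch. The names `STATUSES`, `matches`, and `classify` are hypothetical and chosen only for illustration; the "sleep" ranges are the ones quoted above, and the attribute keys are assumptions.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]       # closed interval [low, high]
Condition = Dict[str, Interval]      # attribute name -> interval

# Hypothetical status table; only the "sleep" ranges come from the text.
STATUSES: Dict[str, Condition] = {
    "sleep": {
        "temperature": (36.0, 37.0),
        "heart_rate": (80, 120),
        "gait": (0, 20),
    },
}


def matches(event: Dict[str, float], condition: Condition) -> bool:
    """True when every attribute value of the event lies inside its interval."""
    return all(low <= event[name] <= high
               for name, (low, high) in condition.items())


def classify(event: Dict[str, float],
             statuses: Dict[str, Condition] = STATUSES) -> List[str]:
    """Names of all statuses the event satisfies; an empty list means abnormal."""
    return [name for name, cond in statuses.items() if matches(event, cond)]
```

With this sketch, an event is compared against every condition in turn, which is the per-event cost that clustering similar conditions is meant to reduce.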
The number of conditions can become large in real-world applications. In this case, to meet the real-time constraint over bursty data arrivals, similar conditions can be clustered to reduce the number of comparisons. Many existing algorithms can be used for this purpose. Lingras et al. employed rough set theory based on the K-means clustering algorithm to reflect unknown overlapping sets in clustering (Lingras and West, 2004). Fuzzy clustering has also been proposed as an alternative (Lingras and Yan, 2004). Asharaf et al. used rough set theory in clustering based on the leader clustering algorithm (Asharaf et al., 2006). Regarding the dissimilarity measure, Souza and Carvalho (2004) introduced an adaptive method based on the city-block distance, where the dissimilarity between two intervals is measured adaptively by different weights. Chavent and Lechevallier (2002) used the Hausdorff distance to compare interval data.
On the other hand, none of these algorithms provides a method to control the amount of classification error resulting from clustering. If event detection is performed on the clustered intervals, the number of detection errors should be kept within a tolerable bound. For example, a medical center might limit the percentage of errors to less than 1% to avoid noisy detection results.
This paper proposes an interval clustering algorithm that provides an error control mechanism. The proposed algorithm enables a user to specify a permissible error bound which can be defined as a percentage of false positive errors that can occur during event detection. Given an error bound, the algorithm uses it as a threshold condition for clustering. In particular, when clustering a condition, the algorithm estimates the expected ratio of false positive errors based on the distribution of input events. The new condition can then be clustered only when its expected ratio is smaller than the given error bound.
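The threshold test described above can be sketched as follows. This is not the paper's estimator: it assumes that a cluster is summarized by the component-wise hull of its member conditions and that the false positive ratio is estimated empirically from a sample of past events. All names (`hull`, `false_positive_ratio`, `can_cluster`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]
Condition = Sequence[Interval]   # one closed interval per attribute
Event = Sequence[float]          # one value per attribute


def contains(cond: Condition, event: Event) -> bool:
    """True when each event value lies inside the matching interval."""
    return all(lo <= v <= hi for (lo, hi), v in zip(cond, event))


def hull(conds: List[Condition]) -> List[Interval]:
    """Component-wise smallest intervals that cover every condition."""
    return [(min(c[j][0] for c in conds), max(c[j][1] for c in conds))
            for j in range(len(conds[0]))]


def false_positive_ratio(members: List[Condition],
                         events: List[Event]) -> float:
    """Share of sampled events accepted by the hull but by no member condition."""
    if not events:
        return 0.0
    box = hull(members)
    fp = sum(1 for e in events
             if contains(box, e) and not any(contains(m, e) for m in members))
    return fp / len(events)


def can_cluster(cluster: List[Condition], new_cond: Condition,
                events: List[Event], error_bound: float) -> bool:
    """Admit new_cond only if the estimated ratio stays below the bound."""
    return false_positive_ratio(list(cluster) + [new_cond], events) < error_bound
```

For instance, merging the one-dimensional conditions [0, 1] and [2, 3] produces the hull [0, 3], so any sampled event falling in the gap (1, 2) counts as a false positive under this assumption.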
The remaining part of this paper is organized as follows. Section 2 introduces the preliminary concepts relevant to interval clustering. Section 3 describes an error estimation measure and the interval clustering algorithm based on it. Section 4 provides the experimental results of the proposed algorithm. Section 5 concludes the discussion.
Section snippets
Preliminaries
Let X be a set of n conditions such that $X = \{x_1, \ldots, x_n\}$. A condition, $x_i$, is represented as a vector of m intervals such that $x_i = (x_i^1, \ldots, x_i^m)$, where $x_i^j = [a_i^j, b_i^j]$. A partition, $P = \{C_1, \ldots, C_k\}$, of X into k equivalence classes ($k \le n$), can then be found. Fig. 1 gives an example of clustering n conditions into k classes; the conditions in a class (or a cluster) need not be consecutive, as shown in the figure.
A cluster, $C_l$, is also represented as a vector. The reference
Proposed algorithm
The proposed algorithm can be viewed as an extension of the leader clustering algorithm, augmented with an error control mechanism. To support error control, the algorithm estimates the amount of error incurred by clustering intervals. The percentage of false positive errors is used as the error measure; its estimation is discussed in the first subsection. A clustering algorithm that uses this measure for error control is then presented. To simplify
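A leader-style clustering loop with an error bound might look like the sketch below. It is an illustration under stated assumptions, not the published algorithm: the estimator is passed in as a callable, each condition joins the cluster with the smallest estimated error, and a new cluster is started when no candidate stays below the bound. `cluster_conditions` and `uniform_gap_estimate` are hypothetical names, and the toy estimator assumes one-dimensional conditions with events uniformly distributed over a fixed domain.

```python
from typing import Callable, List, Sequence, Tuple

Interval = Tuple[float, float]
Condition = Sequence[Interval]
Estimator = Callable[[List[Condition]], float]  # members -> estimated FP ratio


def cluster_conditions(conditions: List[Condition], estimate: Estimator,
                       error_bound: float) -> List[List[Condition]]:
    """Single pass: join the lowest-error cluster, or open a new one."""
    clusters: List[List[Condition]] = []
    for cond in conditions:
        best, best_err = None, error_bound
        for members in clusters:
            err = estimate(members + [cond])
            if err < best_err:          # strictly below the bound and the best so far
                best, best_err = members, err
        if best is None:
            clusters.append([cond])     # no cluster can absorb cond within the bound
        else:
            best.append(cond)
    return clusters


def uniform_gap_estimate(members: List[Condition],
                         domain: Interval = (0.0, 10.0)) -> float:
    """Toy 1-D estimator: uncovered share of the hull, assuming uniform events."""
    spans = sorted(m[0] for m in members)
    covered = 0.0
    cur_lo, cur_hi = spans[0]
    for lo, hi in spans[1:]:
        if lo > cur_hi:                 # disjoint span: close the current run
            covered += cur_hi - cur_lo
            cur_lo, cur_hi = lo, hi
        else:                           # overlapping span: extend the run
            cur_hi = max(cur_hi, hi)
    covered += cur_hi - cur_lo
    hull_len = max(hi for _, hi in spans) - spans[0][0]
    return (hull_len - covered) / (domain[1] - domain[0])


conds = [[(0.0, 1.0)], [(1.5, 2.0)], [(8.0, 9.0)]]
clusters = cluster_conditions(conds, uniform_gap_estimate, error_bound=0.1)
```

Passing the estimator as a parameter keeps the clustering loop independent of how the error is estimated, so a distribution-aware measure could replace the toy one without changing the loop.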
Experimental results
To test the proposed algorithm, experiments were conducted to see if the algorithm observes a given error bound when clustering conditions over non-uniformly distributed data. For the experiments, PhysioBank (PhysioBank) was used, which is a large archive of well-characterized digital recordings of physiological signals and related annotations for use by the biomedical research communities. Among the many databases in PhysioBank, the UCD (University College Dublin) sleep apnea database and
Conclusion
This paper proposed an interval clustering algorithm that observes a user-specified error bound. The core of the algorithm is an error estimation measure. The measure can be used to estimate the percentage of false positive errors over any type of input data whose distribution is unknown. In the proposed method, it is also used to measure the distance between intervals. Based on the measured distance, the algorithm identifies the closest cluster for a given condition. The new condition can then
References (12)
- Asharaf et al. Rough set based incremental clustering of interval data. Pattern Recognition Letters (2006)
- Teller. A platform for wearable physiological computing. Interacting with Computers (2004)
- Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J., 2002. Models and issues in data stream systems. In:...
- Chavent and Lechevallier. Dynamical clustering of interval data. Optimization of an adequacy criterion based on Hausdorff distance (2002)
- Golab and Ozsu. Issues in data stream management. ACM SIGMOD Record (2003)
- PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation (2000)