Elsevier

Pattern Recognition Letters

Volume 36, 15 January 2014, Pages 171-176
Interval clustering algorithm for fast event detection in stream monitoring applications

https://doi.org/10.1016/j.patrec.2013.09.017

Abstract

In stream monitoring applications, it is important to rapidly identify abnormal events over bursty data arrivals. By clustering similar conditions used in event detection, it is possible to reduce the number of comparisons and improve event detection performance. On the other hand, event detection based on these clustered conditions can produce inaccurate results. Therefore, to use this method for critical applications, such as patient monitoring, the number of event detection errors needs to be kept within a tolerable level. This paper presents an interval clustering algorithm that provides an error control mechanism. The proposed algorithm enables a user to specify a permissible error bound, and then uses the bound as a threshold condition for clustering. A simulation based on real data showed that the algorithm improves the performance of event detection by clustering conditions while observing a user-specified error bound.

Introduction

Recently, applications that monitor streams of data items such as sensor readings, network measurements, auction bids, stock exchanges, and web page visits have attracted considerable interest. In stream monitoring applications, it is important to identify abnormal events over bursty data arrivals in a timely manner (Babcock et al., 2002; Golab and Ozsu, 2003). For example, a patient's body status can be monitored using biosensors in a medical center. In this application, abnormal events (e.g., life-threatening events) should be detected on time and reported to the medical staff immediately. The delayed detection of critical events may not be acceptable in this case. Similar examples can be found in network intrusion detection, plant monitoring, and so on.

In many applications, the conditions used for event detection can be represented as conjunctions of intervals. For example, BodyMedia's Armband (Teller, 2004) predefines the possible body statuses, such as sleeping, exercising, and reading. The application then uses a range of biosensor data to identify the user's body status. A body status can be represented as a conjunction of intervals, each of which describes a range of normal sensing values. For example, the status "sleep" can be represented as a conjunction of intervals on body temperature [36.0, 37.0] (°C), heart rate [80, 120], gait [0, 20], and others. An event that does not belong to any given status can be considered abnormal.
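The interval-conjunction representation can be sketched in Python as follows. This is a minimal illustration rather than the paper's implementation: the names `STATUSES`, `matches`, and `is_abnormal` are hypothetical, and the only status defined is the "sleep" example quoted above.

```python
from typing import Dict, Tuple

Interval = Tuple[float, float]          # closed interval [low, high]
Condition = Dict[str, Interval]         # attribute name -> interval

# Example status taken from the text; further statuses would be added the same way.
STATUSES: Dict[str, Condition] = {
    "sleep": {
        "body_temperature": (36.0, 37.0),
        "heart_rate": (80.0, 120.0),
        "gait": (0.0, 20.0),
    },
}


def matches(event: Dict[str, float], condition: Condition) -> bool:
    """True when every attribute value lies inside its interval (a conjunction)."""
    return all(low <= event[name] <= high
               for name, (low, high) in condition.items())


def is_abnormal(event: Dict[str, float],
                statuses: Dict[str, Condition] = STATUSES) -> bool:
    """An event that satisfies no known status is treated as abnormal."""
    return not any(matches(event, cond) for cond in statuses.values())
```

In this naive form, each event is compared against every condition, so the cost per event grows with the number of conditions.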

The number of conditions can become large in real-world applications. In this case, to meet the real-time constraint over bursty data arrivals, similar conditions can be clustered to reduce the number of comparisons. Many existing algorithms can be used for this purpose. Lingras et al. applied rough set theory to the K-means clustering algorithm to reflect unknown overlapping sets in clustering (Lingras and West, 2004). The same authors also proposed the use of fuzzy clustering as an alternative (Lingras and Yan, 2004). Asharaf et al. used rough set theory in clustering based on the leader clustering algorithm (Asharaf et al., 2006). Regarding the dissimilarity measure, Souza and Carvalho (2004) introduced an adaptive method based on the city-block distance, where the dissimilarity between two intervals is measured adaptively using different weights. Chavent and Lechevallier (2002) used the Hausdorff distance to compare interval data.
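For two closed intervals [a1, b1] and [a2, b2], the Hausdorff distance reduces to max(|a1 - a2|, |b1 - b2|). The sketch below shows this alongside a plain city-block distance on the interval endpoints. The function names are hypothetical, and the `weight` parameter is only a placeholder: the adaptive weighting scheme of the cited work is not reproduced here.

```python
from typing import Tuple

Interval = Tuple[float, float]  # closed interval [low, high]


def hausdorff(x: Interval, y: Interval) -> float:
    """Hausdorff distance between two closed intervals."""
    return max(abs(x[0] - y[0]), abs(x[1] - y[1]))


def city_block(x: Interval, y: Interval, weight: float = 1.0) -> float:
    """City-block distance on the endpoints, scaled by a fixed (non-adaptive) weight."""
    return weight * (abs(x[0] - y[0]) + abs(x[1] - y[1]))
```

For conditions made of several intervals, a common choice (assumed here, not taken from the text) is to sum the per-attribute distances.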

On the other hand, none of these algorithms provides a method to control the number of classification errors resulting from clustering. If event detection is performed based on the clustered intervals, the number of detection errors should be kept within a tolerable bound. As an example, the medical center would limit the percentage of errors to less than 1% to avoid noisy detection results.

This paper proposes an interval clustering algorithm that provides an error control mechanism. The proposed algorithm enables a user to specify a permissible error bound, which can be defined as the percentage of false positive errors that can occur during event detection. Given an error bound, the algorithm uses it as a threshold condition for clustering. In particular, when clustering a condition, the algorithm estimates the expected ratio of false positive errors based on the distribution of input events. The new condition can then be clustered only when its expected ratio is smaller than the given error bound.
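The thresholding idea can be illustrated with a small leader-style sketch. Several choices here are assumptions made for illustration and are not taken from the paper: a cluster is represented by the bounding hull of its member conditions, a false positive is an event accepted by the hull but by none of the members, and the expected ratio is estimated empirically as the fraction of such events in a sample. All names (`contains`, `hull`, `fp_ratio`, `cluster`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]
Condition = Tuple[Interval, ...]        # one closed interval per attribute


def contains(cond: Condition, event: Sequence[float]) -> bool:
    """True when the event lies inside every interval of the condition."""
    return all(lo <= v <= hi for (lo, hi), v in zip(cond, event))


def hull(conds: List[Condition]) -> Condition:
    """Smallest interval vector covering all conditions (assumed representative)."""
    return tuple((min(c[t][0] for c in conds), max(c[t][1] for c in conds))
                 for t in range(len(conds[0])))


def fp_ratio(members: List[Condition], sample: List[Sequence[float]]) -> float:
    """Fraction of sample events accepted by the hull but by no member condition."""
    if not sample:
        raise ValueError("sample must not be empty")
    h = hull(members)
    false_positives = sum(
        1 for e in sample
        if contains(h, e) and not any(contains(m, e) for m in members))
    return false_positives / len(sample)


def cluster(conditions: List[Condition], sample: List[Sequence[float]],
            bound: float) -> List[List[Condition]]:
    """Add each condition to the cluster with the lowest estimated ratio,
    but only if that ratio is below the bound; otherwise start a new cluster."""
    clusters: List[List[Condition]] = []
    for cond in conditions:
        best, best_ratio = None, None
        for members in clusters:
            ratio = fp_ratio(members + [cond], sample)
            if best_ratio is None or ratio < best_ratio:
                best, best_ratio = members, ratio
        if best is not None and best_ratio < bound:
            best.append(cond)
        else:
            clusters.append([cond])
    return clusters
```

With a tight bound, distant conditions stay in separate clusters because merging them would accept many events that match neither; relaxing the bound trades accuracy for fewer comparisons.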

The remainder of this paper is organized as follows. Section 2 introduces the preliminary concepts relevant to interval clustering. Section 3 describes an error estimation measure and the interval clustering algorithm based on it. Section 4 provides the experimental results of the proposed algorithm. Section 5 concludes the discussion.

Section snippets

Preliminaries

Let $X$ be a set of $n$ conditions such that $X=\{x_1,\ldots,x_n\}$. A condition, $x_j$, is represented as a vector of $m$ intervals such that $x_j=(x_{j1},x_{j2},\ldots,x_{jm})$, where $x_{jt}=[a_{jt},b_{jt}]\in I=\{[a,b] : a,b\in\mathbb{R},\ a\le b\}$ $(1\le t\le m)$. A partition, $P=(C_1,C_2,\ldots,C_k)$, of $X$ into $k$ equivalence classes $(k\le n)$ can then be found. Fig. 1 gives an example of clustering $n$ conditions into $k$ classes; the conditions in a class (or a cluster) need not be consecutive, as shown in the figure.

A cluster, $C_i$, is also represented as a vector. The reference

Proposed algorithm

The proposed algorithm can be viewed as an extension of the leader clustering algorithm, augmented with an error control mechanism. To support error control, the algorithm estimates the amount of error incurred by clustering intervals. The percentage of false positive errors is used as the error measure; its estimation is discussed in the first subsection. A clustering algorithm, which uses the estimation measure for error control, is then presented. To simplify

Experimental results

To test the proposed algorithm, experiments were conducted to determine whether the algorithm observes a given error bound ω when clustering conditions over non-uniformly distributed data. The experiments used PhysioBank (PhysioBank), a large archive of well-characterized digital recordings of physiological signals and related annotations for use by the biomedical research communities. Among the many databases in PhysioBank, the UCD (University College Dublin) sleep apnea database and

Conclusion

This paper proposed an interval clustering algorithm that observes a user-specified error bound. The core of the algorithm is an error estimation measure. The measure can be used to estimate the percentage of false positive errors over any type of input data whose distribution is unknown. In the proposed method, it is also used to measure the distance between intervals. Based on the measured distance, the algorithm identifies the closest cluster for a given condition. The new condition can then

References (12)

  • S. Asharaf et al. Rough set based incremental clustering of interval data. Pattern Recognition Letters (2006)
  • A. Teller. A platform for wearable physiological computing. Interacting with Computers (2004)
  • B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom. Models and issues in data stream systems. In:... (2002)
  • M. Chavent et al. Dynamical clustering of interval data. Optimization of an adequacy criterion based on Hausdorff distance
  • L. Golab et al. Issues in data stream management. ACM SIGMOD Record (2003)
  • A.L. Goldberger et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation (2000)
