Novel class detection in data streams using local patterns and neighborhood graph
Introduction
The purpose of data stream classification is to determine which category an observation belongs to, where the observations are part of an infinite-length data stream. Infinite length, high speed, limited response time and concept drift are the challenges faced in classifying data streams. Many studies address these challenges [1], [2], [3], [4], [5], [6], [7], [8], [9]; however, most of them ignore another major challenge, “concept evolution”, which leads to the emergence of novel classes.
In most real-world data streams, the emergence of novel classes is inevitable. For example, a new kind of intrusion may appear in network traffic, or a new category of stones may be discovered by a Mars rover. Therefore, not all classes have instances available at the start of the stream to train a learner, and the exact number of classes is unknown at first. In such a case, the goal of the learner is to accurately classify the instances that belong to existing classes and simultaneously detect the emergence of novel classes. Here, an existing class is defined as a class of which at least one instance has been observed since the start of the stream. The general workflow and conditions for novel class detection in data streams are given below.
In the initial phase of training, the learner uses the first M instances of the stream as the training set to build an initial classifier model (M is usually a small number). After building the initial classifier, for each newly arrived instance, the learner has to determine whether the instance belongs to one of the existing classes (and specifically, which one) or to a novel class. To detect the emergence of a novel class, the learner needs to see a group of newly arrived instances, not just one. Therefore, classification can be postponed until the learner has seen enough instances to decide with confidence whether the instance belongs to a novel class. However, there is a maximum allowable time up to which the learner can postpone the classification of each instance. In supervised methods, the learner receives the true label of each instance within a time limit (or immediately) after the instance is classified. The learner updates its model periodically with respect to the observed instances.
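The postponement rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `process_stream`, `classify`, `fallback`, `max_delay` and `min_group` are hypothetical, time is counted in arrival steps, and a group of postponed instances is only flagged as novel-class candidates, without the further checks a real learner would apply.

```python
from collections import deque

NOVEL = "novel-candidate"  # placeholder label, an assumption of this sketch


def process_stream(stream, classify, fallback, max_delay, min_group):
    """Yield (instance, label) pairs, postponing uncertain instances.

    classify(x) returns an existing class label, or None when x fits no
    existing class. fallback(x) always returns an existing class label and
    is used when the maximum allowable delay has expired.
    """
    buffer = deque()  # postponed entries: (arrival_step, instance)
    for t, x in enumerate(stream):
        label = classify(x)
        if label is not None:
            yield x, label
        else:
            buffer.append((t, x))
        # Enough postponed instances: release them as novel-class candidates.
        if len(buffer) >= min_group:
            while buffer:
                yield buffer.popleft()[1], NOVEL
        # Deadline reached: the oldest postponed instances must be labeled now.
        while buffer and t - buffer[0][0] >= max_delay:
            old = buffer.popleft()[1]
            yield old, fallback(old)
    # End of a finite stream: label whatever is still waiting.
    while buffer:
        old = buffer.popleft()[1]
        yield old, fallback(old)
```

With a generous `max_delay`, isolated unfamiliar instances wait until a group of `min_group` has formed; with a tight `max_delay`, they are forced into an existing class before a group can accumulate.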
There are two conditions that must be verified to declare the emergence of a novel class: the cohesion–separation condition and the threshold condition. The former is based on the cluster assumption and implies that the novel class instances must be more similar to each other than to instances of other classes. The latter implies that the number of candidate instances for a novel class must exceed a given threshold q. The threshold is used to distinguish between outliers and novel class instances; when the number of candidates is less than the given threshold, we assume that they are outliers of existing classes. The appropriate value of the threshold is determined by experts and depends on the application.
Due to the aforementioned conditions, a learner has to check a group of instances in order to detect the emergence of a novel class. Therefore, existing methods are either chunk-based [10], [11], [12] or based on time constraints [13], [14]. These methods can also be divided into two learning categories: supervised [13], [10], [11] and unsupervised [15], [14]. In the next section, we briefly discuss these categories.
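As a rough illustration of the two conditions, the Python sketch below tests a group of candidates against the instances of existing classes. The distance-based measures are assumptions chosen for simplicity (mean nearest-neighbor distance among candidates as cohesion, mean distance to the nearest existing instance as separation); they are not the measures defined in this paper, and the function name `declares_novel_class` is hypothetical.

```python
import math


def declares_novel_class(candidates, existing, q):
    """Return True when both conditions hold for the candidate group.

    candidates, existing: non-empty sequences of equal-length numeric tuples.
    q: threshold separating outliers from novel-class instances.
    """
    # Threshold condition: the number of candidates must exceed q.
    if len(candidates) <= q or len(candidates) < 2:
        return False
    cohesion = 0.0    # mean distance to the nearest other candidate
    separation = 0.0  # mean distance to the nearest existing-class instance
    for i, c in enumerate(candidates):
        cohesion += min(
            math.dist(c, other)
            for j, other in enumerate(candidates) if j != i
        )
        separation += min(math.dist(c, e) for e in existing)
    cohesion /= len(candidates)
    separation /= len(candidates)
    # Cohesion-separation condition (assumed form): candidates lie closer
    # to each other than to the existing classes.
    return cohesion < separation
```

A tight group far from the existing instances passes both checks, whereas a few scattered points are treated as outliers of existing classes.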
In this paper, we propose a new supervised chunk-based approach to the joint problem of novel class detection and classification. The proposed method utilizes local patterns, which are based on the impact of some categorical features on the range of values of ordinal and continuous features. In addition, our method constructs a graph from the novel class candidates and analyzes its connected components. Using a graph helps to check cohesion and separation more accurately. Like almost all existing approaches, we use an ensemble of learners; however, we define new measures for updating the ensemble, which enhance our method's precision. We call the proposed algorithm LOCE (LOcal Classifier Ensemble). We apply LOCE to a number of real and synthetic benchmark data sets and obtain superior performance over the state-of-the-art methods.
The rest of this paper is organized as follows. Related work is discussed in Section 2. The proposed method is given in Section 3. Experimental results are presented in Section 4. Section 5 concludes the paper and outlines future work.
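The idea of analyzing connected components of a graph built over novel class candidates can be sketched in Python as follows. This excerpt does not specify how LOCE defines the edges of its graph, so the sketch assumes a simple neighborhood rule: two candidates are linked when their Euclidean distance is at most a hypothetical `radius` parameter. Components with more than q members are kept as possible novel classes.

```python
import math


def connected_components(points, radius):
    """Return components (sorted index lists) of the neighborhood graph.

    Assumed edge rule: points i and j are linked when their Euclidean
    distance is at most radius.
    """
    n = len(points)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:  # depth-first traversal from the start node
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if not seen[j] and math.dist(points[i], points[j]) <= radius:
                    seen[j] = True
                    stack.append(j)
        components.append(sorted(component))
    return components


def novel_components(points, radius, q):
    """Keep only the components whose size exceeds the threshold q."""
    return [c for c in connected_components(points, radius) if len(c) > q]
```

Working on components rather than on the whole candidate set lets several well-separated groups in the same chunk be judged independently, which is one way a graph can make the cohesion and separation check more precise.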
Related work
Novelty detection [16], [17], [18], [19] is similar to novel class detection. In novelty detection, observations that are dissimilar to existing classes are considered novelties. In novel class detection, however, more conditions must be verified. First, the similarity among novel class candidates has to be greater than the similarity between them and the existing classes. Second, the number of candidates has to exceed a given threshold.
Existing approaches for novel class
The proposed method
In the related work section, we discussed existing methods in detail and identified two major drawbacks. First, sudden or rapid concept drift causes a temporary decrease in the efficiency of existing methods. Second, existing methods are sometimes unable to verify the cohesion–separation condition accurately. In this section, we propose a new ensemble method called LOCE to overcome these drawbacks and also enhance the accuracy of classification and
Experiments
We implemented LOCE in Java; the code for K-means was obtained from the Weka open-source machine learning repository. To evaluate LOCE, we tested it on several benchmark data sets. We also compared our method with CLAM and ExMiner, which are the most well-known methods for novel class detection in data streams. We designed a set of experiments to address the following questions:
- How do the different values of parameters affect the efficiency of the
Conclusion and future works
We have proposed a new method for joint classification and novel class detection in data streams. The proposed method utilizes local patterns to improve the accuracy of both distinguishing between instances of novel and existing classes and classifying instances of existing classes. A class that appears for the first time in a chunk is considered a novel class, and in subsequent chunks it is considered an existing class. In order to address the slow adaptation to concept-drift issue in
Acknowledgment
The authors would like to thank the anonymous reviewers for their helpful comments that improved the paper.
References (25)
- et al., Learning from concept drifting data streams with unlabeled data, Neurocomputing (2012)
- et al., A similarity-based approach for data stream classification, Expert Syst. Appl. (2014)
- et al., An experimental evaluation of novelty detection methods, Neurocomputing (2014)
- Kernel PCA for novelty detection, Pattern Recognit. (2007)
- et al., A novelty detection machine and its application to bank failure prediction, Neurocomputing (2014)
- et al., An adaptive ensemble classifier for mining concept drifting data streams, Expert Syst. Appl. (2013)
- Z. Ahmadi, H. Beigy, Semi-supervised ensemble learning of data streams in the presence of concept drift, in:...
- et al., Using a classifier pool in accuracy based tracking of recurring concepts in data stream classification, Evol. Syst. (2013)
- P. Sobhani, H. Beigy, New drift detection method for data streams, in: Proceedings of the 2nd International Conference...
- A. Gholipour, M.J. Hosseini, H. Beigy, An adaptive regression tree for non-stationary data streams, in: Proceedings of...
Poorya ZareMoodi was born in Mashhad, Iran, in 1989. He received the B.Sc. degree in Computer Engineering from the Ferdowsi University of Mashhad, Iran, in 2011. In 2013, he completed the M.Sc. degree in Computer Engineering at the Sharif University of Technology, Iran. His research interests are in the areas of data stream mining, big data, pattern recognition and machine learning.
Hamid Beigy received the B.Sc. and M.Sc. degrees in Computer Engineering from Shiraz University, Iran, in 1992 and 1995, respectively. He also received the Ph.D. degree in Computer Engineering from the Amirkabir University of Technology, Iran, in 2004. Currently, he is an Associate Professor in the Department of Computer Engineering at the Sharif University of Technology, Tehran, Iran. His research interests include learning systems and high performance computing.
Sajjad Kamali Siahroudi was born in Tehran, Iran, in 1986. He received the B.Sc. degree in Computer Engineering from the Buali Sina University, Iran, in 2010. In 2012, he completed the M.Sc. degree in Computer Engineering at the Sharif University of Technology, Iran. His research interests are in the areas of data stream mining, multi-label classification, pattern recognition and machine learning.