
Neurocomputing

Volume 158, 22 June 2015, Pages 234-245

Novel class detection in data streams using local patterns and neighborhood graph

https://doi.org/10.1016/j.neucom.2015.01.037

Abstract

Data stream classification is one of the most challenging areas in machine learning. In this paper, we focus on three major challenges, namely infinite length, concept-drift and concept-evolution. Infinite length makes it impossible to store all instances. Concept-drift is the change in the underlying concept and occurs in almost every data stream. Concept-evolution is the arrival of novel classes and is an undeniable phenomenon in most real-world data streams. There is a large body of research on data stream classification, but most of it focuses on the first two challenges and ignores the last one. In this paper, we propose a new ensemble-based method whose classifiers use local patterns to enhance accuracy. A local pattern is a group of Boolean features that have local influence on ordinal and categorical features. Moreover, to enhance the accuracy of novel class detection, we construct a neighborhood graph among novel class candidates and analyze the connected components of the constructed graph. Experiments on both real and synthetic benchmark data sets show the superiority of the proposed method over related state-of-the-art techniques.

Introduction

The purpose of data stream classification is to determine which category an observation belongs to. These observations are part of an infinite-length data stream. Infinite length, high speed, limited response time and concept drift are the challenges that we face in classifying data streams. Many studies address the aforementioned challenges [1], [2], [3], [4], [5], [6], [7], [8], [9]; however, most of them ignore another major challenge, concept evolution, which leads to the emergence of novel classes.

In most real-world data streams, the emergence of novel classes is an inevitable phenomenon. For example, a new kind of intrusion may appear in network traffic, or a new category of stones may be discovered by a Mars rover. Therefore, instances from all classes are not available at the start of the stream to train a learner. Also, the exact number of classes is unknown at first. In such a case, the goal of the learner is to accurately classify the instances that belong to existing classes and simultaneously detect the emergence of novel classes. Here, an existing class is defined as a class at least one of whose instances has been observed since the start of the stream. The general workflow and conditions for novel class detection in data streams are given below.

In the initial training phase, the learner uses the first M instances of the stream as the training set to build an initial classifier model (M is usually a small number). After building the initial classifier, for each newly arrived instance, the learner has to determine whether the instance belongs to one of the existing classes (and, specifically, which one) or to a novel class. In order to detect the emergence of a novel class, the learner should see a group of newly arrived instances, not just one. Therefore, classification can be postponed until enough instances have been seen by the learner to gain confidence in deciding whether the instance belongs to a novel class or not. However, there is a maximum allowable time up to which the learner can postpone the classification of each instance. In supervised methods, the learner receives the true label of each instance within a time limit (or immediately) after the instance is classified. The learner updates its model periodically with respect to the observed instances.
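To make this workflow concrete, the following is a minimal, illustrative Java sketch of the deferred-decision loop (Java is also the language of our implementation). The Model interface and the constants M and MAX_DEFER are hypothetical placeholders for illustration, not the actual LOCE components.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Minimal, illustrative sketch of the deferred-decision stream loop. */
class StreamLoopSketch {
    static final int M = 1000;        // size of the initial training set (assumption)
    static final int MAX_DEFER = 500; // maximum time a decision may be postponed (assumption)

    /** Hypothetical classifier interface; not the actual LOCE model. */
    interface Model {
        void train(List<double[]> firstChunk);                 // build the initial classifier
        boolean isOutlier(double[] x);                         // x lies outside every existing class?
        int classify(double[] x);                              // label of the best-matching existing class
        boolean novelClassEmerged(List<double[]> candidates);  // cohesion-separation + threshold test
    }

    void run(Iterator<double[]> stream, Model model) {
        // Initial training phase on the first M instances of the stream.
        List<double[]> initial = new ArrayList<>();
        for (int i = 0; i < M && stream.hasNext(); i++) initial.add(stream.next());
        model.train(initial);

        List<double[]> deferred = new ArrayList<>();  // novel-class candidates awaiting a verdict
        long t = 0;                                   // arrival index of the current instance
        long oldestDeferredAt = -1;                   // arrival index of the oldest deferred instance

        while (stream.hasNext()) {
            double[] x = stream.next();
            t++;
            if (model.isOutlier(x)) {
                if (deferred.isEmpty()) oldestDeferredAt = t;
                deferred.add(x);                      // postpone the decision on this instance
            } else {
                int label = model.classify(x);        // emit label; the true label arrives later
            }
            if (!deferred.isEmpty()) {
                // Decide on deferred instances once a novel class emerges
                // or the oldest instance has waited as long as allowed.
                boolean timeUp = (t - oldestDeferredAt) >= MAX_DEFER;
                if (timeUp || model.novelClassEmerged(deferred)) {
                    // label the buffered instances (novel class or nearest existing class) and flush
                    deferred.clear();
                }
            }
        }
    }
}
```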

There are two conditions that must be verified to declare the emergence of a novel class: the cohesion–separation condition and the threshold condition. The former is based on the cluster assumption and implies that novel class instances must be more similar to each other than to instances of other classes. The latter implies that the number of candidate instances for a novel class must exceed a given threshold q. The threshold is used to distinguish between outliers and novel class instances; when the number of candidates is less than the given threshold, we assume that they are outliers of existing classes. The appropriate value of the threshold is determined by experts based on the application.
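The sketch below shows one simple way these two conditions could be operationalized: a mean-distance proxy for cohesion and separation plus the count threshold q. This distance-based criterion is an assumption for illustration and is not necessarily the exact measure used by LOCE.

```java
import java.util.List;

/** Illustrative check of the two novel-class conditions; not the exact LOCE criterion. */
class NovelClassConditions {

    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    /** Mean distance from x to every point in the given set (excluding x itself). */
    static double meanDistance(double[] x, List<double[]> points) {
        double sum = 0;
        int n = 0;
        for (double[] p : points) {
            if (p == x) continue;
            sum += euclidean(x, p);
            n++;
        }
        return n == 0 ? Double.POSITIVE_INFINITY : sum / n;
    }

    /**
     * The candidates form a novel class if (1) there are at least q of them and
     * (2) each candidate is, on average, closer to the other candidates than to
     * the instances of the existing classes (a simple cohesion-separation proxy).
     */
    static boolean novelClassEmerged(List<double[]> candidates,
                                     List<double[]> existingClassInstances,
                                     int q) {
        if (candidates.size() < q) return false;      // threshold condition
        for (double[] x : candidates) {
            double cohesion = meanDistance(x, candidates);
            double separation = meanDistance(x, existingClassInstances);
            if (cohesion >= separation) return false; // cohesion-separation condition violated
        }
        return true;
    }
}
```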

Due to the aforementioned conditions, a learner has to check a group of instances in order to detect the emergence of a novel class. Therefore, existing methods are either chunk-based [10], [11], [12] or based on time constraints [13], [14]. Also, these methods can be divided into two learning categories: supervised [13], [10], [11] and unsupervised [15], [14]. In the next section, we briefly discuss these categories.

In this paper, we propose a new supervised chunk-based approach for the joint novel class detection and classification problem. The proposed method utilizes local patterns, which are based on the impact of some categorical features on the range of values of ordinal and continuous features. In addition, our method constructs a graph over the novel class candidates and analyzes its connected components. Using graphs helps to check cohesion and separation more accurately. Like almost all existing approaches, we use ensemble learners; however, we define new measures for updating the ensemble that enhance our method's precision. We call the proposed algorithm LOCE (LOcal Classifier Ensemble). We apply LOCE to a number of real and synthetic benchmark data sets, and obtain superior performance over the state-of-the-art methods.
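As a rough illustration of the graph-based step, the sketch below connects candidate instances that lie within a hypothetical radius r and extracts the connected components with a breadth-first search. The radius-based construction is an assumption for illustration; the actual LOCE graph construction and the analysis of its components are described in Section 3.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Simplified sketch: neighborhood graph over candidates plus connected components. */
class NeighborhoodGraphSketch {

    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    /**
     * Connects every pair of candidates closer than radius r and returns the
     * connected components, each as a list of candidate indices. Components
     * larger than the threshold q can then be treated as coherent groups of
     * novel-class candidates.
     */
    static List<List<Integer>> connectedComponents(List<double[]> candidates, double r) {
        int n = candidates.size();
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (euclidean(candidates.get(i), candidates.get(j)) < r) {
                    adj.get(i).add(j);
                    adj.get(j).add(i);
                }

        boolean[] visited = new boolean[n];
        List<List<Integer>> components = new ArrayList<>();
        for (int s = 0; s < n; s++) {
            if (visited[s]) continue;
            List<Integer> comp = new ArrayList<>();
            Deque<Integer> queue = new ArrayDeque<>();  // breadth-first traversal
            queue.add(s);
            visited[s] = true;
            while (!queue.isEmpty()) {
                int u = queue.poll();
                comp.add(u);
                for (int v : adj.get(u))
                    if (!visited[v]) { visited[v] = true; queue.add(v); }
            }
            components.add(comp);
        }
        return components;
    }
}
```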

The rest of this paper is organized as follows. Related work is discussed in Section 2. The proposed method is given in Section 3. Experimental results are presented in Section 4. The paper concludes with conclusions and future work in Section 5.


Related work

Novelty detection [16], [17], [18], [19] is similar to novel class detection. In novelty detection, observations that are dissimilar to the existing classes are considered as novelty. However, in novel class detection there are more conditions to be verified. First, the similarity among the novel class candidates has to be greater than their similarity to the existing classes. Second, the number of candidates has to be greater than a given threshold.

Existing approaches for novel class

The proposed method

In the related work section we discussed existing methods. As mentioned there in detail, existing methods have two major drawbacks. First, sudden or rapid concept drift causes a temporary decrease in their efficiency. Second, existing methods are sometimes unable to verify the cohesion–separation condition accurately. In this section we propose a new ensemble method called LOCE to overcome the drawbacks of existing methods and also enhance the accuracy of classification and

Experiments

We implemented LOCE in Java; the code for K-means was obtained from the Weka machine learning open-source repository (a minimal usage sketch is given after the list below). In order to evaluate LOCE, we tested it on several benchmark data sets. We also compared our method with CLAM and ExMiner, which are the most well-known methods for novel class detection in data streams. We designed a set of experiments to address the following questions:

  • How do the different values of parameters affect the efficiency of the
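For reference, the following is a minimal sketch of how Weka's SimpleKMeans can be invoked from Java; the file name, number of clusters, and the surrounding pseudo-point summarization are assumptions for illustration, not the exact experimental setup.

```java
import weka.clusterers.SimpleKMeans;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Minimal Weka K-means usage, analogous to how a clustering step can be invoked. */
public class KMeansSketch {
    public static void main(String[] args) throws Exception {
        // Load one chunk of the stream from an ARFF file (path is illustrative).
        Instances chunk = new DataSource("chunk.arff").getDataSet();

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(50);               // number of clusters per chunk (assumption)
        kmeans.setPreserveInstancesOrder(true);
        kmeans.buildClusterer(chunk);

        // Assign each instance in the chunk to its cluster.
        for (int i = 0; i < chunk.numInstances(); i++) {
            Instance inst = chunk.instance(i);
            int cluster = kmeans.clusterInstance(inst);
            // ... summarize each cluster into a pseudo-point / micro-cluster
        }
        System.out.println(kmeans.getClusterCentroids());
    }
}
```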

Conclusion and future works

We have proposed a new method for joint classification and novel class detection in data streams. The proposed method utilizes local patterns to improve the accuracy of both distinguishing between instances of novel and existing classes and classifying instances of existing classes. A class that appears for the first time in a chunk is considered a novel class, and in the subsequent chunks it is considered an existing class. In order to address the slow adaptation to the concept-drift issue in

Acknowledgment

The authors would like to thank the anonymous reviewers for their helpful comments that improved the paper.


References (25)

  • A. Ghazikhani, R. Monsefi, H. Sadoghi Yazdi, Ensemble of online neural networks for non-stationary and imbalanced data...
  • M.J. Hosseini, Z. Ahmadi, H. Beigy, New management operations on classifiers pool to track recurring concepts, in:...

    Poorya ZareMoodi was born in Mashhad, Iran, in 1989. He received the B.Sc. degree in Computer Engineering from the Ferdowsi University of Mashhad, Iran, in 2011. In 2013, he completed the M.Sc. degree in Computer Engineering at the Sharif University of Technology, Iran. His research interests are in the areas of data stream mining, big data, pattern recognition and machine learning.

Hamid Beigy received the B.Sc. and M.Sc. degrees in Computer Engineering from Shiraz University, Iran, in 1992 and 1995, respectively. He also received the Ph.D. degree in Computer Engineering from the Amirkabir University of Technology, Iran, in 2004. Currently, he is an Associate Professor in the Department of Computer Engineering at the Sharif University of Technology, Tehran, Iran. His research interests include learning systems and high performance computing.

    Sajjad Kamali Siahroudi was born in Tehran, Iran, in 1986. He received the B.Sc. degree in Computer Engineering from the Buali Sina University, Iran, in 2010. In 2012, he completed the M.Sc. degree in Computer Engineering at the Sharif University of Technology, Iran. His research interests are in the areas of data stream mining, multi-label classification, pattern recognition and machine learning.
