
A grid density based framework for classifying streaming data in the presence of concept drift

Journal of Intelligent Information Systems

Abstract

Mining data streams is the process of extracting information from continuous, rapidly flowing data records to provide knowledge that is reliable and timely. Streaming data algorithms need to be one-pass and operate under strict limitations of memory and response time. In addition, the classification of streaming data requires learning in an environment where the data characteristics might change constantly. Many of the classification algorithms presented in the literature assume a 100% labeling rate, which is impractical and expensive when data records are rapidly flowing in. In this paper, a new incremental grid density based learning framework, the GC3 framework, is proposed to perform classification of streaming data with concept drift and limited labeling. The proposed framework uses grid density clustering to detect changes in the input data space. It maintains an evolving ensemble of classifiers to learn and adapt to the model changes over time. The framework also uses a uniform grid density sampling mechanism to obtain a uniform subset of samples for better classification performance with a lower labeling rate. The entire framework is designed to be one-pass, incremental, and able to work with limited memory to perform any-time classification on demand. Experimental comparison with state-of-the-art concept drift handling systems demonstrates the GC3 framework's ability to provide high classification performance, using fewer models in the ensemble and with only 4–6% of the samples labeled. The results show that the GC3 framework is effective and attractive for use in real-world data stream classification applications.
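The abstract does not spell out how the uniform grid density sampling works, so the following is only an illustrative sketch of the general idea and not the GC3 implementation. It assumes a fixed cell width and a fixed labeling budget per grid cell; the names `cell_of`, `GridSampler`, `width` and `per_cell_budget` are hypothetical. Each arriving record is mapped to a grid cell in one pass, the cell's density count is updated, and a label is requested only while the cell is still under its budget, so the labeled subset is spread evenly over the occupied cells instead of being dominated by dense regions.

```python
import math
from collections import defaultdict
from typing import Dict, Sequence, Tuple

Cell = Tuple[int, ...]


def cell_of(x: Sequence[float], width: float) -> Cell:
    """Map a point to the index of the grid cell that contains it."""
    return tuple(math.floor(v / width) for v in x)


class GridSampler:
    """One-pass sampler (hypothetical): at most `per_cell_budget` labels per cell."""

    def __init__(self, width: float, per_cell_budget: int) -> None:
        self.width = width                      # assumed fixed cell width
        self.per_cell_budget = per_cell_budget  # assumed fixed label budget
        self.density: Dict[Cell, int] = defaultdict(int)  # records seen per cell
        self.labeled: Dict[Cell, int] = defaultdict(int)  # labels requested per cell

    def observe(self, x: Sequence[float]) -> bool:
        """Update the cell density and return True if x should be labeled."""
        c = cell_of(x, self.width)
        self.density[c] += 1
        if self.labeled[c] < self.per_cell_budget:
            self.labeled[c] += 1
            return True
        return False


# Toy stream: 100 records in one dense cell, 2 records in a sparse cell.
stream = [(0.1 + 0.001 * i, 0.2) for i in range(100)] + [(3.5, 3.5), (3.6, 3.4)]
sampler = GridSampler(width=1.0, per_cell_budget=2)
to_label = [x for x in stream if sampler.observe(x)]
labeling_rate = len(to_label) / len(stream)
```

In this toy stream the dense cell and the sparse cell each contribute the same number of label requests, which is the sense in which the labeled subset is uniform. A cell that becomes occupied for the first time could also serve as a crude signal that the input space has changed, although how the framework actually detects drift is not described in the abstract.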




Correspondence to Tegjyot Singh Sethi.


Cite this article

Sethi, T.S., Kantardzic, M. & Hu, H. A grid density based framework for classifying streaming data in the presence of concept drift. J Intell Inf Syst 46, 179–211 (2016). https://doi.org/10.1007/s10844-015-0358-3


  • DOI: https://doi.org/10.1007/s10844-015-0358-3
