Abstract
Classifying a non-stationary data stream with recurrent drift is a challenging task that has attracted considerable interest in recent years. All existing approaches that handle recurrent concepts maintain a pool of concepts/classifiers and use that pool for future classifications to reduce the error on instances from a recurring concept. However, the number of classifiers in the pool usually grows very fast, because accurately detecting the underlying concept is itself a challenging task. As a result, the pool may contain many concepts that represent the same underlying concept. This paper proposes the GraphPool framework, which refines the pool of concepts by applying a merging mechanism whenever necessary: after receiving a new batch of data, we extract a concept representation from the current batch that takes the correlation among features into account. We then compare the current batch representation to the concept representations in the pool using a statistical multivariate likelihood test. If more than one concept is similar to the current batch, all the corresponding concepts are merged. GraphPool not only keeps the concepts but also maintains the transitions among concepts via a first-order Markov chain. The current state is maintained at all times, and new instances are predicted based on it. Keeping these transitions helps to recover quickly from drifts in some real-world problems with periodic behavior. Comprehensive experiments on synthetic and real-world data show the effectiveness of the framework in terms of performance and pool management.
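To make the transition-keeping idea concrete, the following is a minimal sketch of a first-order Markov chain over concept identifiers, including what a merge of two pool entries could do to the recorded transitions. It is not the GraphPool implementation: the class `ConceptTransitions`, its method names, and the choice to discard self-transitions after a merge are assumptions made for illustration only, and the statistical similarity test that would trigger a merge is not shown.

```python
from collections import defaultdict


class ConceptTransitions:
    """Hypothetical first-order Markov chain over concept ids (illustration only)."""

    def __init__(self):
        # counts[a][b] = number of observed transitions from concept a to concept b
        self.counts = defaultdict(lambda: defaultdict(int))
        self.current = None  # id of the concept the stream is currently in

    def observe(self, concept_id):
        """Record that the latest batch was assigned to `concept_id`."""
        if self.current is not None and concept_id != self.current:
            self.counts[self.current][concept_id] += 1
        self.current = concept_id

    def probability(self, source, target):
        """Empirical probability of moving from `source` to `target`."""
        row = self.counts.get(source, {})
        total = sum(row.values())
        return row.get(target, 0) / total if total else 0.0

    def most_likely_next(self):
        """Concept most often seen after the current one, or None if unknown."""
        row = self.counts.get(self.current)
        if not row:
            return None
        return max(row, key=row.get)

    def merge(self, keep, drop):
        """Fold concept `drop` into `keep` (assumed behavior for a pool merge)."""
        if keep == drop:
            return
        # Outgoing transitions of `drop` now originate from `keep`.
        for target, n in self.counts.pop(drop, {}).items():
            self.counts[keep][target] += n
        # Incoming transitions that pointed at `drop` now point at `keep`.
        for source in list(self.counts):
            row = self.counts[source]
            if drop in row:
                moved = row.pop(drop)
                row[keep] += moved
        # Assumption: self-transitions created by the merge are discarded.
        self.counts[keep].pop(keep, None)
        if self.current == drop:
            self.current = keep


chain = ConceptTransitions()
for concept in ["A", "B", "A", "B", "A", "C", "A"]:
    chain.observe(concept)
```

In this toy stream, concept A is followed by B twice and by C once, so `chain.most_likely_next()` suggests B as the concept to try first after a drift away from A. Calling `chain.merge("B", "C")` would fold the transitions of C into B, mirroring the case in which two pool entries turn out to represent the same underlying concept.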
Notes
Raw data were extracted from http://db.csail.mit.edu/labdata/labdata.html.
Raw data were extracted from ftp://ftp.ncdc.noaa.gov/pub/data/gsod/.
We tried to compare our method to the method presented by Yang et al. [53], as it was proposed to handle recurrent concepts and takes a close, yet different, approach from this paper. Unfortunately, we could not reach the corresponding author, and some parts of the explanation of the method were unclear, which prevented reimplementation. We have compared the concept similarity algorithm proposed in [53] to our statistical similarity test in the following subsections.
We have used the implementation provided at https://sites.google.com/site/moaextensions/ for RCD, Learn++.NSE and DWM.
We have used the code provided in the MOA framework.
After 72 h, the experiment had finished for only 10% of the data.
References
Aggarwal CC (2014) Data classification: algorithms and applications. CRC Press, Boca Raton
Aggarwal CC, Han J, Wang J, Yu PS (2004) On demand classification of data streams. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), ACM, pp 503–508
Anderson TW (2003) An introduction to multivariate statistical analysis. Wiley, New York
Ángel AM, Bartolo GJ, Ernestina M (2016) Predicting recurring concepts on data-streams by means of a meta-model and a fuzzy similarity function. Expert Syst Appl 46:87–105
Baena-García M, del Campo-Ávila J, Fidalgo R, Bifet A, Gavaldà R, Morales-Bueno R (2006) Early drift detection method. In: Proceedings of the fourth international workshop on knowledge discovery from data streams, vol 6, pp 77–86
Bengio Y, Frasconi P (1996) Input-output HMMs for sequence processing. IEEE Trans Neural Netw 7(5):1231–1249
Bifet A, Gavaldà R (2007) Learning from time-changing data with adaptive windowing. In: Proceedings of the seventh SIAM international conference on data mining (SDM), SIAM, pp 443–448
Bifet A, Holmes G, Pfahringer B, Kirkby R, Gavaldà R (2009) New ensemble methods for evolving data streams. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), ACM, pp 139–148
Bifet A, Holmes G, Kirkby R, Pfahringer B (2010a) MOA: massive online analysis. J Mach Learn Res 11:1601–1604
Bifet A, Holmes G, Pfahringer B (2010b) Leveraging bagging for evolving data streams. In: Machine learning and knowledge discovery in databases: proceedings of european conference on machine learning (ECML/PKDD), Springer, pp 135–150
Bifet A, Read J, Zliobaite I, Pfahringer B, Holmes G (2013) Pitfalls in benchmarking data stream classification and how to avoid them. In: Machine learning and knowledge discovery in databases: proceedings of european conference on machine learning (ECML/PKDD), Springer, pp 465–479
Borchani H, Martínez AM, Masegosa AR, Langseth H, Nielsen TD, Salmerón A, Fernández A, Madsen AL, Sáez R (2015) Modeling concept drift: a probabilistic graphical model based approach. In: Proceedings of the international symposium on intelligent data analysis, Springer, pp 72–83
Brzeziński D, Stefanowski J (2011) Accuracy updated ensemble for data streams with concept drift. In: Proceedings of the 6th international conference on hybrid artificial intelligence systems, Springer, pp 155–163
Brzezinski D, Stefanowski J (2014) Reacting to different types of concept drift: the accuracy updated ensemble algorithm. IEEE Trans Neural Netw Learn Syst 25(1):81–94
Dietterich TG (2002) Machine learning for sequential data: a review. In: Caelli T, Amin A, Duin RPW, de Ridder D, Kamel M (eds) Structural, syntactic, and statistical pattern recognition. Springer, pp 15–30
Elwell R, Polikar R (2011) Incremental learning of concept drift in nonstationary environments. IEEE Trans Neural Netw 22(10):1517–1531
Gama J (2010) Knowledge discovery from data streams. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, CRC Press, Boca Raton
Gama J, Kosina P (2014) Recurrent concepts in data streams classification. Knowl Inf Syst 40(3):489–507
Gama J, Zliobaite I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv 46(4):1–44
Gomes JB, Gaber MM, Sousa PA, Menasalvas E (2013) Mining recurring concepts in a dynamic feature space. IEEE Trans Neural Netw Learn Syst 25(1):95–110
Gonçalves PM Jr, Barros RS (2013) RCD: a recurring concept drift framework. Pattern Recognit Lett 34(9):1018–1025
Hahsler M, Dunham MH (2011) Temporal structure learning for clustering massive data streams in real-time. In: Proceedings of the 2011 SIAM international conference on data mining (SDM), SIAM, pp 664–675
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. ACM SIGKDD Explor Newslett 11(1):10–18
Harries M (1999) Splice-2 comparative evaluation: electricity pricing. University of New South Wales, Technical report
Hosseini MJ, Ahmadi Z, Beigy H (2011) Pool and accuracy based stream classification: a new ensemble algorithm on data stream classification using recurring concepts detection. In: Proceedings of the IEEE 11th international conference on data mining workshops (ICDMW), IEEE, pp 588–595
Hosseini MJ, Ahmadi Z, Beigy H (2012) New management operations on classifiers pool to track recurring concepts. In: Proceedings of the 14th international conference on data warehousing and knowledge discovery (DaWaK), Springer, pp 327–339
Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), ACM, pp 97–106
Jaber G, Cornuéjols A, Tarroux P (2013) Online learning: searching for the best forgetting strategy under concept drift. In: Proceedings of the 20th international conference neural information processing (ICONIP), Springer, pp 400–408
Kalnis P, Mamoulis N, Bakiras S (2005) On discovering moving clusters in spatio–temporal data. In: Proceedings of the 9th international symposium on advances in spatial and temporal databases (SSTD), Springer, pp 364–381
Katakis I, Tsoumakas G, Vlahavas I (2010) Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowl Inf Syst 22(3):371–391
Kolter JZ, Maloof MA (2005) Using additive expert ensembles to cope with concept drift. In: Proceedings of the 22nd international conference on machine learning (ICML), ACM, pp 449–456
Kolter JZ, Maloof MA (2007) Dynamic weighted majority: an ensemble method for drifting concepts. J Mach Learn Res 8:2755–2790
Krempl G, Zliobaite I, Brzeziński D, Hüllermeier E, Last M, Lemaire V, Noack T, Shaker A, Sievi S, Spiliopoulou M, Stefanowski J (2014) Open challenges for data stream mining research. ACM SIGKDD Explor Newslett 16(1):1–10
Kuncheva LI (2004) Classifier ensembles for changing environments. In: Proceedings of the 5th international workshop on multiple classifier systems (MCS), Springer, pp 1–15
Lazarescu M (2005) A multi-resolution learning approach to tracking concept drift and recurrent concepts. In: Proceedings of the 5th international workshop on pattern recognition in information systems (PRIS), pp 52–61
Lewandowski D, Kurowicka D, Joe H (2009) Generating random correlation matrices based on vines and extended onion method. J Multivar Anal 100(9):1989–2001
Littlestone N, Warmuth MK (1994) The weighted majority algorithm. Inf Comput 108(2):212–261
Masud MM, Chen Q, Khan L, Aggarwal C, Gao J, Han J, Thuraisingham B (2010) Addressing concept-evolution in concept-drifting data streams. In: Proceedings of the IEEE 10th international conference on data mining (ICDM), IEEE, pp 929–934
Minku LL, Yao X (2012) DDD: a new ensemble approach for dealing with concept drift. IEEE Trans Knowl Data Eng 24(4):619–633
Muirhead RJ (2009) Aspects of multivariate statistical theory, vol 197. Wiley, Hoboken
Nishida K, Yamauchi K, Omori T (2005) ACE: adaptive classifiers-ensemble system for concept-drifting environments. In: Proceedings of the 6th international workshop on multiple classifier systems (MCS), Springer, pp 176–185
Ntoutsi I, Spiliopoulou M, Theodoridis Y (2009) Tracing cluster transitions for different cluster types. Control Cybern 38(1):239–259
Oliveira MDB, Gama J (2010) MEC—monitoring clusters’ transitions. In: Proceedings of the fifth starting AI researchers’ symposium (STAIRS), pp 212–224
Oza NC (2005) Online bagging and boosting. IEEE Int Conf Syst Man Cybern 3:2340–2345
Oza NC, Russell S (2001) Experimental comparisons of online and batch versions of bagging and boosting. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), ACM, pp 359–364
Ramamurthy S, Bhatnagar R (2007) Tracking recurrent concept drift in streaming data using ensemble classifiers. In: Proceedings of the sixth international conference on machine learning and applications (ICMLA), IEEE, pp 404–409
Sakthithasan S, Pears R, Bifet A, Pfahringer B (2015) Use of ensembles of Fourier spectra in capturing recurrent concepts in data streams. In: Proceedings of the international joint conference on neural networks (IJCNN), pp 1–8
Spiliopoulou M, Ntoutsi I, Theodoridis Y, Schult R (2006) MONIC: modeling and monitoring cluster transitions. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), ACM, pp 706–711
Street WN, Kim Y (2001) A streaming ensemble algorithm (SEA) for large-scale classification. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), ACM, pp 377–382
Wang H, Fan W, Yu PS, Han J (2003) Mining concept-drifting data streams using ensemble classifiers. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), ACM, pp 226–235
Webb GI, Hyde R, Cao H, Nguyen HL, Petitjean F (2016) Characterizing concept drift. Data Min Knowl Discov 30(4):964–994
Widmer G, Kubat M (1996) Learning in the presence of concept drift and hidden contexts. Mach Learn 23(1):69–101
Yang Y, Wu X, Zhu X (2006) Mining in anticipation for concept change: proactive–reactive prediction in data streams. Data Min Knowl Discov 13(3):261–289
Zliobaite I, Pechenizkiy M, Gama J (2016) An overview of concept drift applications. In: Japkowicz N, Stefanowski J (eds) Big data analysis: new algorithms for a new society. Springer, pp 91–114
Cite this article
Ahmadi, Z., Kramer, S. Modeling recurring concepts in data streams: a graph-based framework. Knowl Inf Syst 55, 15–44 (2018). https://doi.org/10.1007/s10115-017-1070-0
DOI: https://doi.org/10.1007/s10115-017-1070-0