Abstract
Recent approaches to classifying data streams are mostly based on supervised learning algorithms, which can only be trained with labeled data. Manual labeling of data is both costly and time-consuming. Therefore, in a real streaming environment where large volumes of data arrive at high speed, only a small fraction of the data can be labeled. Thus, only a limited number of instances will be available for training and updating the classification models, leading to poorly trained classifiers. We apply a novel technique to overcome this problem by utilizing both unlabeled and labeled instances to train and update the classification model. Each classification model is built as a collection of micro-clusters using semi-supervised clustering, and an ensemble of these models is used to classify unlabeled data. Empirical evaluation on both synthetic and real data reveals that our approach outperforms state-of-the-art stream classification algorithms that use ten times more labeled data than our approach.
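The idea of a model as a collection of micro-clusters, with an ensemble of such models voting on each instance, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the semi-supervised clustering step is assumed to have already produced cluster assignments, and the names `MicroCluster`, `build_model`, `classify_with_model`, and `classify_with_ensemble`, as well as the nearest-centroid rule and the unweighted majority vote, are assumptions made for illustration only.

```python
from collections import Counter
from dataclasses import dataclass
from math import dist


@dataclass
class MicroCluster:
    centroid: tuple        # mean of the points summarized by this micro-cluster
    label_counts: Counter  # class frequencies among its labeled points (may be empty)

    def majority_label(self):
        # None when the micro-cluster holds only unlabeled points.
        if not self.label_counts:
            return None
        return self.label_counts.most_common(1)[0][0]


def build_model(points, labels, assignments):
    """Summarize one clustered data chunk as a list of micro-clusters.

    labels[i] is None for an unlabeled point. assignments[i] is the cluster id
    from a semi-supervised clustering step that is assumed, not shown.
    """
    groups = {}
    for point, label, cluster_id in zip(points, labels, assignments):
        groups.setdefault(cluster_id, []).append((point, label))
    model = []
    for members in groups.values():
        pts = [p for p, _ in members]
        centroid = tuple(sum(col) / len(pts) for col in zip(*pts))
        counts = Counter(y for _, y in members if y is not None)
        model.append(MicroCluster(centroid, counts))
    return model


def classify_with_model(model, x):
    # Assumed rule: the nearest micro-cluster that has labels casts the vote.
    labeled = [mc for mc in model if mc.label_counts]
    if not labeled:
        return None
    nearest = min(labeled, key=lambda mc: dist(mc.centroid, x))
    return nearest.majority_label()


def classify_with_ensemble(ensemble, x):
    # Assumed rule: unweighted majority vote over the models' predictions.
    votes = Counter(
        v for v in (classify_with_model(m, x) for m in ensemble) if v is not None
    )
    return votes.most_common(1)[0][0] if votes else None


# Two toy chunks; only some points carry labels.
chunk_a = build_model(
    [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)],
    ["neg", None, "pos", None],
    [0, 0, 1, 1],
)
chunk_b = build_model(
    [(0.1, -0.1), (4.8, 5.2), (5.3, 5.1)],
    ["neg", "pos", None],
    [0, 1, 1],
)
ensemble = [chunk_a, chunk_b]
prediction_far = classify_with_ensemble(ensemble, (4.9, 5.0))
prediction_near = classify_with_ensemble(ensemble, (0.1, 0.0))
```

In this sketch the unlabeled points still shape each centroid, which is the sense in which unlabeled data contributes to the model; how the actual method weights models or propagates labels is not specified here.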
Cite this article
Masud, M.M., Woolam, C., Gao, J. et al. Facing the reality of data stream classification: coping with scarcity of labeled data. Knowl Inf Syst 33, 213–244 (2012). https://doi.org/10.1007/s10115-011-0447-8
DOI: https://doi.org/10.1007/s10115-011-0447-8