Facing the reality of data stream classification: coping with scarcity of labeled data

Masud, Mohammad M.; Woolam, Clay; Gao, Jing; Khan, Latifur; Han, Jiawei; Hamlen, Kevin W.; Oza, Nikunj C.

doi:10.1007/s10115-011-0447-8

Regular Paper
Published: 20 November 2011

Volume 33, pages 213–244 (2012)

Knowledge and Information Systems

Mohammad M. Masud¹,
Clay Woolam¹,
Jing Gao²,
Latifur Khan¹,
Jiawei Han²,
Kevin W. Hamlen¹ &
Nikunj C. Oza³

Abstract

Recent approaches for classifying data streams are mostly based on supervised learning algorithms, which can only be trained with labeled data. Manual labeling of data is both costly and time-consuming. Therefore, in a real streaming environment where large volumes of data arrive at high speed, only a small fraction of the data can be labeled. As a result, only a limited number of instances are available for training and updating the classification models, which leads to poorly trained classifiers. We apply a novel technique to overcome this problem by utilizing both unlabeled and labeled instances to train and update the classification model. Each classification model is built as a collection of micro-clusters using semi-supervised clustering, and an ensemble of these models is used to classify unlabeled data. Empirical evaluation on both synthetic and real data reveals that our approach outperforms state-of-the-art stream classification algorithms that use ten times more labeled data than our approach.
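The idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it assumes seeded k-means (centroids initialized from the few labeled points, then refined on all points) as a stand-in for the semi-supervised clustering step, one micro-cluster per class, a majority vote across models, and an ensemble that simply keeps the most recent models. The names `mean`, `build_model`, `classify`, and `make_chunk` are hypothetical helpers introduced only for illustration.

```python
import math
from collections import Counter


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def build_model(points, labels, iters=10):
    """Summarize one chunk as labeled micro-clusters: [(centroid, label), ...]."""
    # Seed one centroid per class from the labeled points (labels[i] is None
    # for unlabeled points). Seeding by class is an assumption of this sketch.
    seeds = {}
    for p, y in zip(points, labels):
        if y is not None:
            seeds.setdefault(y, []).append(p)
    centroids = [mean(group) for group in seeds.values()]
    if not centroids:
        return []

    def nearest(p):
        return min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))

    # Refine the centroids with k-means over labeled AND unlabeled points.
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p)].append(p)
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]

    # Label each micro-cluster by majority vote of its labeled members.
    votes = [Counter() for _ in centroids]
    for p, y in zip(points, labels):
        if y is not None:
            votes[nearest(p)][y] += 1
    return [(c, v.most_common(1)[0][0]) for c, v in zip(centroids, votes) if v]


def classify(ensemble, x):
    """Each model votes with the label of its nearest micro-cluster."""
    ballot = Counter()
    for model in ensemble:
        if model:
            _, label = min(model, key=lambda mc: math.dist(x, mc[0]))
            ballot[label] += 1
    return ballot.most_common(1)[0][0] if ballot else None


def make_chunk(shift):
    """Toy chunk: two groups of four points, one labeled point per group."""
    square = [(0, 0), (0, 1), (1, 0), (1, 1)]
    a = [(x + shift, y + shift) for x, y in square]
    b = [(x + 10 + shift, y + 10 + shift) for x, y in square]
    return a + b, ["a", None, None, None, "b", None, None, None]


ensemble = []
for shift in (0.0, 0.5, 1.0):  # a slowly shifting toy stream
    pts, lbls = make_chunk(shift)
    ensemble.append(build_model(pts, lbls))
    ensemble = ensemble[-3:]  # keep the three most recent models (assumption)

print(classify(ensemble, (0.2, 0.3)), classify(ensemble, (10.4, 9.9)))
```

In this toy stream only two of the eight points in each chunk carry labels, yet the unlabeled points still shape the centroids that the ensemble later uses for prediction.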

Author information

Authors and Affiliations

Department of Computer Science, University of Texas at Dallas, Richardson, TX, 75080, USA
Mohammad M. Masud, Clay Woolam, Latifur Khan & Kevin W. Hamlen
Department of Computer Science, University of Illinois at Urbana Champaign, Urbana, IL, 61801, USA
Jing Gao & Jiawei Han
Intelligent Systems Division, NASA Ames Research Center, Moffett Field, CA, 94035, USA
Nikunj C. Oza

Corresponding author

Correspondence to Mohammad M. Masud.

About this article

Cite this article

Masud, M.M., Woolam, C., Gao, J. et al. Facing the reality of data stream classification: coping with scarcity of labeled data. Knowl Inf Syst 33, 213–244 (2012). https://doi.org/10.1007/s10115-011-0447-8

Received: 06 May 2009
Revised: 26 April 2011
Accepted: 22 October 2011
Published: 20 November 2011
Issue Date: October 2012
DOI: https://doi.org/10.1007/s10115-011-0447-8
