
Improving semi-supervised co-forest algorithm in evolving data streams

Published in: Applied Intelligence

Abstract

Semi-supervised learning, which exploits a large amount of unlabeled data to improve classifier performance when only limited labeled data are available, has recently become a hot topic in machine learning research. In this paper, we propose a semi-supervised ensemble-of-classifiers approach for learning in time-varying data streams. The algorithm preserves all the desirable properties of the semi-supervised Co-trained random FOREST algorithm (Co-Forest) and extends it to evolving data streams. It assigns each incoming example a weight drawn from Poisson(1) to simulate bootstrap sampling in the stream setting, which maintains the diversity of the random forest. By exploiting incremental learning, it avoids unnecessary repeated training and improves the accuracy of the base models. In addition, ADaptive WINdowing (ADWIN2) is introduced to handle concept drift, allowing the ensemble to adapt to a changing environment. Empirical evaluation on both synthetic data and UCI datasets shows that the proposed method outperforms state-of-the-art semi-supervised and supervised methods on time-varying data streams, and also achieves competitive performance on stationary streams.
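The two stream-specific mechanisms named in the abstract can be illustrated in isolation. Below is a minimal sketch, not the authors' implementation: `poisson1` draws the per-example weight that simulates bootstrap sampling online (as in Oza and Russell's online bagging), and `SimpleAdwin` is a deliberately simplified stand-in for ADWIN2 that drops the stale prefix of a window of recent errors whenever two subwindows differ by more than a Hoeffding-style bound. The class and parameter names here are hypothetical.

```python
import math
import random


def poisson1() -> int:
    """Draw k ~ Poisson(1) by inversion: the weight given to an incoming
    example, simulating how often it would appear in a bootstrap sample."""
    k = 0
    p = math.exp(-1.0)   # P(K = 0) for lambda = 1
    s = p                # running CDF
    u = random.random()
    while u > s:
        k += 1
        p *= 1.0 / k     # P(K = k) = P(K = k-1) / k for lambda = 1
        s += p
    return k


class SimpleAdwin:
    """Simplified ADWIN-style detector (illustrative only): keeps a window
    of recent 0/1 error indicators and, at every split point, compares the
    subwindow means against a Hoeffding bound; on a significant difference
    it discards the outdated prefix and reports drift."""

    def __init__(self, delta: float = 0.002):
        self.delta = delta       # confidence parameter of the bound
        self.window: list = []

    def add(self, x: float) -> bool:
        self.window.append(x)
        drift = False
        changed = True
        while changed and len(self.window) >= 10:
            changed = False
            n = len(self.window)
            for i in range(5, n - 4):
                w0, w1 = self.window[:i], self.window[i:]
                # harmonic mean of the subwindow sizes
                m = 1.0 / (1.0 / len(w0) + 1.0 / len(w1))
                eps = math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 / self.delta))
                if abs(sum(w0) / len(w0) - sum(w1) / len(w1)) > eps:
                    self.window = w1  # drop the pre-drift prefix
                    drift = changed = True
                    break
        return drift
```

As a usage illustration, feeding the detector a stream whose error indicator jumps from 0 to 1 makes it discard the old prefix and signal drift, while the Poisson(1) weights average close to 1 over many draws, mirroring the expected multiplicity of an example in a bootstrap sample.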

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1

Similar content being viewed by others

References

  1. Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional spaces. In: Proceedings of the eighth international conference on database theory. Springer, pp 420–434

  2. Angiulli F, Fassetti F (2007) Detecting distance-based outliers in streams of data. In: Proceedings of the sixteenth ACM conference on information and knowledge management. ACM, pp 811–820

  3. Angluin D, Laird P (1988) Learning from noisy examples. Mach Learn 2(4):343–370

    Google Scholar 

  4. Bache K, Lichman M (2013) UCI machine learning repository

  5. Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful?. In: Proceedings of the seventh international conference on database theory. Springer, pp 217– 235

  6. Bifet A, Gavalda R (2007) Learning from time-changing data with adaptive windowing. In: Proceedings of the SIAM international conference on data mining. SIAM, pp 443–448

  7. Bifet A, Holmes G, Pfahringer B, Kirkby R, Gavaldà R (2009) New ensemble methods for evolving data streams. In: Proceedings of the fifteenth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 139–148

  8. Bifet A, Holmes G, Kirkby R, Pfahringer B (2010) Moa: massive online analysis. J Mach Learn Res 11(5):1601–1604

    Google Scholar 

  9. Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of the eleventh annual conference on computational learning theory. ACM, pp 92–100

  10. Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140

    MATH  Google Scholar 

  11. Breiman L (2001) Random forests. Mach Learn 45(1):5–32

    Article  MATH  Google Scholar 

  12. Brzezinski D, Stefanowski J (2014) Reacting to different types of concept drift: the accuracy updated ensemble algorithm. IEEE Trans Neural Netw Learn Syst 25(1):81–94

    Article  Google Scholar 

  13. Burchett J, Shankar M, Hamza AB, Guenther BD, Pitsianis N, Brady DJ (2006) Lightweight biometric detection system for human classification using pyroelectric infrared detectors. Appl Opt 45(13):3031–3037

    Article  Google Scholar 

  14. Cao L, Yang D, Wang Q, Yu Y, Wang J, Rundensteiner EA (2014) Scalable distance-based outlier detection over high-volume data streams. In: Proceedings of the thirtieth IEEE international conference on data engineering. IEEE, pp 76–87

  15. Chapelle O, Schölkopf B, Zien A (2006) Semi-Supervised Learning. MIT Press, Cambridge

    Book  Google Scholar 

  16. Chen WJ, Shao YH, Xu DK, Fu YF (2014) Manifold proximal support vector machine for semi-supervised classification. Appl Intell 40(4):623–638

    Article  Google Scholar 

  17. Dai Q (2013) A competitive ensemble pruning approach based on cross-validation technique. Knowl-Based Syst 37:394–414

    Article  Google Scholar 

  18. Dai Q, Song G (2016) A novel supervised competitive learning algorithm. Neurocomputing 191:356–362

    Article  Google Scholar 

  19. Dai Q, Ye R, Liu Z (2017) Considering diversity and accuracy simultaneously for ensemble pruning. Appl Soft Comput 58:75–91

    Article  Google Scholar 

  20. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Royal Stat Soc. Ser B (methodol) 39(1):1–38

    MathSciNet  MATH  Google Scholar 

  21. Domeniconi C, Gunopulos D (2001) Incremental support vector machine construction. In: Proceedings of the IEEE international conference on data mining. IEEE, pp 589–592

  22. Domingos P, Hulten G (2000) Mining high-speed data streams. In: Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 71–80

  23. Elwell R, Polikar R (2011) Incremental learning of concept drift in nonstationary environments. IEEE Trans Neural Netw 22(10):1517–1531

    Article  Google Scholar 

  24. Frinken V, Fischer A, Baumgartner M, Bunke H (2014) Keyword spotting for self-training of BLSTM NN based handwriting recognition systems. Pattern Recogn 47(3):1073–1082

    Article  Google Scholar 

  25. Fujino A, Ueda N (2016) A semi-supervised AUC optimization method with generative models. In: Proceedings of the sixteenth IEEE international conference on data mining. IEEE, pp 883–888

  26. Gama J, Rodrigues P (2009) An overview on mining data streams. Found Comput Intell 6:29–45

    Google Scholar 

  27. Gama J, żliobaitė I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv 46(4):44

    Article  MATH  Google Scholar 

  28. Hajmohammadi MS, Ibrahim R, Selamat A, Fujita H (2015) Combination of active learning and self-training for cross-lingual sentiment classification with density analysis of unlabelled samples. Inf Sci 317:67–77

    Article  Google Scholar 

  29. Haque A, Khan L, Baron M (2016) Sand: semi-supervised adaptive novel class detection and classification over data stream. In: Proceedings of the thirtieth AAAI conference on artificial intelligence. AAAI, pp 1652–1658

  30. He Y, Zhou D (2011) Self-training from labeled features for sentiment analysis. Inf Process Manag 47 (4):606–616

    Article  Google Scholar 

  31. Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Amer Stat Assoc 58 (301):13–30

    Article  MathSciNet  MATH  Google Scholar 

  32. Hosseini MJ, Gholipour A, Beigy H (2016) An ensemble of cluster-based classifiers for semi-supervised classification of non-stationary data streams. Knowl Inf Syst 46(3):567–597

    Article  Google Scholar 

  33. Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 97–106

  34. Iosifidis V, Ntoutsi E (2017) Large scale sentiment learning with limited labels. In: Proceedings of the twenty-third ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1823–1832

  35. Jiang B, Chen H, Yuan B, Yao X (2017) Scalable graph-based semi-supervised learning through sparse bayesian model. IEEE Trans Knowl Data Eng 29(12):2758–2771

    Article  Google Scholar 

  36. Joachims T (1999) Transductive inference for text classification using support vector machines. In: Proceedings of the sixteenth international conference on machine learning. ACM, pp 200–209

  37. Kale A, Ingle M (2015) Svm based feature extraction for novel class detection from streaming data. Int J Comput Appl 110(9):1–3

    Google Scholar 

  38. Khemchandani R, Chandra S et al (2007) Twin support vector machines for pattern classification. IEEE Trans Pattern Anal Mach Intell 29(5):905–910

    Article  MATH  Google Scholar 

  39. Kingma DP, Mohamed S, Rezende DJ, Welling M (2014) Semi-supervised learning with deep generative models. In: Proceedings of advances in neural information processing systems. MIT Press, pp 3581–3589

  40. Kourtellis N, Morales GDF, Bifet A, Murdopo A (2016) VHT: vertical hoeffding tree. In: Proceedings of IEEE international conference on big data. IEEE, pp 915–922

  41. Krawczyk B, Minku LL, Gama J, Stefanowski J, Woźniak M (2017) Ensemble learning for data stream analysis: a survey. Inf Fusion 37:132–156

    Article  Google Scholar 

  42. Li M, Zhou ZH (2007) Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. IEEE Trans Syst Man Cybern-Part A: Syst Hum 37(6):1088–1098

    Article  Google Scholar 

  43. Liu B, Xiao Y, Cao L (2017) Svm-based multi-state-mapping approach for multi-class classification. Knowl-Based Syst 129:79–96

    Article  Google Scholar 

  44. Maaløe L, Sønderby CK, Sønderby SK, Winther O (2015) Improving semi-supervised learning with auxiliary deep generative models. In: Proceedings of NIPS workshop on advances in approximate bayesian inference

  45. Masoumi M, Hamza AB (2017) Shape classification using spectral graph wavelets. Appl Intell 47(4):1256–1269

    Article  Google Scholar 

  46. Masud MM, Woolam C, Gao J, Khan L, Han J, Hamlen KW, Oza NC (2012) Facing the reality of data stream classification: coping with scarcity of labeled data. Knowl Inf Syst 33(1):213–244

    Article  Google Scholar 

  47. Mohebbi H, Mu Y, Ding W (2017) Learning weighted distance metric from group level information and its parallel implementation. Appl Intell 46(1):180–196

    Article  Google Scholar 

  48. Nguyen HL, Woon YK, Ng WK (2015) A survey on data stream clustering and classification. Knowl Inf Syst 45(3):535–569

    Article  Google Scholar 

  49. Nigam K, Ghani R (2000) Analyzing the effectiveness and applicability of co-training. In: Proceedings of the ninth international conference on information and knowledge management. ACM, pp 86–93

  50. Nigam K, McCallum AK, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39(2):103–134

    Article  MATH  Google Scholar 

  51. Oza NC (2005) Online bagging and boosting. In: Proceedings of IEEE international conference on systems, man and cybernetics. IEEE, pp 2340–2345

  52. Oza NC, Russell S (2001) Experimental comparisons of online and batch versions of bagging and boosting. In: Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 359–364

  53. Prakash VJ, Nithya DL (2014) A survey on semi-supervised learning techniques. Int J Comput Trends Technol 8(1):25–29

    Article  Google Scholar 

  54. Qi Z, Tian Y, Shi Y (2012) Laplacian twin support vector machine for semi-supervised classification. Neural Netw 35:46–53

    Article  MATH  Google Scholar 

  55. Rasmus A, Berglund M, Honkala M, Valpola H, Raiko T (2015) Semi-supervised learning with ladder networks. In: Proceedings of advances in neural information processing systems. MIT Press, pp 3546–3554

  56. Rutkowski L, Jaworski M, Pietruczuk L, Duda P (2014) The CART decision tree for mining data streams. Inf Sci 266:1–15

    Article  MATH  Google Scholar 

  57. Street WN, Kim Y (2001) A streaming ensemble algorithm (SEA) for large-scale classification. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 377–382

  58. Sun J, Fujita H, Chen P, Li H (2017) Dynamic financial distress prediction with concept drift based on time weighting combined with Adaboost support vector machine ensemble. Knowl-Based Syst 120:4–14

    Article  Google Scholar 

  59. Sun Y, Tang K, Minku LL, Wang S, Yao X (2016) Online ensemble learning of data streams with gradually evolved classes. IEEE Trans Knowl Data Eng 28(6):1532–1545

    Article  Google Scholar 

  60. Tsymbal A (2004) The problem of concept drift: definitions and related work. Technical Report TCDCS- 2004-15, Computer Science Department, Trinity College Dublin

  61. Witten IH, Frank E, Hall MA, Pal CJ (2016) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, Burlington

    Google Scholar 

  62. Xu S, Wang J (2016) A fast incremental extreme learning machine algorithm for data streams classification. Expert Syst Appl 65:332–344

    Article  Google Scholar 

  63. Zhang YM, Huang K, Geng GG, Liu CL (2015) MTC: a fast and robust graph-based transductive learning method. IEEE Trans Neural Netw Learn Syst 26(9):1979–1991

    Article  MathSciNet  Google Scholar 

  64. Zhao X, Evans N, Dugelay JL (2011) Semi-supervised face recognition with LDA self-training. In: Proceedings of eighteenth IEEE international conference on image processing. IEEE, pp 3041–3044

  65. Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B (2004) Learning with local and global consistency. In: Proceedings of advances in neural information processing systems. MIT Press, pp 321–328

  66. Zhou ZH, Wu J, Tang W (2002) Ensembling neural networks: many could be better than all. Artif Intell 137(1-2):239–263

    Article  MathSciNet  MATH  Google Scholar 

  67. Zhu QH, Wang ZZ, Mao XJ, Yang YB (2017) Spatial locality-preserving feature coding for image classification. Appl Intell 47(1):148–157

    Article  Google Scholar 

  68. Zhu X (2006) Semi-supervised learning literature survey. Comput Sci Univ Wis-Madison 2(3):4

    Google Scholar 

  69. Zhu X, Ghahramani Z, Lafferty JD (2003) Semi-supervised learning using gaussian fields and harmonic functions. In: Proceedings of the 20th international conference on machine learning. ACM, pp 912–919

Download references

Acknowledgements

This work was supported by the National Key Research and Development Program of China (Grant Nos. 2016YFB0800605 and 2016YFB0800604) and the Natural Science Foundation of China (Grant Nos. 61402308 and 61572334).

Author information

Correspondence to Yi Wang.


Cite this article

Wang, Y., Li, T. Improving semi-supervised co-forest algorithm in evolving data streams. Appl Intell 48, 3248–3262 (2018). https://doi.org/10.1007/s10489-018-1149-7

