
Improving semi-supervised co-forest algorithm in evolving data streams

Published in: Applied Intelligence

Abstract

Semi-supervised learning, which exploits a large amount of unlabeled data to improve classifier performance when only limited labeled data are available, has recently become a hot topic in machine learning research. In this paper, we propose a semi-supervised ensemble-of-classifiers approach for learning in time-varying data streams. The algorithm preserves all the desirable properties of the semi-supervised Co-trained random FOREST algorithm (Co-Forest) and extends it to evolving data streams. It assigns each incoming example a weight drawn from Poisson(1) to simulate bootstrap sampling in the stream setting, which maintains the diversity of the random forest. By exploiting incremental learning, it avoids unnecessary repeated training and improves the accuracy of the base models. In addition, ADaptive WINdowing (ADWIN2) is introduced to handle concept drift, allowing the ensemble to adapt to a changing environment. Empirical evaluation on both synthetic data and UCI datasets shows that the proposed method outperforms state-of-the-art semi-supervised and supervised methods on time-varying data streams, and also achieves competitive performance on stationary streams.
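The two stream-specific mechanisms named in the abstract can be illustrated in isolation. Below is a minimal sketch, not the authors' implementation: `poisson1` draws the per-example weight that simulates bootstrap sampling online (as in Oza and Russell's online bagging), and `SimpleAdwin` is a deliberately simplified stand-in for ADWIN2 that drops the stale prefix of a window of recent errors whenever two subwindows differ by more than a Hoeffding-style bound. The class and parameter names here are hypothetical.

```python
import math
import random


def poisson1() -> int:
    """Draw k ~ Poisson(1) by inversion: the weight given to an incoming
    example, simulating how often it would appear in a bootstrap sample."""
    k = 0
    p = math.exp(-1.0)   # P(K = 0) for lambda = 1
    s = p                # running CDF
    u = random.random()
    while u > s:
        k += 1
        p *= 1.0 / k     # P(K = k) = P(K = k-1) / k for lambda = 1
        s += p
    return k


class SimpleAdwin:
    """Simplified ADWIN-style detector (illustrative only): keeps a window
    of recent 0/1 error indicators and, at every split point, compares the
    subwindow means against a Hoeffding bound; on a significant difference
    it discards the outdated prefix and reports drift."""

    def __init__(self, delta: float = 0.002):
        self.delta = delta       # confidence parameter of the bound
        self.window: list = []

    def add(self, x: float) -> bool:
        self.window.append(x)
        drift = False
        changed = True
        while changed and len(self.window) >= 10:
            changed = False
            n = len(self.window)
            for i in range(5, n - 4):
                w0, w1 = self.window[:i], self.window[i:]
                # harmonic mean of the subwindow sizes
                m = 1.0 / (1.0 / len(w0) + 1.0 / len(w1))
                eps = math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 / self.delta))
                if abs(sum(w0) / len(w0) - sum(w1) / len(w1)) > eps:
                    self.window = w1  # drop the pre-drift prefix
                    drift = changed = True
                    break
        return drift
```

As a usage illustration, feeding the detector a stream whose error indicator jumps from 0 to 1 makes it discard the old prefix and signal drift, while the Poisson(1) weights average close to 1 over many draws, mirroring the expected multiplicity of an example in a bootstrap sample.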

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1

Similar content being viewed by others

References

  1. Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional spaces. In: Proceedings of the eighth international conference on database theory. Springer, pp 420–434

  2. Angiulli F, Fassetti F (2007) Detecting distance-based outliers in streams of data. In: Proceedings of the sixteenth ACM conference on information and knowledge management. ACM, pp 811–820

  3. Angluin D, Laird P (1988) Learning from noisy examples. Mach Learn 2(4):343–370

    Google Scholar 

  4. Bache K, Lichman M (2013) UCI machine learning repository

  5. Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful?. In: Proceedings of the seventh international conference on database theory. Springer, pp 217– 235

  6. Bifet A, Gavalda R (2007) Learning from time-changing data with adaptive windowing. In: Proceedings of the SIAM international conference on data mining. SIAM, pp 443–448

  7. Bifet A, Holmes G, Pfahringer B, Kirkby R, Gavaldà R (2009) New ensemble methods for evolving data streams. In: Proceedings of the fifteenth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 139–148

  8. Bifet A, Holmes G, Kirkby R, Pfahringer B (2010) Moa: massive online analysis. J Mach Learn Res 11(5):1601–1604

    Google Scholar 

  9. Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of the eleventh annual conference on computational learning theory. ACM, pp 92–100

  10. Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140

    MATH  Google Scholar 

  11. Breiman L (2001) Random forests. Mach Learn 45(1):5–32

    Article  MATH  Google Scholar 

  12. Brzezinski D, Stefanowski J (2014) Reacting to different types of concept drift: the accuracy updated ensemble algorithm. IEEE Trans Neural Netw Learn Syst 25(1):81–94

    Article  Google Scholar 

  13. Burchett J, Shankar M, Hamza AB, Guenther BD, Pitsianis N, Brady DJ (2006) Lightweight biometric detection system for human classification using pyroelectric infrared detectors. Appl Opt 45(13):3031–3037

    Article  Google Scholar 

  14. Cao L, Yang D, Wang Q, Yu Y, Wang J, Rundensteiner EA (2014) Scalable distance-based outlier detection over high-volume data streams. In: Proceedings of the thirtieth IEEE international conference on data engineering. IEEE, pp 76–87

  15. Chapelle O, Schölkopf B, Zien A (2006) Semi-Supervised Learning. MIT Press, Cambridge

    Book  Google Scholar 

  16. Chen WJ, Shao YH, Xu DK, Fu YF (2014) Manifold proximal support vector machine for semi-supervised classification. Appl Intell 40(4):623–638

    Article  Google Scholar 

  17. Dai Q (2013) A competitive ensemble pruning approach based on cross-validation technique. Knowl-Based Syst 37:394–414

    Article  Google Scholar 

  18. Dai Q, Song G (2016) A novel supervised competitive learning algorithm. Neurocomputing 191:356–362

    Article  Google Scholar 

  19. Dai Q, Ye R, Liu Z (2017) Considering diversity and accuracy simultaneously for ensemble pruning. Appl Soft Comput 58:75–91

    Article  Google Scholar 

  20. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Royal Stat Soc. Ser B (methodol) 39(1):1–38

    MathSciNet  MATH  Google Scholar 

  21. Domeniconi C, Gunopulos D (2001) Incremental support vector machine construction. In: Proceedings of the IEEE international conference on data mining. IEEE, pp 589–592

  22. Domingos P, Hulten G (2000) Mining high-speed data streams. In: Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 71–80

  23. Elwell R, Polikar R (2011) Incremental learning of concept drift in nonstationary environments. IEEE Trans Neural Netw 22(10):1517–1531

    Article  Google Scholar 

  24. Frinken V, Fischer A, Baumgartner M, Bunke H (2014) Keyword spotting for self-training of BLSTM NN based handwriting recognition systems. Pattern Recogn 47(3):1073–1082

    Article  Google Scholar 

  25. Fujino A, Ueda N (2016) A semi-supervised AUC optimization method with generative models. In: Proceedings of the sixteenth IEEE international conference on data mining. IEEE, pp 883–888

  26. Gama J, Rodrigues P (2009) An overview on mining data streams. Found Comput Intell 6:29–45

    Google Scholar 

  27. Gama J, żliobaitė I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv 46(4):44

    Article  MATH  Google Scholar 

  28. Hajmohammadi MS, Ibrahim R, Selamat A, Fujita H (2015) Combination of active learning and self-training for cross-lingual sentiment classification with density analysis of unlabelled samples. Inf Sci 317:67–77

    Article  Google Scholar 

  29. Haque A, Khan L, Baron M (2016) Sand: semi-supervised adaptive novel class detection and classification over data stream. In: Proceedings of the thirtieth AAAI conference on artificial intelligence. AAAI, pp 1652–1658

  30. He Y, Zhou D (2011) Self-training from labeled features for sentiment analysis. Inf Process Manag 47 (4):606–616

    Article  Google Scholar 

  31. Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Amer Stat Assoc 58 (301):13–30

    Article  MathSciNet  MATH  Google Scholar 

  32. Hosseini MJ, Gholipour A, Beigy H (2016) An ensemble of cluster-based classifiers for semi-supervised classification of non-stationary data streams. Knowl Inf Syst 46(3):567–597

    Article  Google Scholar 

  33. Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 97–106

  34. Iosifidis V, Ntoutsi E (2017) Large scale sentiment learning with limited labels. In: Proceedings of the twenty-third ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1823–1832

  35. Jiang B, Chen H, Yuan B, Yao X (2017) Scalable graph-based semi-supervised learning through sparse bayesian model. IEEE Trans Knowl Data Eng 29(12):2758–2771

    Article  Google Scholar 

  36. Joachims T (1999) Transductive inference for text classification using support vector machines. In: Proceedings of the sixteenth international conference on machine learning. ACM, pp 200–209

  37. Kale A, Ingle M (2015) Svm based feature extraction for novel class detection from streaming data. Int J Comput Appl 110(9):1–3

    Google Scholar 

  38. Khemchandani R, Chandra S et al (2007) Twin support vector machines for pattern classification. IEEE Trans Pattern Anal Mach Intell 29(5):905–910

    Article  MATH  Google Scholar 

  39. Kingma DP, Mohamed S, Rezende DJ, Welling M (2014) Semi-supervised learning with deep generative models. In: Proceedings of advances in neural information processing systems. MIT Press, pp 3581–3589

  40. Kourtellis N, Morales GDF, Bifet A, Murdopo A (2016) VHT: vertical hoeffding tree. In: Proceedings of IEEE international conference on big data. IEEE, pp 915–922

  41. Krawczyk B, Minku LL, Gama J, Stefanowski J, Woźniak M (2017) Ensemble learning for data stream analysis: a survey. Inf Fusion 37:132–156

    Article  Google Scholar 

  42. Li M, Zhou ZH (2007) Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. IEEE Trans Syst Man Cybern-Part A: Syst Hum 37(6):1088–1098

    Article  Google Scholar 

  43. Liu B, Xiao Y, Cao L (2017) Svm-based multi-state-mapping approach for multi-class classification. Knowl-Based Syst 129:79–96

    Article  Google Scholar 

  44. Maaløe L, Sønderby CK, Sønderby SK, Winther O (2015) Improving semi-supervised learning with auxiliary deep generative models. In: Proceedings of NIPS workshop on advances in approximate bayesian inference

  45. Masoumi M, Hamza AB (2017) Shape classification using spectral graph wavelets. Appl Intell 47(4):1256–1269

    Article  Google Scholar 

  46. Masud MM, Woolam C, Gao J, Khan L, Han J, Hamlen KW, Oza NC (2012) Facing the reality of data stream classification: coping with scarcity of labeled data. Knowl Inf Syst 33(1):213–244

    Article  Google Scholar 

  47. Mohebbi H, Mu Y, Ding W (2017) Learning weighted distance metric from group level information and its parallel implementation. Appl Intell 46(1):180–196

    Article  Google Scholar 

  48. Nguyen HL, Woon YK, Ng WK (2015) A survey on data stream clustering and classification. Knowl Inf Syst 45(3):535–569

    Article  Google Scholar 

  49. Nigam K, Ghani R (2000) Analyzing the effectiveness and applicability of co-training. In: Proceedings of the ninth international conference on information and knowledge management. ACM, pp 86–93

  50. Nigam K, McCallum AK, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39(2):103–134

    Article  MATH  Google Scholar 

  51. Oza NC (2005) Online bagging and boosting. In: Proceedings of IEEE international conference on systems, man and cybernetics. IEEE, pp 2340–2345

  52. Oza NC, Russell S (2001) Experimental comparisons of online and batch versions of bagging and boosting. In: Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 359–364

  53. Prakash VJ, Nithya DL (2014) A survey on semi-supervised learning techniques. Int J Comput Trends Technol 8(1):25–29

    Article  Google Scholar 

  54. Qi Z, Tian Y, Shi Y (2012) Laplacian twin support vector machine for semi-supervised classification. Neural Netw 35:46–53

    Article  MATH  Google Scholar 

  55. Rasmus A, Berglund M, Honkala M, Valpola H, Raiko T (2015) Semi-supervised learning with ladder networks. In: Proceedings of advances in neural information processing systems. MIT Press, pp 3546–3554

  56. Rutkowski L, Jaworski M, Pietruczuk L, Duda P (2014) The CART decision tree for mining data streams. Inf Sci 266:1–15

    Article  MATH  Google Scholar 

  57. Street WN, Kim Y (2001) A streaming ensemble algorithm (SEA) for large-scale classification. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 377–382

  58. Sun J, Fujita H, Chen P, Li H (2017) Dynamic financial distress prediction with concept drift based on time weighting combined with Adaboost support vector machine ensemble. Knowl-Based Syst 120:4–14

    Article  Google Scholar 

  59. Sun Y, Tang K, Minku LL, Wang S, Yao X (2016) Online ensemble learning of data streams with gradually evolved classes. IEEE Trans Knowl Data Eng 28(6):1532–1545

    Article  Google Scholar 

  60. Tsymbal A (2004) The problem of concept drift: definitions and related work. Technical Report TCDCS- 2004-15, Computer Science Department, Trinity College Dublin

  61. Witten IH, Frank E, Hall MA, Pal CJ (2016) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, Burlington

    Google Scholar 

  62. Xu S, Wang J (2016) A fast incremental extreme learning machine algorithm for data streams classification. Expert Syst Appl 65:332–344

    Article  Google Scholar 

  63. Zhang YM, Huang K, Geng GG, Liu CL (2015) MTC: a fast and robust graph-based transductive learning method. IEEE Trans Neural Netw Learn Syst 26(9):1979–1991

    Article  MathSciNet  Google Scholar 

  64. Zhao X, Evans N, Dugelay JL (2011) Semi-supervised face recognition with LDA self-training. In: Proceedings of eighteenth IEEE international conference on image processing. IEEE, pp 3041–3044

  65. Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B (2004) Learning with local and global consistency. In: Proceedings of advances in neural information processing systems. MIT Press, pp 321–328

  66. Zhou ZH, Wu J, Tang W (2002) Ensembling neural networks: many could be better than all. Artif Intell 137(1-2):239–263

    Article  MathSciNet  MATH  Google Scholar 

  67. Zhu QH, Wang ZZ, Mao XJ, Yang YB (2017) Spatial locality-preserving feature coding for image classification. Appl Intell 47(1):148–157

    Article  Google Scholar 

  68. Zhu X (2006) Semi-supervised learning literature survey. Comput Sci Univ Wis-Madison 2(3):4

    Google Scholar 

  69. Zhu X, Ghahramani Z, Lafferty JD (2003) Semi-supervised learning using gaussian fields and harmonic functions. In: Proceedings of the 20th international conference on machine learning. ACM, pp 912–919

Download references

Acknowledgements

This work was supported by the National Key Research and Development Program of China (Grant Nos. 2016YFB0800605 and 2016YFB0800604) and the Natural Science Foundation of China (Grant Nos. 61402308 and 61572334).

Author information

Correspondence to Yi Wang.


Cite this article

Wang, Y., Li, T. Improving semi-supervised co-forest algorithm in evolving data streams. Appl Intell 48, 3248–3262 (2018). https://doi.org/10.1007/s10489-018-1149-7

