Parameter Estimation in Semi-Random Decision Tree Ensembling on Streaming Data

Li, Peipei; Liang, Qianhui; Wu, Xindong; Hu, Xuegang

doi:10.1007/978-3-642-01307-2_35

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5476)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

The induction error in random tree ensembling results mainly from two factors: the strength of the individual decision trees and the dependency between base classifiers. To reduce the error due to both factors, we propose Semi-Random Decision Tree Ensembling (SRDTE) for mining streaming data, based on our previous work on SRMTDS. The model consists of semi-random decision trees that are generated independently and do not interact with each other when making individual classification decisions. The main idea is to minimize correlation among the classifiers. We claim that the strength of the decision trees is closely related to the estimated values of the parameters, including the height of a tree, the number of trees, and the parameter n_min in the Hoeffding bounds. We analyze these parameters of the model and design strategies for better adaptation to streaming data. The main strategies include incremental generation of sub-trees after seeing real training instances, a data structure for quick search, and a voting mechanism for classification. Our evaluation under the 0-1 loss function shows that SRDTE improves performance in terms of predictive accuracy and robustness. We have applied SRDTE to e-business data streams and demonstrated its feasibility and effectiveness.
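
As a rough illustration of three ingredients named in the abstract (the Hoeffding bound, voting across trees, and evaluation under 0-1 loss), the following is a minimal sketch and not the authors' implementation. The function names are hypothetical, and the abstract does not say how n_min enters the bound; the sketch simply evaluates the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), at a given sample count n, and assumes plain unweighted majority voting.

```python
import math
from collections import Counter


def hoeffding_bound(value_range, delta, n):
    """Standard Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with range `value_range` lies within epsilon of the mean observed
    over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def majority_vote(predictions):
    """Return the label predicted by the most trees (assumed unweighted vote)."""
    return Counter(predictions).most_common(1)[0][0]


def zero_one_loss(y_true, y_pred):
    """Fraction of instances whose predicted label differs from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


if __name__ == "__main__":
    # The bound tightens as more instances are seen at a node.
    for n in (100, 400, 1600):
        print(n, hoeffding_bound(1.0, 0.05, n))

    # Three hypothetical trees vote on one instance.
    print(majority_vote(["buy", "skip", "buy"]))

    # Ensemble predictions scored under 0-1 loss.
    print(zero_one_loss([1, 0, 1, 1], [1, 1, 1, 0]))
```

Because the bound scales with 1/sqrt(n), quadrupling the number of observed instances halves epsilon, which is one reason the choice of sample-count parameters matters for how quickly a streaming tree can commit to a decision.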

References

Gama, J., Rocha, R., Medas, P.: Accurate decision trees for mining high-speed data streams. In: 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 523–528 (2003)
Fan, W., Wang, H., Yu, P.S., Ma, S.: Is random model better? On its accuracy and efficiency. In: 3rd IEEE International Conference on Data Mining, pp. 51–58 (2003)
Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
Gao, J., Fan, W., Han, J.: On Appropriate Assumptions to Mine Data Streams: Analysis and Practice. In: 7th IEEE International Conference on Data Mining, Omaha, Nebraska, USA, pp. 143–152 (2007)
Ho, T.K.: The Random Subspace Method for Constructing Decision Forests. IEEE Trans. Pattern Analysis and Machine Intelligence 20(8), 832–844 (1998)
Domingos, P., Hulten, G.: Mining High-Speed Data Streams. In: 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, pp. 71–80 (2000)
Hoeffding, W.: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58, 13–30 (1963)
Blake, C., Keogh, E., Merz, C.: UCI repository of machine learning databases (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html
Yahoo! shopping web services, http://developer.yahoo.com/everything.html
Hu, X., Li, P., Wu, X., Wu, G.: A Semi-Random Multiple Decision-Tree Algorithm for Mining Data Streams. Journal of Computer Science and Technology 22(5), 711–724 (2007)
Gehrke, J., Ganti, V., Ramakrishnan, R., Loh, W.-Y.: BOAT-optimistic decision tree construction. In: 1999 ACM SIGMOD International Conference on Management of Data, pp. 169–180 (1999)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
Shafer, J., Agrawal, R., Mehta, M.: SPRINT: A scalable parallel classifier for data mining. In: 22nd International Conference on Very Large Databases, pp. 544–555. Morgan Kaufmann, San Francisco (1996)
Fan, W.: On the Optimality of Probability Estimation by Random Decision Trees. In: 19th National Conference on Artificial Intelligence (AAAI 2004), pp. 336–341. AAAI Press, San Jose (2004)
Maron, O., Moore, A.W.: Hoeffding races: Accelerating model selection search for classification and function approximation. In: Cowan, J.D., Tesauro, G., Alspector, J. (eds.) Advances in Neural Information Processing Systems, pp. 59–66. Morgan Kaufmann, San Mateo (1994)

Author information

Authors and Affiliations

School of Information Systems, Singapore Management University, Singapore, 178902
Peipei Li & Qianhui Liang
School of Computer Science and Information Technology, Hefei University of Technology, China, 230009
Peipei Li, Xindong Wu & Xuegang Hu
Department of Computer Science, University of Vermont, USA, 05405
Xindong Wu

Editor information

Editors and Affiliations

Sirindhorn International Institute of Technology, Thammasat University, 131 Moo 5 Tiwanont Road, 12000, Bangkadi, Muang, Pathumthani, Thailand
Thanaruk Theeramunkong
Dept. of Computer Engineering, Faculty of Engineering, Chulalongkorn University, 10330, Bangkok, Thailand
Boonserm Kijsirikul
Faculty of Science & Engineering, York University, 355 Lumbers Building, 4700 Keele Street, M3J 1P3, Toronto, Ontario, Canada
Nick Cercone
School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, 923-1292, Ishikawa, Japan
Tu-Bao Ho

Cite this paper

Li, P., Liang, Q., Wu, X., Hu, X. (2009). Parameter Estimation in Semi-Random Decision Tree Ensembling on Streaming Data. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_35

DOI: https://doi.org/10.1007/978-3-642-01307-2_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01306-5
Online ISBN: 978-3-642-01307-2
eBook Packages: Computer Science, Computer Science (R0)
