Abstract
Random forests are a popular classification method based on an ensemble of a single type of decision tree. The literature describes many different decision tree algorithms, including C4.5, CART and CHAID. Each type of decision tree algorithm may capture different information and structure. In this paper, we propose a novel random forest algorithm, called a hybrid random forest. We combine multiple types of decision trees into one random forest and exploit the diversity of the trees to enhance the resulting model. We conducted a series of experiments on six text classification datasets to compare our method with traditional random forest methods and several other text categorization methods. The results show that our method consistently outperforms the compared methods.
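The idea of mixing tree types inside one forest can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the algorithm's details, so everything here is an assumption made for illustration. In particular, the three split criteria (Gini gain, gain ratio, chi-square) only loosely stand in for CART, C4.5 and CHAID, all trees use binary threshold splits, and the names `fit_hybrid_forest`, `build_tree` and `predict` are hypothetical.

```python
import math
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def _impurity_drop(impurity, parent, left, right):
    n = len(parent)
    return impurity(parent) - sum(len(s) / n * impurity(s) for s in (left, right))

def gini_gain(parent, left, right):       # stand-in for a CART-style split score
    return _impurity_drop(gini, parent, left, right)

def gain_ratio(parent, left, right):      # stand-in for a C4.5-style split score
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return _impurity_drop(entropy, parent, left, right) / split_info

def chi_square(parent, left, right):      # stand-in for a CHAID-style split score
    n, totals, stat = len(parent), Counter(parent), 0.0
    for side in (left, right):
        observed = Counter(side)
        for cls, count in totals.items():
            expected = len(side) * count / n
            stat += (observed[cls] - expected) ** 2 / expected
    return stat

CRITERIA = [gini_gain, gain_ratio, chi_square]

def _majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(X, y, score, n_features, rng, depth=0, max_depth=8):
    """Grow one tree; a leaf is a label, an inner node is (feature, threshold, left, right)."""
    if len(set(y)) == 1 or depth == max_depth:
        return _majority(y)
    best = None
    # Random subspace: only a random subset of features is tried at each node.
    for f in rng.sample(range(len(X[0])), n_features):
        for t in {row[f] for row in X}:
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            s = score(y, [y[i] for i in left], [y[i] for i in right])
            if best is None or s > best[0]:
                best = (s, f, t, left, right)
    if best is None:                      # no usable split among the sampled features
        return _majority(y)
    _, f, t, left, right = best
    children = [
        build_tree([X[i] for i in part], [y[i] for i in part],
                   score, n_features, rng, depth + 1, max_depth)
        for part in (left, right)
    ]
    return (f, t, children[0], children[1])

def predict_tree(node, row):
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def fit_hybrid_forest(X, y, n_trees=15, seed=0):
    """Bagging plus random subspaces, cycling through the split criteria tree by tree."""
    rng = random.Random(seed)
    n_features = max(1, int(math.sqrt(len(X[0]))))
    forest = []
    for k in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]          # bootstrap sample
        score = CRITERIA[k % len(CRITERIA)]               # mix tree types
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                 score, n_features, rng))
    return forest

def predict(forest, row):
    """Majority vote over all trees in the forest."""
    return _majority([predict_tree(tree, row) for tree in forest])

if __name__ == "__main__":
    # Toy term-count vectors: [ball, goal, vote, law] (made-up data).
    X = [[2, 1, 0, 0], [3, 2, 0, 0], [1, 1, 0, 0], [2, 3, 0, 0],
         [0, 0, 2, 1], [0, 0, 1, 3], [0, 0, 3, 2], [0, 0, 1, 1]]
    y = ["sports"] * 4 + ["politics"] * 4
    forest = fit_hybrid_forest(X, y)
    print(predict(forest, [4, 2, 0, 0]), predict(forest, [0, 0, 2, 4]))
```

Cycling through the criteria gives each tree type an equal share of the forest; other mixing ratios are equally plausible, and the sketch makes no claim about which mix the paper evaluates.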
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xu, B., Huang, J.Z., Williams, G., Li, M.J., Ye, Y. (2012). Hybrid Random Forests: Advantages of Mixed Trees in Classifying Text Data. In: Tan, PN., Chawla, S., Ho, C.K., Bailey, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2012. Lecture Notes in Computer Science, vol 7301. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30217-6_13
DOI: https://doi.org/10.1007/978-3-642-30217-6_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-30216-9
Online ISBN: 978-3-642-30217-6
eBook Packages: Computer Science (R0)