
Hybrid Random Forests: Advantages of Mixed Trees in Classifying Text Data

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7301)

Included in the following conference series:

  • PAKDD: Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Random forests are a popular classification method based on an ensemble of a single type of decision tree. The literature describes many different types of decision tree algorithm, including C4.5, CART and CHAID, and each type may capture different information and structure from the data. In this paper, we propose a novel random forest algorithm, called a hybrid random forest, which combines multiple types of decision trees into a single forest and exploits the diversity of the trees to enhance the resulting model. We conducted a series of experiments on six text classification datasets to compare our method with traditional random forest methods and other text categorization methods. The results show that our method consistently outperforms the compared methods.
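Since only the abstract is available here, the sketch below is an illustrative reconstruction of the general idea rather than the authors' implementation: it grows a forest whose members are drawn from a pool of different tree learners, each trained on a bootstrap sample and a random feature subspace, and combines their predictions by majority vote. Because scikit-learn ships only CART-style trees, the "gini" and "entropy" split criteria stand in for the C4.5/CART/CHAID mix described in the paper, and the function names fit_hybrid_forest and predict_hybrid_forest, along with all parameter defaults, are hypothetical.

```python
# Illustrative sketch only (not the authors' code): a "hybrid" forest that
# mixes trees grown by different induction algorithms. Assumes a dense
# NumPy feature matrix X and non-negative integer class labels y.
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def fit_hybrid_forest(X, y, n_trees=100, feature_frac=0.5, seed=0):
    """Grow a forest whose members cycle through a pool of tree types."""
    rng = np.random.default_rng(seed)
    # Pool of base learners; scikit-learn only offers CART-style trees, so
    # different split criteria stand in for the C4.5/CART/CHAID mix.
    base_learners = [
        DecisionTreeClassifier(criterion="gini"),     # CART-like splits
        DecisionTreeClassifier(criterion="entropy"),  # information-gain splits
    ]
    n_samples, n_features = X.shape
    k = max(1, int(feature_frac * n_features))
    forest = []
    for i in range(n_trees):
        tree = clone(base_learners[i % len(base_learners)])
        rows = rng.integers(0, n_samples, n_samples)          # bootstrap sample
        cols = rng.choice(n_features, size=k, replace=False)  # random subspace
        tree.fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def predict_hybrid_forest(forest, X):
    """Combine the mixed ensemble by majority vote."""
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in forest])
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), axis=0, arr=votes)
```

A caller would pass a numeric document-term matrix and integer labels, e.g. forest = fit_hybrid_forest(X_train, y_train) followed by y_pred = predict_hybrid_forest(forest, X_test); the point of the mixed pool is that trees built with different splitting strategies make less correlated errors than a forest of identical trees.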




Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Xu, B., Huang, J.Z., Williams, G., Li, M.J., Ye, Y. (2012). Hybrid Random Forests: Advantages of Mixed Trees in Classifying Text Data. In: Tan, P.N., Chawla, S., Ho, C.K., Bailey, J. (eds.) Advances in Knowledge Discovery and Data Mining. PAKDD 2012. Lecture Notes in Computer Science, vol. 7301. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30217-6_13

  • DOI: https://doi.org/10.1007/978-3-642-30217-6_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-30216-9

  • Online ISBN: 978-3-642-30217-6

  • eBook Packages: Computer Science; Computer Science (R0)
