
Short Text Classification Using Semantic Random Forest

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8646)

Abstract

Traditional Random Forests perform worse on short texts than on standard-length texts. The degradation stems from the shortness and sparseness of short texts and their lack of contextual information. Existing solutions to these issues rely mainly on data enrichment, which can itself introduce noise. We propose a new approach that combines data enrichment with the introduction of semantics into Random Forests. Each short text is enriched with data semantically similar to its words. These data come from an external knowledge source organized into topics using the Latent Dirichlet Allocation (LDA) model. The learning process of the Random Forest is adapted to take semantic relations between words into account while building the trees. Tests on search snippets show that the new method significantly improves classification: accuracy increased by 34% compared to traditional Random Forests and by 20% compared to MaxEnt.
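
The enrichment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the topic-word probabilities below are invented and only stand in for the output of a Latent Dirichlet Allocation model trained on an external corpus, and the names `TOPICS`, `best_topic`, `enrich` and the `top_k` parameter are hypothetical. The rule used here (append the most probable words of the topic that best matches each word) is one assumed way to add "semantically similar" data.

```python
from typing import Dict, List, Optional

# ASSUMPTION: made-up topic-word probabilities standing in for an LDA model
# trained on an external knowledge source. Not taken from the paper.
TOPICS: Dict[str, Dict[str, float]] = {
    "sports": {"football": 0.30, "league": 0.25, "match": 0.20, "team": 0.15},
    "finance": {"bank": 0.35, "stock": 0.30, "market": 0.20, "loan": 0.10},
}


def best_topic(word: str, topics: Dict[str, Dict[str, float]]) -> Optional[str]:
    """Return the topic in which `word` is most probable, or None if unseen."""
    best, best_p = None, 0.0
    for name, dist in topics.items():
        p = dist.get(word, 0.0)
        if p > best_p:
            best, best_p = name, p
    return best


def enrich(text: str, topics: Dict[str, Dict[str, float]], top_k: int = 2) -> List[str]:
    """Append the top_k most probable words of each token's best topic.

    Hypothetical enrichment rule: tokens unknown to every topic are kept
    unchanged, and words already present are not added twice.
    """
    tokens = text.lower().split()
    enriched = list(tokens)
    for tok in tokens:
        topic = best_topic(tok, topics)
        if topic is None:
            continue
        # Rank the words of the chosen topic by decreasing probability.
        ranked = sorted(topics[topic], key=topics[topic].get, reverse=True)
        for w in ranked[:top_k]:
            if w not in enriched:
                enriched.append(w)
    return enriched


print(enrich("stock match", TOPICS))
```

In a full pipeline, the enriched token lists would then be vectorized and passed to a Random Forest learner. The semantic adaptation of the tree-building process mentioned in the abstract is not covered by this sketch.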

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Bouaziz, A., Dartigues-Pallez, C., da Costa Pereira, C., Precioso, F., Lloret, P. (2014). Short Text Classification Using Semantic Random Forest. In: Bellatreche, L., Mohania, M.K. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2014. Lecture Notes in Computer Science, vol 8646. Springer, Cham. https://doi.org/10.1007/978-3-319-10160-6_26

  • DOI: https://doi.org/10.1007/978-3-319-10160-6_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10159-0

  • Online ISBN: 978-3-319-10160-6

  • eBook Packages: Computer Science, Computer Science (R0)
