Short Text Classification Using Semantic Random Forest

Bouaziz, Ameni; Dartigues-Pallez, Christel; da Costa Pereira, Célia; Precioso, Frédéric; Lloret, Patrick

doi:10.1007/978-3-319-10160-6_26

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8646)

Abstract

Applying traditional Random Forests to short text classification reveals a performance degradation compared with their use on standard texts. The shortness, sparseness and lack of contextual information in short texts are the reasons for this degradation. Existing solutions to these issues are mainly based on data enrichment. However, data enrichment can also introduce noise. We propose a new approach that combines data enrichment with the introduction of semantics in Random Forests. Each short text is enriched with data semantically similar to its words. These data come from an external source of knowledge that is distributed into topics using the Latent Dirichlet Allocation model. The learning process in Random Forests is adapted to consider semantic relations between words while building the trees. Tests performed on search snippets using the new method showed significant improvements in classification. Accuracy increased by 34% compared to traditional Random Forests and by 20% compared to MaxEnt.
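The enrichment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the `TOPICS` table, its probabilities, and the helper names `best_topic` and `enrich` are all hypothetical, and the toy table merely stands in for topics that an LDA model would learn from an external corpus.

```python
# Hypothetical toy topics (word -> probability). In the approach described above
# these would come from an LDA model trained on an external knowledge source;
# the values here are invented purely for illustration.
TOPICS = {
    "sports": {"football": 0.30, "match": 0.25, "team": 0.25, "goal": 0.20},
    "finance": {"bank": 0.35, "stock": 0.30, "market": 0.20, "loan": 0.15},
}


def best_topic(word, topics):
    """Return the topic that gives `word` the highest probability, or None."""
    scored = [(dist[word], name) for name, dist in topics.items() if word in dist]
    return max(scored)[1] if scored else None


def enrich(text, topics, top_k=2):
    """Append up to `top_k` new topic words for each token that matches a topic."""
    tokens = text.lower().split()
    enriched = list(tokens)
    for tok in tokens:
        name = best_topic(tok, topics)
        if name is None:
            continue  # token unknown to every topic: nothing to add
        # Rank the topic's words by probability, most probable first.
        ranked = sorted(topics[name], key=topics[name].get, reverse=True)
        added = 0
        for word in ranked:
            if added == top_k:
                break
            if word not in enriched:  # skip words the text already contains
                enriched.append(word)
                added += 1
    return enriched


print(enrich("Stock market news", TOPICS))
```

The enriched token list could then be vectorized and passed to a classifier. The semantic adaptation of tree building mentioned in the abstract is not covered by this sketch.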

Author information

Authors and Affiliations

Laboratoire I3S (CNRS UMR-7271), Université Nice Sophia Antipolis, France
Ameni Bouaziz, Christel Dartigues-Pallez, Célia da Costa Pereira & Frédéric Precioso
Semantic Group Company, Paris, France
Patrick Lloret

Editor information

Editors and Affiliations

LIAS/ISAE-ENSMA, Téléport 2, 1 avenue Clément Ader, BP 40109, 86961, Futuroscope Chasseneuil Cedex, France
Ladjel Bellatreche
IBM Research - India, 4, Block-C, Institutional Area, 110070, Vasant Kunj, New Delhi, India
Mukesh K. Mohania

About this paper

Cite this paper

Bouaziz, A., Dartigues-Pallez, C., da Costa Pereira, C., Precioso, F., Lloret, P. (2014). Short Text Classification Using Semantic Random Forest. In: Bellatreche, L., Mohania, M.K. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2014. Lecture Notes in Computer Science, vol 8646. Springer, Cham. https://doi.org/10.1007/978-3-319-10160-6_26

DOI: https://doi.org/10.1007/978-3-319-10160-6_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10159-0
Online ISBN: 978-3-319-10160-6
eBook Packages: Computer Science, Computer Science (R0)
