Centroid-Means-Embedding: An Approach to Infusing Word Embeddings into Features for Text Classification

Sohrab, Mohammad Golam; Miwa, Makoto; Sasaki, Yutaka

doi:10.1007/978-3-319-18038-0_23

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9077)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

This paper presents a word embedding-based approach to text classification. We introduce a new vector space model, the Semantically-Augmented Statistical Vector Space Model (SAS-VSM), which combines a statistical VSM with a semantic VSM for information access systems, especially automatic text classification. In the SAS-VSM, we first implement a primary approach that concatenates continuous-valued semantic features with an existing statistical VSM. We then introduce the Centroid-Means-Embedding (CME) method, which updates existing statistical feature vectors with semantic knowledge. Experiments on the 20 Newsgroups and RCV1-v2/LYRL2004 datasets, using Support Vector Machine (SVM) classifiers, show that the proposed CME-based SAS-VSM approaches are promising compared with the different weighting approaches. Our approach outperformed the other approaches in both micro-F\(_1\) and categorical performance.
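
The primary approach mentioned in the abstract, concatenating continuous-valued semantic features with a statistical VSM, can be illustrated with a minimal sketch. This is not the authors' implementation: the use of TF-IDF as the statistical weighting, the use of a mean word embedding as the document-level semantic feature, the toy embedding values, and all function names are assumptions made for illustration. The CME update itself is not sketched, since the abstract does not specify it.

```python
import math
from collections import Counter

# Toy 2-dimensional word embeddings (hypothetical values, illustration only).
EMBEDDINGS = {
    "ball": [1.0, 0.0],
    "game": [0.8, 0.2],
    "vote": [0.0, 1.0],
    "law": [0.2, 0.8],
}

def tfidf_vectors(docs, vocab):
    """Return one TF-IDF vector per tokenized document (the statistical VSM)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            [counts[t] * math.log(n_docs / df[t]) if df[t] else 0.0 for t in vocab]
        )
    return vectors

def mean_embedding(doc, embeddings, dim):
    """Average the embeddings of the tokens that have an embedding."""
    known = [embeddings[t] for t in doc if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def concatenate_features(docs, vocab, embeddings, dim):
    """Append the semantic (embedding) features to the statistical features."""
    stat = tfidf_vectors(docs, vocab)
    return [s + mean_embedding(d, embeddings, dim) for s, d in zip(stat, docs)]

docs = [["ball", "game", "game"], ["vote", "law"]]
vocab = sorted({t for d in docs for t in d})
features = concatenate_features(docs, vocab, EMBEDDINGS, dim=2)
```

Each resulting vector has `len(vocab) + dim` entries: the statistical weights first, followed by the semantic features. Such vectors could then be passed to an SVM classifier, as in the experiments the abstract describes.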

Author information

Authors and Affiliations

Faculty of Engineering, Toyota Technological Institute, 2-12-1 Hisakata, Tempaku-ku, Nagoya, 468-8511, Japan
Mohammad Golam Sohrab, Makoto Miwa & Yutaka Sasaki

Corresponding author

Correspondence to Mohammad Golam Sohrab.

Editor information

Editors and Affiliations

Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Tru Cao
Singapore Management University, Singapore, Singapore
Ee-Peng Lim
Nanjing University, Nanjing, China
Zhi-Hua Zhou
Japan Advanced Institute of Science and Technology, Nomi City, Japan
Tu-Bao Ho
University of Hong Kong, Hong Kong, Hong Kong SAR
David Cheung
Osaka University, Osaka, Japan
Hiroshi Motoda

About this paper

Cite this paper

Sohrab, M.G., Miwa, M., Sasaki, Y. (2015). Centroid-Means-Embedding: An Approach to Infusing Word Embeddings into Features for Text Classification. In: Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D., Motoda, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2015. Lecture Notes in Computer Science, vol 9077. Springer, Cham. https://doi.org/10.1007/978-3-319-18038-0_23

DOI: https://doi.org/10.1007/978-3-319-18038-0_23
Published: 17 April 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18037-3
Online ISBN: 978-3-319-18038-0
eBook Packages: Computer Science, Computer Science (R0)
