Abstract
This paper presents a word embedding-based approach to text classification. We introduce a new vector space model, the Semantically-Augmented Statistical Vector Space Model (SAS-VSM), which combines a statistical VSM with a semantic VSM for information access systems, especially automatic text classification. In the SAS-VSM, we first implement a primary approach that concatenates continuous-valued semantic features with an existing statistical VSM. We then introduce the Centroid-Means-Embedding (CME) method, which updates existing statistical feature vectors with semantic knowledge. Experimental results with Support Vector Machine (SVM) classifiers on the 20 Newsgroups and RCV1-v2/LYRL2004 datasets show that the proposed CME-based SAS-VSM approaches compare favorably with the different weighting approaches. Our approach outperformed the other approaches in both micro-F\(_1\) and categorical performance.
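The primary concatenation approach mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the semantic features are built, so this example assumes a count-weighted mean of word embeddings per document; the vocabulary, embedding values, and the helper names `tfidf` and `augment` are hypothetical. It is not an implementation of the CME method, whose details are not given here.

```python
import numpy as np

# Toy vocabulary with hypothetical 3-dimensional word embeddings (made-up values).
vocab = ["ball", "game", "court", "law"]
embeddings = np.array([
    [0.9, 0.1, 0.0],  # ball
    [0.8, 0.2, 0.1],  # game
    [0.3, 0.7, 0.2],  # court
    [0.0, 0.9, 0.4],  # law
])

def tfidf(counts):
    """Plain TF-IDF: raw term frequency times log(N / document frequency)."""
    n_docs = counts.shape[0]
    df = np.count_nonzero(counts, axis=0)
    idf = np.log(n_docs / np.maximum(df, 1))
    return counts * idf

def augment(stat_vectors, counts, embeddings):
    """Concatenate each statistical vector with a semantic vector.

    Assumption: the semantic vector is the count-weighted mean of the
    embeddings of the words in the document.
    """
    totals = counts.sum(axis=1, keepdims=True)
    mean_emb = (counts @ embeddings) / np.maximum(totals, 1)
    return np.hstack([stat_vectors, mean_emb])

# Term counts for three toy documents (rows) over the vocabulary (columns).
counts = np.array([
    [2, 1, 0, 0],
    [0, 0, 1, 3],
    [1, 0, 1, 0],
], dtype=float)

stat = tfidf(counts)                           # statistical VSM, shape (3, 4)
features = augment(stat, counts, embeddings)   # augmented VSM, shape (3, 7)
```

Each row of `features` holds the original statistical weights followed by the dense semantic dimensions, and could then be passed to a classifier such as an SVM.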
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Sohrab, M.G., Miwa, M., Sasaki, Y. (2015). Centroid-Means-Embedding: An Approach to Infusing Word Embeddings into Features for Text Classification. In: Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D., Motoda, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2015. Lecture Notes in Computer Science, vol 9077. Springer, Cham. https://doi.org/10.1007/978-3-319-18038-0_23
DOI: https://doi.org/10.1007/978-3-319-18038-0_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18037-3
Online ISBN: 978-3-319-18038-0
eBook Packages: Computer Science (R0)