Centroid-Means-Embedding: An Approach to Infusing Word Embeddings into Features for Text Classification

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9077)

Abstract

This paper presents a word embedding-based approach to text classification. We introduce a new vector space model, the Semantically-Augmented Statistical Vector Space Model (SAS-VSM), which combines a statistical VSM with a semantic VSM for information access systems, especially automatic text classification. In the SAS-VSM, we first implement a primary approach that concatenates continuous-valued semantic features with an existing statistical VSM. We then introduce the Centroid-Means-Embedding (CME) method, which updates existing statistical feature vectors with semantic knowledge. Experimental results on the 20 Newsgroups and RCV1-v2/LYRL2004 datasets, using Support Vector Machine (SVM) classifiers, show that the proposed CME-based SAS-VSM approaches are promising compared with different weighting approaches. Our approach outperformed the other approaches in both micro-\(F_1\) and categorical performance.
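
The primary concatenation approach mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's CME method, whose formula the abstract does not give; it only shows one plausible reading of "concatenating continuous-valued semantic features with a statistical VSM". The toy corpus, the 3-dimensional embedding values, the particular TF-IDF variant, and all function names below are assumptions made for illustration.

```python
import math
from collections import Counter

import numpy as np

# Toy tokenized corpus and toy word embeddings (hypothetical values).
docs = [["cat", "sat", "mat"], ["dog", "sat", "log"]]
embeddings = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "sat": np.array([0.0, 1.0, 0.0]),
    "mat": np.array([0.0, 0.0, 1.0]),
    "dog": np.array([1.0, 1.0, 0.0]),
    "log": np.array([0.0, 1.0, 1.0]),
}
vocab = sorted({word for doc in docs for word in doc})


def tfidf_vector(doc, docs, vocab):
    """Statistical features: raw term frequency times a smoothed IDF (assumed variant)."""
    counts = Counter(doc)
    n_docs = len(docs)
    vec = np.zeros(len(vocab))
    for i, term in enumerate(vocab):
        doc_freq = sum(1 for d in docs if term in d)
        vec[i] = counts[term] * math.log(1 + n_docs / doc_freq)
    return vec


def centroid_embedding(doc, embeddings, dim):
    """Semantic features: mean of the embeddings of the words in the document."""
    vectors = [embeddings[w] for w in doc if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


def concat_features(doc, docs, vocab, embeddings, dim):
    """Concatenate the statistical vector with the semantic centroid."""
    return np.concatenate(
        [tfidf_vector(doc, docs, vocab), centroid_embedding(doc, embeddings, dim)]
    )


features = concat_features(docs[0], docs, vocab, embeddings, dim=3)
print(features.shape)
```

The resulting vector has one dimension per vocabulary term followed by one per embedding dimension, and could be passed to any standard classifier such as an SVM.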

Author information

Corresponding author

Correspondence to Mohammad Golam Sohrab.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Sohrab, M.G., Miwa, M., Sasaki, Y. (2015). Centroid-Means-Embedding: An Approach to Infusing Word Embeddings into Features for Text Classification. In: Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D., Motoda, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2015. Lecture Notes in Computer Science, vol 9077. Springer, Cham. https://doi.org/10.1007/978-3-319-18038-0_23

  • DOI: https://doi.org/10.1007/978-3-319-18038-0_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18037-3

  • Online ISBN: 978-3-319-18038-0

  • eBook Packages: Computer Science, Computer Science (R0)
