
Taxonomy-Based Feature Extraction for Document Classification, Clustering and Semantic Analysis

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2019)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13452))


Abstract

Extracting meaningful features from documents can prove critical for a variety of tasks such as classification, clustering and semantic analysis. However, traditional approaches to document feature extraction mainly rely on first-order word statistics, which are very high-dimensional and do not capture the semantics of the documents well. For this reason, in this paper we present a novel approach that extracts document features based on a combination of a constructed word taxonomy and a word embedding in vector space. The feature extraction consists of three main steps. First, a word embedding technique is used to map all the words in the vocabulary onto a vector space. Second, the words in the vocabulary are organised into a hierarchy of clusters (word clusters) by applying k-means hierarchically. Lastly, the individual documents are projected onto the word clusters based on a predefined set of keywords, leading to a compact representation as a mixture of keywords. The extracted features can be used for a number of tasks, including document classification and clustering as well as semantic analysis of the documents generated by specific individuals over time. For the experiments, we have employed a dataset of transcripts of phone calls between claim managers and clients collected by the Transport Accident Commission of the Victorian Government. The experimental results show that the proposed approach achieves comparable or higher accuracy than conventional feature extraction approaches, with a much more compact representation.
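The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy 2-D vectors in `EMBEDDING`, the farthest-point seeding, and the function names `kmeans`, `build_taxonomy` and `document_features` are all assumptions made for illustration. In particular, the abstract does not detail the keyword-based projection, so the last step is approximated here by a normalised histogram of a document's words over the word clusters.

```python
import math

# Step 1 (assumed): hypothetical 2-D "embeddings"; a real pipeline would
# learn these from a corpus with a word embedding technique.
EMBEDDING = {
    "car": (0.9, 0.1), "truck": (1.0, 0.2), "road": (0.8, 0.0),
    "doctor": (0.1, 0.9), "nurse": (0.0, 1.0), "injury": (0.2, 0.8),
}


def kmeans(words, k, iters=20):
    """Plain k-means over the embedded words; returns a list of k word lists."""
    # Deterministic farthest-point seeding keeps the toy example reproducible.
    centers = [EMBEDDING[words[0]]]
    while len(centers) < k:
        far = max(words, key=lambda w: min(math.dist(EMBEDDING[w], c) for c in centers))
        centers.append(EMBEDDING[far])
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for w in words:
            nearest = min(range(k), key=lambda i: math.dist(EMBEDDING[w], centers[i]))
            groups[nearest].append(w)
        # Recompute each center as the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(EMBEDDING[w][d] for w in g) / len(g) for d in range(len(c))) if g else c
            for g, c in zip(groups, centers)
        ]
    return groups


def build_taxonomy(words, k=2, depth=2, path=()):
    """Step 2: apply k-means recursively; maps each word to its path of cluster indices."""
    if depth == 0 or len(words) <= k:
        return {w: path for w in words}
    taxonomy = {}
    for i, group in enumerate(kmeans(words, k)):
        taxonomy.update(build_taxonomy(group, k, depth - 1, path + (i,)))
    return taxonomy


def document_features(tokens, taxonomy, level=1):
    """Step 3 (simplified): share of a document's known words in each cluster at `level`."""
    counts = {}
    for t in tokens:
        if t in taxonomy:  # out-of-vocabulary tokens are ignored
            key = taxonomy[t][:level]
            counts[key] = counts.get(key, 0) + 1
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}


taxonomy = build_taxonomy(sorted(EMBEDDING), k=2, depth=2)
features = document_features(["car", "nurse", "doctor", "the"], taxonomy, level=1)
print(features)
```

Each document is reduced to one weight per word cluster, so the feature vector has as many entries as there are clusters at the chosen level rather than one per vocabulary word. This is the sense in which such a representation is compact.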

S. Seifollahi—Currently working at Resolution Life (Australia). This work was performed while at the University of Technology Sydney.



Acknowledgement

This project has been funded by the Capital Markets Cooperative Research Centre and the Transport Accident Commission of Victoria. Acknowledgements and thanks to our industry partners David Attwood (Lead Operational Management and Data Research) and Bernie Kruger (Business Intelligence and Data Science Lead). This research has received ethics approval from University of Technology Sydney (UTS HREC REF NO. ETH16-0968).

Author information

Corresponding author

Correspondence to Sattar Seifollahi.

Copyright information

© 2023 Springer Nature Switzerland AG

About this paper

Cite this paper

Seifollahi, S., Piccardi, M. (2023). Taxonomy-Based Feature Extraction for Document Classification, Clustering and Semantic Analysis. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13452. Springer, Cham. https://doi.org/10.1007/978-3-031-24340-0_43

  • DOI: https://doi.org/10.1007/978-3-031-24340-0_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-24339-4

  • Online ISBN: 978-3-031-24340-0

  • eBook Packages: Computer Science, Computer Science (R0)
