GloSOPHIA: An Enhanced Textual Based Clustering Approach by Word Embeddings

  • Conference paper
Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2019 (AISI 2019)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1058)

Abstract

Textual case-based reasoning (TCBR) is a challenging problem because a single case may cover several topics and contain complex linguistic terms. Many efforts have been made to improve the retrieval process in TCBR using clustering methods. This paper proposes an enhanced clustering approach called GloSOPHIA (GloVe SOPHIA), which extends SOPHIA (SOPHisticated Information Analysis) by integrating a word embedding technique to enhance knowledge discovery in TCBR. To evaluate the quality of the proposed method, we apply GloSOPHIA to an Arabic newspaper corpus called watan-2004 and compare the results with SOPHIA, K-means, and Self-Organizing Map (SOM) using several evaluation criteria. The results show that GloSOPHIA outperforms the three other clustering methods on most of the evaluation criteria.

Author information

Correspondence to Ehab Terra, Ammar Mohammed or Hesham A. Hefny.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Terra, E., Mohammed, A., Hefny, H.A. (2020). GloSOPHIA: An Enhanced Textual Based Clustering Approach by Word Embeddings. In: Hassanien, A., Shaalan, K., Tolba, M. (eds) Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2019. AISI 2019. Advances in Intelligent Systems and Computing, vol 1058. Springer, Cham. https://doi.org/10.1007/978-3-030-31129-2_64
