Abstract
Textual case-based reasoning (TCBR) is a challenging problem because a single case may cover several topics and contain complex linguistic terms. Many efforts have been made to enhance the retrieval process in TCBR using clustering methods. This paper proposes an enhanced clustering approach called GloSOPHIA (GloVe SOPHIA), which extends SOPHIA (SOPHisticated Information Analysis) by integrating a word-embedding technique to enhance knowledge discovery in TCBR. To evaluate the quality of the proposed method, we apply GloSOPHIA to an Arabic newspaper corpus called watan-2004 and compare the results with SOPHIA, K-means, and Self-Organizing Map (SOM) using several evaluation criteria. The results show that GloSOPHIA outperforms the three other clustering methods on most of the evaluation criteria.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Terra, E., Mohammed, A., Hefny, H.A. (2020). GloSOPHIA: An Enhanced Textual Based Clustering Approach by Word Embeddings. In: Hassanien, A., Shaalan, K., Tolba, M. (eds) Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2019. AISI 2019. Advances in Intelligent Systems and Computing, vol 1058. Springer, Cham. https://doi.org/10.1007/978-3-030-31129-2_64
DOI: https://doi.org/10.1007/978-3-030-31129-2_64
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-31128-5
Online ISBN: 978-3-030-31129-2
eBook Packages: Intelligent Technologies and Robotics (R0)