DOI: 10.1145/3419604.3419779
research-article

K-means, HAC and FCM Which Clustering Approach for Arabic Text?

Published: 08 November 2020

ABSTRACT

Today, we are witnessing rapid growth in Web resources that allow Internet users to express and share their ideas, opinions, and judgments on a variety of issues. Several classification approaches have been proposed to classify textual data, but all of them require the target classes to be labeled in advance. In practice, such labels are not available, because we do not know beforehand what information these opinions will contain. To overcome this constraint, clustering approaches such as K-means, HAC, or FCM can be exploited. In this paper, we present and compare these approaches and show the importance of exploiting clustering algorithms to classify and analyze textual data in Arabic, by applying them to a real case that has sparked great debate in Morocco: the case of teachers contracting with academies.
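
As a rough illustration of how two of the approaches named above differ, the following minimal Python sketch contrasts the hard assignments of K-means with the graded membership degrees of FCM. It is not the authors' implementation: the toy 2-D vectors merely stand in for document feature vectors, and the function names, the initial centroids, and the fuzzifier `m = 2.0` are all assumptions made for this example. HAC is left out for brevity.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, initial_centroids, iterations=10):
    """Lloyd-style K-means: hard-assign each point, then recompute centroids."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels, centroids


def fcm_memberships(points, centroids, m=2.0):
    """Fuzzy C-means membership degrees for fixed centroids (fuzzifier m > 1)."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        # Clamp distances to avoid division by zero when a point sits on a centroid.
        d = [max(euclidean(p, c), 1e-12) for c in centroids]
        rows.append([
            1.0 / sum((d[j] / d[k]) ** exponent for k in range(len(d)))
            for j in range(len(d))
        ])
    return rows


# Hypothetical toy data: five "documents" described by two features each.
points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (0.6, 0.4)]

# Assumed initialization: the first and third points serve as starting centroids.
labels, centroids = kmeans(points, [points[0], points[2]])
memberships = fcm_memberships(points, centroids)

for p, label, row in zip(points, labels, memberships):
    print(p, "k-means cluster:", label, "FCM memberships:", [round(u, 3) for u in row])
```

In this toy setting, K-means gives the in-between point `(0.6, 0.4)` the same label as the points close to the first centroid, while its FCM membership in that cluster is noticeably lower than theirs. That graded view is what can make fuzzy clustering attractive for opinions that touch on several topics at once.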

    • Published in

      ACM Other conferences
      SITA'20: Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications
      September 2020
      333 pages
      ISBN: 9781450377331
      DOI: 10.1145/3419604

      Copyright © 2020 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • research-article
      • Research
      • Refereed limited
