
K-means, HAC and FCM Which Clustering Approach for Arabic Text?

Authors:
Lahbib Ajallouda, ENSIAS, Mohammed V University in Rabat, Rabat, Morocco
Fatima Zahra Fagroud, LTIM - FSBM, Hassan II University, Casablanca, Morocco
Ahmed Zellou, ENSIAS, Mohammed V University in Rabat, Rabat, Morocco
El Habib Benlahmar, LTIM - FSBM, Hassan II University, Casablanca, Morocco

SITA'20: Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications, September 2020, Article No.: 29, Pages 1–8, https://doi.org/10.1145/3419604.3419779

Published: 08 November 2020

ABSTRACT

Today, we are witnessing rapid growth in Web resources that allow Internet users to express and share their ideas, opinions, and judgments on a variety of issues. Several classification approaches have been proposed for textual data, but all of them require the target classes to be labeled in advance. In practice, such labels are not available, because we do not know beforehand what information these opinions will contain. To overcome this constraint, clustering approaches such as K-means, HAC, or FCM can be exploited. In this paper, we present and compare these approaches, and we show the importance of exploiting clustering algorithms to classify and analyze textual data in Arabic by applying them to a real case that sparked a major debate in Morocco: the case of teachers contracting with academies.
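To make the difference between hard and fuzzy clustering concrete, here is a minimal sketch; it is not the authors' implementation. It assumes documents have already been turned into numeric vectors and that cluster centers are given. The two-dimensional points, the function names, and the fuzzifier value `m = 2.0` are all hypothetical choices for illustration. A K-means-style step assigns each document to its single nearest center, while an FCM-style step gives it a membership degree in every cluster.

```python
import math


def distances(point, centers):
    """Euclidean distance from one point to each cluster center."""
    return [math.dist(point, c) for c in centers]


def hard_assign(point, centers):
    """K-means-style assignment: index of the single nearest center."""
    d = distances(point, centers)
    return d.index(min(d))


def fuzzy_memberships(point, centers, m=2.0):
    """FCM-style assignment: one membership degree per center.

    Uses the standard fuzzy c-means membership update
    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)), with fuzzifier m > 1.
    """
    d = distances(point, centers)
    # A point lying exactly on a center belongs fully to that center.
    if 0.0 in d:
        return [1.0 if di == 0.0 else 0.0 for di in d]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** exponent for dj in d) for di in d]


# Hypothetical 2-D "document vectors" and centers, for illustration only.
centers = [(0.0, 0.0), (4.0, 0.0)]
doc = (1.0, 0.0)

print(hard_assign(doc, centers))        # index of the nearest center
print(fuzzy_memberships(doc, centers))  # degrees of membership, one per center
```

HAC differs from both in that it needs no centers at all: it repeatedly merges the two closest groups until the desired number of clusters remains.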

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning

Published in

SITA'20: Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications
September 2020
333 pages
ISBN:9781450377331
DOI:10.1145/3419604

Copyright © 2020 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Arabic Text Mining
Clustering
Fuzzy C-Means
HAC
K-means