ABSTRACT
Today, we are witnessing rapid growth in Web resources that allow Internet users to express and share their ideas, opinions, and judgments on a variety of issues. Several classification approaches have been proposed for textual data, but all of them require the target classes to be labeled in advance. In practice, such labels are not available, because we do not know beforehand what information these opinions will convey. To overcome this constraint, clustering approaches such as K-means, HAC (hierarchical agglomerative clustering), or FCM (fuzzy C-means) can be exploited. In this paper, we present and compare these approaches. To show the importance of clustering algorithms for classifying and analyzing Arabic textual data, we apply them to a real case that has generated great debate in Morocco: the case of teachers on contract with the academies.
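As a rough illustration of how clustering groups texts without any predefined labels, the following minimal sketch runs K-means on a handful of toy documents. It is not the authors' implementation. The documents are hypothetical English placeholders standing in for already preprocessed Arabic tokens, the raw term-count vectors are a simplification, and the farthest-first initialization is an illustrative choice made only to keep the run deterministic.

```python
from collections import Counter


def vectorize(docs):
    """Turn whitespace-tokenized documents into raw term-count vectors."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.split())
        vectors.append([float(counts[term]) for term in vocab])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain K-means; returns one cluster index per input point."""
    # Assumption: deterministic farthest-first seeding instead of random seeds.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return labels


# Hypothetical toy corpus: no labels are supplied anywhere.
docs = [
    "teachers contract academy strike",
    "teachers contract strike protest",
    "contract academy teachers",
    "football match goal team",
    "team goal football league",
    "match league football",
]

labels = kmeans(vectorize(docs), k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The two groups emerge from the term overlap alone; the number of clusters `k` still has to be chosen by the analyst, which is a separate question from labeling.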
Index Terms
- K-means, HAC and FCM Which Clustering Approach for Arabic Text?