Abstract
Arabic document clustering has become an increasingly important unsupervised learning task. This paper evaluates the impact of five similarity/distance measures (Cosine similarity, Jaccard coefficient, Pearson correlation, Euclidean distance and Averaged Kullback-Leibler Divergence) on document clustering under two types of pre-processing of an Arabic dataset: morphology-based stemming with the Information Science Research Institute (ISRI) stemmer, which is equivalent to the root-based stemmer and light stemmer, and no stemming (without morphology). Stemming is a computational process that reduces words to their stems. For classification, it is categorised as a recall-enhancing or precision-enhancing component. The paper concludes that ISRI stemming gives better document clustering results than no stemming when the five similarity/distance measures are used.
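The five measures named in the abstract can be illustrated with a minimal Python sketch over term-weight vectors of equal length. The function names are hypothetical, and the formulas (in particular the extended Jaccard coefficient and the weighted form of the averaged Kullback-Leibler divergence) follow formulations commonly used in the document clustering literature. They are an assumption here, not necessarily the exact definitions used in the paper.

```python
import math

def cosine(a, b):
    # Cosine of the angle between the two term vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    # Extended Jaccard coefficient for real-valued vectors (assumed form).
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0

def pearson(a, b):
    # Pearson correlation coefficient of the two weight vectors.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va and vb else 0.0

def euclidean(a, b):
    # Straight-line distance; smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def avg_kl(a, b):
    # Averaged KL divergence (assumed form): each term is weighted by its
    # share of the combined weight, which makes the measure symmetric.
    total = 0.0
    for x, y in zip(a, b):
        s = x + y
        if s == 0:
            continue  # term absent from both documents
        p1, p2 = x / s, y / s
        m = p1 * x + p2 * y  # weighted average weight of the term
        if x > 0:
            total += p1 * x * math.log(x / m)
        if y > 0:
            total += p2 * y * math.log(y / m)
    return total
```

Cosine, Jaccard and Pearson are similarities (larger means closer), whereas Euclidean distance and the averaged KL divergence are distances (smaller means closer), so a clustering algorithm would need to treat the two groups with opposite orientation.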
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bsoul, Q.W., Mohd, M. (2011). Effect of ISRI Stemming on Similarity Measure for Arabic Document Clustering. In: Salem, M.V.M., Shaalan, K., Oroumchian, F., Shakery, A., Khelalfa, H. (eds) Information Retrieval Technology. AIRS 2011. Lecture Notes in Computer Science, vol 7097. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25631-8_53
DOI: https://doi.org/10.1007/978-3-642-25631-8_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25630-1
Online ISBN: 978-3-642-25631-8
eBook Packages: Computer Science (R0)