Clustering with Probabilistic Topic Models on Arabic Texts

Kelaiaia, Abdessalem; Merouani, Hayet Farida

doi:10.1007/978-3-319-00560-7_11

Part of the book series: Studies in Computational Intelligence (SCI, volume 488)

Abstract

Recently, probabilistic topic models such as LDA (Latent Dirichlet Allocation) have been widely used in many text mining tasks, such as retrieval, summarization, and clustering, across different languages. In this paper we present a first comparative study of LDA and K-means, two well-known methods for topic identification and clustering respectively, applied to Arabic texts. Our aim is to compare the influence of the morpho-syntactic characteristics of the Arabic language on the performance of the first method relative to the second. To study different aspects of these methods, the study is conducted on a benchmark document collection, with clustering quality measured by two well-known evaluation measures, F-measure and Entropy. The results consistently show that LDA performs better than K-means in most cases.
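
The two evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes the definitions commonly used in document-clustering evaluation (size-weighted cluster entropy, and a class-weighted best-match F-measure); the chapter's exact variants are not visible in this preview, so the function names `clustering_entropy` and `clustering_f_measure`, the base-2 logarithm, and the toy labels are all illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter
from math import log2


def clustering_entropy(labels, clusters):
    """Size-weighted average of per-cluster class entropy (lower is better).

    Assumption: base-2 logarithm, no normalisation by the number of classes.
    """
    n = len(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(clusters, labels))  # n_ij: docs of class i in cluster j
    total = 0.0
    for (cluster, _label), n_ij in joint.items():
        p = n_ij / cluster_sizes[cluster]
        total -= (cluster_sizes[cluster] / n) * p * log2(p)
    return total


def clustering_f_measure(labels, clusters):
    """Class-weighted F-measure: each class is scored by its best-matching cluster."""
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(clusters, labels))
    best = {}  # best F value found so far for each class
    for (cluster, label), n_ij in joint.items():
        precision = n_ij / cluster_sizes[cluster]
        recall = n_ij / class_sizes[label]
        f = 2 * precision * recall / (precision + recall)
        best[label] = max(best.get(label, 0.0), f)
    return sum(class_sizes[label] / n * best[label] for label in class_sizes)


# Toy data (hypothetical): four documents, two true categories.
true_labels = ["sport", "sport", "economy", "economy"]
perfect = [0, 0, 1, 1]   # each cluster holds exactly one category
merged = [0, 0, 0, 0]    # everything lumped into a single cluster

print(clustering_entropy(true_labels, perfect), clustering_f_measure(true_labels, perfect))
print(clustering_entropy(true_labels, merged), clustering_f_measure(true_labels, merged))
```

Both functions take one true category and one cluster id per document, so they apply equally to K-means output and to LDA output once each document has a single cluster id. A common convention for LDA is to assign each document to its most probable topic; whether the chapter does so is not stated in the abstract.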

Author information

Authors and Affiliations

Computer Science Department, University of 8 May 1945, Guelma, Algeria
Abdessalem Kelaiaia
LRI Laboratory, Computer Science Department, University of Badji Mokhtar, Annaba, Algeria
Hayet Farida Merouani

Corresponding author

Correspondence to Abdessalem Kelaiaia.

Editor information

Editors and Affiliations

Département d’informatique, Centre Universitaire Taher Moula de Saida, Ennasr Saida, Algeria
Abdelmalek Amine
Department of Electrical and Computer Engineering, Concordia University, Montreal, Québec, Canada
Ait Mohamed Otmane
LIAS/ISAE-ENSMA, Futuroscope Chasseneuil Cedex, France
Ladjel Bellatreche

About this chapter

Cite this chapter

Kelaiaia, A., Merouani, H.F. (2013). Clustering with Probabilistic Topic Models on Arabic Texts. In: Amine, A., Otmane, A., Bellatreche, L. (eds) Modeling Approaches and Algorithms for Advanced Computer Applications. Studies in Computational Intelligence, vol 488. Springer, Cham. https://doi.org/10.1007/978-3-319-00560-7_11

DOI: https://doi.org/10.1007/978-3-319-00560-7_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-00559-1
Online ISBN: 978-3-319-00560-7
eBook Packages: Engineering, Engineering (R0)