DOI: 10.1145/2908446.2908494
research-article

Keyphrase-Based Hierarchical Clustering for Arabic Documents

Published: 09 May 2016

Abstract

The vast number of Arabic web pages and text files available on the internet makes it necessary to organize this data so that users can browse it easily. Document clustering addresses this problem by grouping similar documents into clusters with meaningful labels. In this paper, we propose a domain-independent approach that builds a meaningful hierarchical clustering tree. The approach overcomes the high dimensionality of the feature vector by representing each document by its keyphrases. In addition, we introduce a new similarity measure based on the keyphrases, in lemma form, that the feature vectors of two documents have in common. This improves clustering accuracy while reducing complexity. Experiments are carried out to evaluate clustering accuracy using string-based, corpus-based, and knowledge-based similarity measures, on a dataset of 345 Arabic documents covering 12 domains. The results show that keyphrase-based lexical similarity gives more accurate cluster labels than semantic similarity. The best purity achieved is 0.955, obtained with the common lemma-form keyphrase similarity method.
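The two quantities the abstract relies on, a similarity between keyphrase sets and the purity score, can be illustrated with a minimal Python sketch. The abstract does not give the exact formula for the common lemma-form keyphrase similarity, so the Jaccard overlap used here is an assumption standing in for it, not the authors' measure. The function names and the toy keyphrases are hypothetical. Purity follows its standard definition: the fraction of documents that belong to the majority class of their cluster.

```python
from collections import Counter


def common_keyphrase_similarity(keyphrases_a, keyphrases_b):
    """Overlap between two sets of lemmatized keyphrases.

    ASSUMPTION: Jaccard overlap is used as a stand-in; the paper's exact
    formula is not stated in the abstract.
    """
    a, b = set(keyphrases_a), set(keyphrases_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def purity(cluster_ids, class_labels):
    """Standard purity: (1/N) * sum over clusters of the majority-class count."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")
    if not cluster_ids:
        return 0.0
    # Count how many documents of each true class fall into each cluster.
    per_cluster = {}
    for cluster_id, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cluster_id, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: keyphrases are assumed to be already in lemma form.
doc_a = {"economy", "market", "bank"}
doc_b = {"market", "bank", "sport"}
print(common_keyphrase_similarity(doc_a, doc_b))  # 2 shared of 4 distinct -> 0.5

clusters = [0, 0, 0, 1, 1]
domains = ["economy", "economy", "sport", "sport", "sport"]
print(purity(clusters, domains))  # (2 + 2) / 5 -> 0.8
```

A set-based overlap needs no vocabulary-sized vector per document, which is consistent with the abstract's point about avoiding high-dimensional feature vectors.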




Published In

INFOS '16: Proceedings of the 10th International Conference on Informatics and Systems
May 2016
347 pages
ISBN:9781450340625
DOI:10.1145/2908446
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Agglomerative Hierarchical document clustering
  2. Keyphrase
  3. Lemma
  4. Lexical similarity
  5. Semantic similarity

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

INFOS '16


Cited By

  • (2022) Exploring text representation impact on K-means based arabic text documents clustering. 2022 International Conference on Intelligent Systems and Computer Vision (ISCV), 1-5. DOI: 10.1109/ISCV54655.2022.9806067. Online publication date: 18-May-2022
  • (2022) Arabic Document Clustering: A Survey. 2022 4th International Conference on Current Research in Engineering and Science Applications (ICCRESA), 59-64. DOI: 10.1109/ICCRESA57091.2022.10352511. Online publication date: 20-Dec-2022
  • (2021) Application of big data language recognition technology and GPU parallel computing in English teaching visualization system. International Journal of Speech Technology 25:3, 667-677. DOI: 10.1007/s10772-021-09904-1. Online publication date: 18-Oct-2021
  • (2018) Revisiting K-Means and Topic Modeling, a Comparison Study to Cluster Arabic Documents. IEEE Access 6, 42740-42749. DOI: 10.1109/ACCESS.2018.2852648. Online publication date: 2018
