short-paper

Keyword Extraction from Arabic Documents using Term Equivalence Classes

Author:

Arafat Awajan

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 14, Issue 2

Article No.: 7, Pages 1 - 18

https://doi.org/10.1145/2665077

Published: 20 April 2015

Abstract

The rapid growth of the Internet and other computing facilities in recent years has produced a large amount of text in electronic form, which has increased both the interest in and the importance of automatic text processing applications, including keyword extraction and term indexing. Although keywords are very useful for many applications, most documents available online are not provided with keywords. We describe a method for extracting keywords from Arabic documents. The method identifies keywords by combining linguistic and statistical analysis of the text, without using prior knowledge of its domain or information from any related corpus. The text is first preprocessed to extract the main linguistic information, such as the roots and morphological patterns of derivative words. A cleaning phase then eliminates the meaningless words from the text. The most frequent terms are clustered into equivalence classes, in which derivative words generated from the same root and non-derivative words generated from the same stem are placed together and their counts are accumulated. A vector space model is then used to capture the most frequent N-grams in the text. Experiments carried out on a real-world dataset show that the proposed method achieves an average precision of 31% and an average recall of 53% when tested against manually assigned keywords.
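The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the transliterated toy tokens, the pre-assigned root keys (standing in for the output of a morphological analyzer), the function names, and the N-gram scoring rule (sum of class counts, used here in place of the paper's vector space model) are all assumptions made for illustration. For simplicity, N-grams are formed over the cleaned token sequence, so they may span a removed word.

```python
from collections import Counter, defaultdict

# Hypothetical analyzer output: (surface word, class key) pairs. The key is the
# root for derivative words or the stem for non-derivative words; None marks a
# word that the cleaning phase should discard. Toy transliterated data only.
ANALYZED = [
    ("kitab", "ktb"), ("fi", None), ("maktaba", "ktb"), ("jadida", "jdd"),
    ("katib", "ktb"), ("fi", None), ("maktaba", "ktb"), ("jadida", "jdd"),
]

def build_classes(analyzed):
    """Group words sharing a root/stem key and accumulate their counts."""
    counts = Counter()
    members = defaultdict(set)
    for surface, key in analyzed:
        if key is None:  # cleaning phase: skip words without a content key
            continue
        counts[key] += 1
        members[key].add(surface)
    return counts, members

def rank_ngrams(analyzed, counts, n=2, top=3):
    """Rank n-grams of cleaned tokens by the summed counts of their classes.

    This scoring is an assumed stand-in, not the paper's vector space model.
    """
    content = [(s, k) for s, k in analyzed if k is not None]
    scores = {}
    for i in range(len(content) - n + 1):
        window = content[i:i + n]
        phrase = " ".join(s for s, _ in window)
        scores[phrase] = sum(counts[k] for _, k in window)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

def precision_recall(extracted, reference):
    """Compare extracted keywords with manually assigned ones."""
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall

counts, members = build_classes(ANALYZED)
print(counts.most_common())
print(rank_ngrams(ANALYZED, counts))
```

In this toy input, the three surface forms kitab, maktaba, and katib fall into one class because they share the root key, so that class outweighs any single surface form. This is the effect the equivalence classes are meant to achieve.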

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 14, Issue 2

March 2015

96 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/2764912

Editor:
Richard Sproat
Google, Inc., USA

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 April 2015

Accepted: 01 August 2014

Revised: 01 February 2014

Received: 01 November 2013

Published in TALLIP Volume 14, Issue 2

Qualifiers

Short-paper
Research
Refereed
