short-paper

Keyword Extraction from Arabic Documents using Term Equivalence Classes

Published: 20 April 2015

Abstract

The rapid growth of the Internet and other computing facilities in recent years has resulted in the creation of a large amount of text in electronic form, which has increased the interest in and importance of different automatic text processing applications, including keyword extraction and term indexing. Although keywords are very useful for many applications, most documents available online are not provided with keywords. We describe a method for extracting keywords from Arabic documents. This method identifies the keywords by combining linguistic and statistical analysis of the text without using prior knowledge from its domain or information from any related corpus. The text is preprocessed to extract the main linguistic information, such as the roots and morphological patterns of derivative words. A cleaning phase is then applied to eliminate the meaningless words from the text. The most frequent terms are clustered into equivalence classes in which the derivative words generated from the same root and the non-derivative words generated from the same stem are placed together, and their counts are accumulated. A vector space model is then used to capture the most frequent N-grams in the text. Experiments carried out using a real-world dataset show that the proposed method achieves good results with an average precision of 31% and average recall of 53% when tested against manually assigned keywords.
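The equivalence-class step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a morphological analyzer has already labeled each token with either a root (derivative words) or a stem (non-derivative words). The lexicon `TOY_LEXICON` and the functions `toy_analyze` and `build_equivalence_classes` are hypothetical names introduced here for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicon standing in for a real morphological analyzer.
# Each surface form maps to ("root", r) for derivative words or
# ("stem", s) for non-derivative words.
TOY_LEXICON = {
    "كتاب": ("root", "كتب"),        # book
    "كاتب": ("root", "كتب"),        # writer
    "مكتبة": ("root", "كتب"),       # library
    "إنترنت": ("stem", "إنترنت"),   # internet (loanword, no Arabic root)
    "الإنترنت": ("stem", "إنترنت"), # the internet
}


def toy_analyze(token):
    """Return (kind, key) for a token; unknown tokens form their own class."""
    return TOY_LEXICON.get(token, ("stem", token))


def build_equivalence_classes(tokens, analyze):
    """Group tokens by shared root or stem and accumulate their counts."""
    class_counts = Counter()
    members = defaultdict(set)
    for token in tokens:
        class_key = analyze(token)
        class_counts[class_key] += 1
        members[class_key].add(token)
    return class_counts, members


tokens = ["كتاب", "كاتب", "مكتبة", "الإنترنت", "إنترنت", "كتاب"]
counts, members = build_equivalence_classes(tokens, toy_analyze)

# Rank the classes by accumulated frequency, most frequent first.
ranked = counts.most_common()
```

Here the four tokens derived from the root كتب fall into one class with an accumulated count of 4, and the two forms of إنترنت share a second class with a count of 2. Ranking by accumulated count rather than by surface form is what lets morphological variants reinforce each other.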

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 14, Issue 2
    March 2015
    96 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/2764912
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 20 April 2015
    Accepted: 01 August 2014
    Revised: 01 February 2014
    Received: 01 November 2013
    Published in TALLIP Volume 14, Issue 2

    Author Tags

    1. Arabic natural language processing
    2. Keyword extraction
    3. Term equivalence classes
    4. Text analysis

    Qualifiers

    • Short-paper
    • Research
    • Refereed
