A Hybrid Approach for Extracting Arabic Persons’ Names and Resolving Their Ambiguity from Twitter

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9103)

Abstract

Tweets offer a novel way of communication that enables users all over the world to share real-time news and ideas. The massive number of tweets generated regularly by Arabic speakers has resulted in a growing interest in building Arabic named entity recognition (NER) systems that deal with informal colloquial Arabic. The unique characteristics of the Arabic language make Arabic NER a challenging task, which the informal nature of tweets further complicates. The majority of previous works addressing Arabic NER were concerned with formal Modern Standard Arabic (MSA). Moreover, taggers and parsers were often utilized to solve the ambiguity problem of Arabic persons’ names. Although previously developed approaches perform well on MSA text, they are not suited for colloquial Arabic. This paper introduces a hybrid approach to extract Arabic persons’ names from tweets, in addition to a way to resolve their ambiguity using context bigram patterns. The introduced approach attempts not to use any language-dependent resources. Evaluation of the presented approach shows a 7% improvement in the F-score over the best reported result in the literature.

Notes

  1. Diacritics are special marks placed above or below the letters.

  2. http://tmrg.nileu.edu.eg/downloads.html (under the title: Persons’ Names Dictionaries).

  3. The rest is used to collect a test dataset.

  4. Waikato Environment for Knowledge Analysis.

  5. We experimented with SVM and BayesNET. Although the results in the classification phase were better, they were significantly lower in the extraction phase.

  6. The authors omitted the organizations’ names class because it has a low frequency in the annotated data.

  7. LDC2012T09: GALE Arabic-Dialect/English Parallel Text.

Author information

Corresponding author

Correspondence to Omnia H. Zayed.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Zayed, O.H., El-Beltagy, S.R. (2015). A Hybrid Approach for Extracting Arabic Persons’ Names and Resolving Their Ambiguity from Twitter. In: Biemann, C., Handschuh, S., Freitas, A., Meziane, F., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2015. Lecture Notes in Computer Science, vol 9103. Springer, Cham. https://doi.org/10.1007/978-3-319-19581-0_32

  • DOI: https://doi.org/10.1007/978-3-319-19581-0_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-19580-3

  • Online ISBN: 978-3-319-19581-0

  • eBook Packages: Computer Science, Computer Science (R0)
