Abstract
Tweets offer a novel means of communication that enables users all over the world to share real-time news and ideas. The massive volume of tweets generated regularly by Arabic speakers has led to growing interest in building Arabic named entity recognition (NER) systems that can handle informal, colloquial Arabic. The unique characteristics of the Arabic language make Arabic NER a challenging task, and the informal nature of tweets complicates it further. Most previous work on Arabic NER was concerned with formal Modern Standard Arabic (MSA). Moreover, taggers and parsers were often used to resolve the ambiguity of Arabic persons’ names. Although previously developed approaches perform well on MSA text, they are not suited to colloquial Arabic. This paper introduces a hybrid approach for extracting Arabic persons’ names from tweets, together with a way to resolve their ambiguity using context bigram patterns. The introduced approach attempts to avoid any language-dependent resources. Evaluation of the presented approach shows a 7% improvement in F-score over the best result reported in the literature.
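The abstract does not spell out how the context bigram patterns are built or scored, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a candidate token is accepted as a person's name when the two tokens before it or the two tokens after it match a bigram seen around labelled names. The function names, the padding symbols, the `min_count` threshold, and the English stand-in tokens (used here for readability instead of Arabic text) are all hypothetical.

```python
from collections import Counter


def context_bigrams(tokens, i):
    """Return the bigram before and the bigram after position i (edge-padded)."""
    padded = ["<s>", "<s>"] + list(tokens) + ["</s>", "</s>"]
    j = i + 2  # index of the candidate token inside the padded list
    left = (padded[j - 2], padded[j - 1])
    right = (padded[j + 1], padded[j + 2])
    return left, right


def learn_patterns(labelled_tweets):
    """Count context bigrams around tokens labelled as person names.

    labelled_tweets: iterable of (tokens, name_positions) pairs (assumed format).
    """
    counts = Counter()
    for tokens, name_positions in labelled_tweets:
        for i in name_positions:
            left, right = context_bigrams(tokens, i)
            counts[("L", left)] += 1
            counts[("R", right)] += 1
    return counts


def looks_like_name(tokens, i, patterns, min_count=1):
    """Accept the token at position i if its context matches learned patterns."""
    left, right = context_bigrams(tokens, i)
    score = patterns[("L", left)] + patterns[("R", right)]
    return score >= min_count


# Toy data: "karim" can be a given name or an ordinary adjective.
training = [("yesterday mr karim visited the city".split(), [2])]
patterns = learn_patterns(training)

name_context = "yesterday mr amal visited the school".split()
other_context = "he is a karim man indeed".split()
print(looks_like_name(name_context, 2, patterns))
print(looks_like_name(other_context, 3, patterns))
```

In this toy setting the first candidate shares both of its context bigrams with the training example, while the second shares none, so only the first is accepted. A real system would need far more training contexts and a tuned threshold.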
Notes
1. Diacritics are special marks placed above or below letters.
2. http://tmrg.nileu.edu.eg/downloads.html (under the title: Persons’ Names Dictionaries).
3. The rest is used to collect a test dataset.
4. Waikato Environment for Knowledge Analysis.
5. We experimented with SVM and BayesNET. Although their results were better in the classification phase, they were significantly lower in the extraction phase.
6. The authors omitted the organizations’ names class because it has a low frequency in the annotated data.
7. LDC2012T09: GALE Arabic-Dialect/English Parallel Text.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Zayed, O.H., El-Beltagy, S.R. (2015). A Hybrid Approach for Extracting Arabic Persons’ Names and Resolving Their Ambiguity from Twitter. In: Biemann, C., Handschuh, S., Freitas, A., Meziane, F., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2015. Lecture Notes in Computer Science, vol 9103. Springer, Cham. https://doi.org/10.1007/978-3-319-19581-0_32
DOI: https://doi.org/10.1007/978-3-319-19581-0_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19580-3
Online ISBN: 978-3-319-19581-0
eBook Packages: Computer Science, Computer Science (R0)