Studying the impact of language-independent and language-specific features on hybrid Arabic Person name recognition

Oudah, Mai; Shaalan, Khaled

doi:10.1007/s10579-016-9376-1

Original Paper
Published: 26 November 2016

Volume 51, pages 351–378, (2017)

Language Resources and Evaluation

Mai Oudah¹ &
Khaled Shaalan²

Abstract

In this paper, we conduct extensive experiments to study the impact of features from different categories, both in isolation and added incrementally, on Arabic Person name recognition. We present an integrated system that combines the rule-based approach with the machine learning (ML)-based approach to form a consolidated hybrid system. Our feature space comprises language-independent and language-specific features. The explored features fall naturally into six categories: Person named entity tags predicted by the rule-based component, word-level features, POS features, morphological features, gazetteer features, and other contextual features. Because the decision tree algorithm has shown comparatively higher efficiency as a classifier in current state-of-the-art hybrid Named Entity Recognition for Arabic, it is adopted in this study as the ML technique used by the hybrid system. With the classifier fixed, the experiments focus on two dimensions: the standard dataset used and the set of selected features. Several standard datasets are used for training and testing the hybrid system, including ACE (2003–2004) and ANERcorp. The experimental analysis indicates that both language-independent and language-specific features play an important role in overcoming the challenges posed by the Arabic language and have a critical impact on optimizing the performance of the hybrid system.
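
As a rough illustration of how such a hybrid feature space might be assembled, the sketch below builds one feature row per token, with the rule-based tag included as a feature next to word-level, gazetteer, and contextual features. It is not the authors' implementation: the toy rule, the word lists, the transliterated strings, and all function names are hypothetical, and the POS and morphological categories are left out because they would need an external tagger. In a full pipeline, rows like these would be passed to a decision tree learner.

```python
from typing import Dict, List, Set

# Hypothetical toy resources (not from the paper), written as
# placeholder Latin-transliterated strings.
PERSON_GAZETTEER: Set[str] = {"mHmd", "xAld", "mY"}
HONORIFICS: Set[str] = {"Alsyd", "Aldktwr"}  # stand-ins for trigger words

# Assumed mapping from feature category to feature names, so that
# categories can be evaluated in isolation or added incrementally.
FEATURE_CATEGORIES: Dict[str, List[str]] = {
    "rule": ["rule_tag"],
    "word": ["length", "has_definite_prefix"],
    "gazetteer": ["in_person_gazetteer"],
    "context": ["prev_token", "next_token"],
}


def rule_based_tags(tokens: List[str]) -> List[str]:
    """Toy rule: a token directly after an honorific is tagged PER, else O."""
    tags = []
    for i in range(len(tokens)):
        follows_honorific = i > 0 and tokens[i - 1] in HONORIFICS
        tags.append("PER" if follows_honorific else "O")
    return tags


def extract_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Build one feature row per token, including the rule-based prediction."""
    rule_tags = rule_based_tags(tokens)
    rows = []
    for i, tok in enumerate(tokens):
        rows.append({
            "rule_tag": rule_tags[i],
            "length": len(tok),
            "has_definite_prefix": tok.startswith("Al"),
            "in_person_gazetteer": tok in PERSON_GAZETTEER,
            "prev_token": tokens[i - 1] if i > 0 else "<s>",
            "next_token": tokens[i + 1] if i + 1 < len(tokens) else "</s>",
        })
    return rows


def select_categories(row: Dict[str, object], categories: List[str]) -> Dict[str, object]:
    """Keep only the features that belong to the chosen categories."""
    return {name: row[name] for cat in categories for name in FEATURE_CATEGORIES[cat]}


tokens = ["Aldktwr", "xAld", "qAl"]
rows = extract_features(tokens)
rule_and_gazetteer = select_categories(rows[1], ["rule", "gazetteer"])
```

Calling `select_categories` with a growing list of category names gives one simple way to mimic an incremental feature study: the same rows are reused while the set of visible columns changes from run to run.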

Notes

We used the Habash–Soudi–Buckwalter transliteration scheme (Habash et al. 2007).
GATE is freely available at http://gate.ac.uk/.
WEKA is available from www.cs.waikato.ac.nz/ml/weka/.
Available for our institution under license agreement from the Linguistic Data Consortium (LDC).
Available for download from http://www1.ccls.columbia.edu/~ybenajiba/downloads.html.

References

Abdallah, S., Shaalan, K., & Shoaib, M. (2012). Integrating rule-based system with classification for Arabic named entity recognition. In Proceedings of the 13th international conference on intelligent text processing and computational linguistics (CICLing) (pp. 311–322). Berlin: Springer.
AbdelRahman, S., Elarnaoty, M., Magdy, M., & Fahmy, A. (2010). Integrated machine learning techniques for Arabic named entity recognition. International Journal of Computer Science Issues (IJCSI), 7(3), 27–36.
Abdul-Hamid, A., & Darwish, K. (2010). Simplified feature set for Arabic named entity recognition. In Proceedings of the 2010 named entities workshop (ACL 2010) (pp. 110–115).
Aboaoga, M., & Aziz, M. J. A. (2013). Arabic person names recognition by using a rule based approach. Journal of Computer Science, 9, 922–927.
Abouenour, L., Bouzoubaa, K., & Rosso, P. (2013). On the evaluation and improvement of Arabic WordNet coverage and usability. Language Resources and Evaluation, 47(3), 891–917.
Alias-I. (2008). LingPipe 4.1.0., In: LingPipe, http://alias-i.com/lingpipe. 1 Oct 2008.
Al-Sughaiyer, I., & Al-Kharashi, A. (2004). Arabic morphological analysis techniques: A comprehensive survey. Journal of the American Society for Information Science and Technology, 55, 189–213.
Babych, B., & Hartley, A. (2003). Improving machine translation quality with automatic named entity recognition. In Proceedings of the 7th international EAMT workshop on MT and other language technology tools, improving MT through other language technology tools: Resources and tools for building MT (EAMT 2003) (pp. 1–8).
Benajiba, Y., Diab, M., & Rosso, P. (2008a). Arabic named entity recognition: An SVM-based approach. In Proceedings of Arab international conference on information technology (ACIT 2008) (pp. 16–18).
Benajiba, Y., Diab, M., & Rosso, P. (2008b). Arabic named entity recognition using optimized feature sets. In Proceedings of the conference on empirical methods in natural language.
Benajiba, Y., Diab, M., & Rosso, P. (2009a). Arabic named entity recognition: A feature-driven study. IEEE Transactions on Audio, Speech and Language Processing, 17(5), 926–934.
Benajiba, Y., Diab, M., & Rosso, P. (2009b). Using language independent and language specific features to enhance Arabic named entity recognition. The International Arab Journal of Information Technology, 6(5), 464–473.
Benajiba, Y., & Rosso, P. (2007). ANERsys 2.0: Conquering the NER task for the Arabic language by combining the Maximum Entropy with POS-tag information. In Proceedings of workshop on natural language-independent engineering, 3rd Indian international conference on artificial intelligence (IICAI-2007) (pp. 1814–1823).
Benajiba, Y., & Rosso, P. (2008). Arabic named entity recognition using conditional random fields. In Proceedings of workshop on HLT & NLP within the Arabic World (LREC 2008).
Benajiba, Y., Rosso, P., & Benedí, J. M. (2007). ANERsys: An Arabic named entity recognition system based on maximum entropy. In Proceedings of the 8th international conference on computational linguistics and intelligent text processing (CICLing-2007) (pp. 143–153). Berlin: Springer.
Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing (pp. 1–8).
Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., Aswani, N., Roberts, I. et al. (2011). Text processing with GATE (Version 6), University of Sheffield Department of Computer Science.
Elsebai, A., Meziane, F., & BelKredim, F. Z. (2009). A rule based Persons names Arabic extraction system. Communications of the IBIMA, 11(6), 53–59.
Farber, B., Freitag, D., Habash, N., & Rambow, O. (2008). Improving NER in Arabic using a morphological tagger. In Proceedings of workshop on HLT & NLP within the Arabic world (LREC 2008) (pp. 2509–2514).
Farghaly, A., & Shaalan, K. (2009). Arabic natural language processing: Challenges and solutions. ACM Transactions on Asian Language Information Processing (TALIP), 8, 1–22.
Finkel, J., & Manning, C. (2009). Nested named entity recognition. In Proceedings of the 2009 conference on empirical methods in natural language processing (pp. 141–150).
Habash, N., Rambow, O., & Roth, R. (2009). MADA + TOKAN: A toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In Proceedings of the 2nd international conference on Arabic language resources and tools (MEDAR), Cairo, Egypt.
Habash, N., & Rambow, O. (2005). Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd annual meeting of the association for computational linguistics (ACL’05) (pp. 573–580).
Habash, N., Rambow, O., & Roth, R. (2010). MADA + TOKAN Manual. Technical Report CCLS-10-01, Center for Computational Learning Systems (CCLS), Columbia University.
Habash, N., Soudi, A., & Buckwalter, T. (2007). On Arabic transliteration. Arabic Computational Morphology: Knowledge-based and Empirical Methods, 38, 15–22.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 10–18.
Hamadene, A., Shaheen, M., & Badawy, O. (2011). ARQA: An intelligent Arabic question answering system. In Proceedings of Arabic language technology international conference (ALTIC 2011).
Küçük, D., & Yazıcı, A. (2012). A hybrid named entity recognizer for Turkish. Expert Systems with Applications, 39, 2733–2742.
Maloney, J., & Niv, M. (1998). TAGARAB: A fast, accurate Arabic name recognizer using high-precision morphological analysis. In Proceedings of the workshop on computational approaches to Semitic languages (Semitic 1998) (pp. 8–15).
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press.
Mayfield, J., McNamee, P., & Piatko, C. (2003). Named entity recognition using hundreds of thousands of features. In Proceedings of the 7th conference on natural language learning at HLT-NAACL 2003 (CONLL 2003) (pp. 184–187).
Maynard, D., Tablan, V., Ursu, C., Cunningham, H., & Wilks, Y. (2001). Named entity recognition from diverse text types. In Proceedings of recent advances in natural language processing 2001 conference.
Mesfar, S. (2007). Named entity recognition for Arabic using syntactic grammars. In Proceedings of the 12th international conference on application of natural language to information systems (pp. 305–316). Berlin: Springer.
Mitchell, A., Strassel, S., Huang, S., & Zakhary, R. (2005). ACE 2004 Multilingual Training Corpus, Ldc2005t09: Linguistic Data Consortium.
Mitchell, A., Strassel, S., Przybocki, M., Davis, J., Doddington, G., Grishman, R. et al. (2003). Tides extraction (ACE) 2003 Multilingual Training Data, Ldc2004t09: Linguistic Data Consortium.
Mohammed, N. F., & Omar, N. (2012). Arabic named entity recognition using artificial neural network. Journal of Computer Science, 8, 1285–1293.
Nadeau, D., & Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1), 3–26.
Oudah, M. M., & Shaalan, K. (2012). A pipeline Arabic named entity recognition using a hybrid approach. In Proceedings of the 24th international conference on computational linguistics (COLING 2012) (pp. 2159–2176).
Oudah, M., & Shaalan, K. (2013). Person name recognition using the hybrid approach. In Lecture Notes in Computer Science, Natural language processing and information systems (Vol. 7934, pp. 237–248). Springer, Berlin.
Petasis, G., Vichot, F., Wolinski, F., Paliouras, G., Karkaletsis, V., & Spyropoulos, C. D. (2001). Using machine learning to maintain rule-based named-entity recognition and classification systems. In Proceeding conference of association for computational linguistics (pp. 426–433).
Riaz, K. (2010). Rule-based named entity recognition in Urdu. In Proceedings of the 2010 named entities workshop (ACL 2010) (pp. 126–135).
Salloum, W., & Habash, N. (2012). Elissa: A dialectal to standard Arabic machine translation system. In Proceedings of the international conference on computational linguistics (pp. 385–392).
Seon, C., Ko, Y., Kim, J., & Seo, J. (2001). Named entity recognition using machine learning methods and pattern-selection rules. In Proceedings of the 6th natural language processing Pacific Rim symposium (pp. 229–236).
Shaalan, K. (2010). Rule-based approach in Arabic natural language processing. The International Journal on Information and Communication Technologies (IJICT), 3(3), 11–19.
Shaalan, K. (2014). A survey of Arabic named entity recognition and classification. Computational Linguistics, 40(2), 469–510.
Shaalan, K., & Oudah, M. (2014). A hybrid approach to Arabic named entity recognition. Journal of Information Science (JIS), 40, 67–87.
Shaalan, K., Rafea, A., Abdel Monem, A., & Baraka, H. (2004). Machine translation of English noun phrases into Arabic. The International Journal of Computer Processing of Oriental Languages (IJCPOL), 17(2), 121–134.
Shaalan, K., & Raza, H. (2007). Person name entity recognition for Arabic. In Proceedings of the 5th workshop on important unresolved matters (pp. 17–24).
Shaalan, K., & Raza, H. (2008). Arabic named entity recognition from diverse text types. In Proceedings of the 6th international conference on natural language processing (GoTAL 2008) (pp. 440–451). Berlin: Springer.
Shaalan, K., & Raza, H. (2009). NERA: Named entity recognition for Arabic. Journal of the American Society for Information Science and Technology, 60(8), 1652–1663.
Srihari, R., Niu, C., & Li, W. (2000). A hybrid approach for named entity and sub-type tagging. In Proceedings of the 6th conference on applied natural language processing (ANLC 2000) (pp. 247–254).
Toral, A., Noguera, E., Llopis, F., & Munoz, R. (2005). Improving question answering using named entity recognition. In Proceedings of the 10th international conference on Natural Language Processing and Information Systems (NLDB’05) (pp. 181–191). Berlin: Springer.
Tsai, T., Wu, S., Lee, C., Shih, C., & Hsu, W. (2004). Mencius: A Chinese named entity recognizer using the maximum entropy-based hybrid model. Computational Linguistics and Chinese Language Processing, 9, 65–82.
Zaghouani, W. (2012). RENAR: A rule-based Arabic named entity recognition system. ACM Transactions on Asian Language Information Processing, 11, 1–13.
Zhou, G., & Su, J. (2002). Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (ACL) (pp. 473–480).
Zirikly, A., & Diab, M. (2015). Named entity recognition for Arabic social media. In Proceedings of NAACL-HLT 2015 (pp. 176–185).

Acknowledgements

This research was funded by the British University in Dubai (Grant No. INF004-Using machine learning to improve Arabic named entity recognition).

Author information

Authors and Affiliations

Masdar Institute of Science and Technology, Abu Dhabi, UAE
Mai Oudah
The British University in Dubai, Dubai International Academic City, UAE
Khaled Shaalan

Corresponding author

Correspondence to Mai Oudah.

About this article

Cite this article

Oudah, M., Shaalan, K. Studying the impact of language-independent and language-specific features on hybrid Arabic Person name recognition. Lang Resources & Evaluation 51, 351–378 (2017). https://doi.org/10.1007/s10579-016-9376-1

Published: 26 November 2016
Issue Date: June 2017
DOI: https://doi.org/10.1007/s10579-016-9376-1
