Abstract
In this paper, we conduct extensive experiments to study the impact of features from different categories, both in isolation and added incrementally, on Arabic Person name recognition. We present an integrated system that combines the rule-based approach with the machine learning (ML)-based approach into a consolidated hybrid system. Our feature space comprises language-independent and language-specific features. The explored features fall naturally into six categories: Person named entity tags predicted by the rule-based component, word-level features, POS features, morphological features, gazetteer features, and other contextual features. Because the decision tree algorithm has shown comparatively higher efficiency as a classifier in current state-of-the-art hybrid Named Entity Recognition for Arabic, it is adopted in this study as the ML technique used by the hybrid system. The experiments therefore focus on two dimensions: the standard dataset used and the set of selected features. A number of standard datasets are used for training and testing the hybrid system, including ACE (2003–2004) and ANERcorp. The experimental analysis indicates that both language-independent and language-specific features play an important role in overcoming the challenges posed by the Arabic language and have a critical impact on optimizing the performance of the hybrid system.
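As a rough illustration of the hybrid idea described in the abstract, the following minimal Python sketch shows how tags predicted by a rule-based component could be combined with word-level, gazetteer, and contextual features into one feature row per token, ready to be passed to a classifier such as a decision tree. It is not the authors' implementation: the gazetteer, the trigger list, the single toy rule, the romanized example tokens, and all function names are hypothetical, and POS and morphological features (which would come from an external analyzer) are left out.

```python
from typing import Dict, List, Set

# Hypothetical toy resources for illustration only; not taken from the paper.
PERSON_GAZETTEER: Set[str] = {"mHmd"}
TITLE_TRIGGERS: Set[str] = {"Aldktwr"}


def rule_based_tags(tokens: List[str]) -> List[str]:
    """Toy rule-based component: tag a token PER if it is in the
    gazetteer or directly follows a title trigger word, else O."""
    tags = []
    for i, tok in enumerate(tokens):
        after_trigger = i > 0 and tokens[i - 1] in TITLE_TRIGGERS
        tags.append("PER" if after_trigger or tok in PERSON_GAZETTEER else "O")
    return tags


def token_features(tokens: List[str], rule_tags: List[str], i: int,
                   window: int = 1) -> Dict[str, object]:
    """Build one feature row for token i from three of the six categories:
    rule-based tag, word-level, gazetteer, plus context-window features."""
    feats: Dict[str, object] = {
        "rule_tag": rule_tags[i],                        # rule-based prediction
        "length": len(tokens[i]),                        # word-level feature
        "in_gazetteer": tokens[i] in PERSON_GAZETTEER,   # gazetteer feature
    }
    # Contextual features: look at neighbours within the window.
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        inside = 0 <= j < len(tokens)
        feats[f"rule_tag[{off:+d}]"] = rule_tags[j] if inside else "PAD"
        feats[f"is_trigger[{off:+d}]"] = inside and tokens[j] in TITLE_TRIGGERS
    return feats


def build_feature_rows(tokens: List[str], window: int = 1) -> List[Dict[str, object]]:
    """Run the toy rule-based component, then emit one feature row per token."""
    tags = rule_based_tags(tokens)
    return [token_features(tokens, tags, i, window) for i in range(len(tokens))]


if __name__ == "__main__":
    for row in build_feature_rows(["qAl", "Aldktwr", "xAld"]):
        print(row)
```

In a full pipeline, rows like these would be paired with gold labels from an annotated corpus and used to train the ML component; dropping or adding feature groups one at a time is one way to study their individual and incremental impact.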
Notes
We used the Habash–Soudi–Buckwalter transliteration scheme (Habash et al. 2007).
GATE is freely available at http://gate.ac.uk/.
WEKA is available from www.cs.waikato.ac.nz/ml/weka/.
Available for our institution under license agreement from the Linguistic Data Consortium (LDC).
Available for download from http://www1.ccls.columbia.edu/~ybenajiba/downloads.html.
References
Abdallah, S., Shaalan, K., & Shoaib, M. (2012). Integrating rule-based system with classification for Arabic named entity recognition. In Proceedings of the 13th international conference on intelligent text processing and computational linguistics (CICLing) (pp. 311–322). Berlin: Springer.
AbdelRahman, S., Elarnaoty, M., Magdy, M., & Fahmy, A. (2010). Integrated machine learning techniques for Arabic named entity recognition. International Journal of Computer Science Issues (IJCSI), 7(3), 27–36.
Abdul-Hamid, A., & Darwish, K. (2010). Simplified feature set for Arabic named entity recognition. In Proceedings of the 2010 named entities workshop (ACL 2010) (pp. 110–115).
Aboaoga, M., & Aziz, M. J. A. (2013). Arabic person names recognition by using a rule based approach. Journal of Computer Science, 9, 922–927.
Abouenour, L., Bouzoubaa, K., & Rosso, P. (2013). On the evaluation and improvement of Arabic WordNet coverage and usability. Language Resources and Evaluation, 47(3), 891–917.
Alias-I. (2008). LingPipe 4.1.0. http://alias-i.com/lingpipe. 1 Oct 2008.
Al-Sughaiyer, I., & Al-Kharashi, A. (2004). Arabic morphological analysis techniques: A comprehensive survey. Journal of the American Society for Information Science and Technology, 55, 189–213.
Babych, B., & Hartley, A. (2003). Improving machine translation quality with automatic named entity recognition. In Proceedings of the 7th international EAMT workshop on MT and other language technology tools, improving MT through other language technology tools: Resources and tools for building MT (EAMT 2003) (pp. 1–8).
Benajiba, Y., Diab, M., & Rosso, P. (2008a). Arabic named entity recognition: An SVM-based approach. In Proceedings of Arab international conference on information technology (ACIT 2008) (pp. 16–18).
Benajiba, Y., Diab, M., & Rosso, P. (2008b). Arabic named entity recognition using optimized feature sets. In Proceedings of the conference on empirical methods in natural language.
Benajiba, Y., Diab, M., & Rosso, P. (2009a). Arabic named entity recognition: A feature-driven study. IEEE Transactions on Audio, Speech and Language Processing, 17(5), 926–934.
Benajiba, Y., Diab, M., & Rosso, P. (2009b). Using language independent and language specific features to enhance Arabic named entity recognition. The International Arab Journal of Information Technology, 6(5), 464–473.
Benajiba, Y., & Rosso, P. (2007). ANERsys 2.0: Conquering the NER task for the Arabic language by combining the Maximum Entropy with POS-tag information. In Proceedings of workshop on natural language-independent engineering, 3rd Indian international conference on artificial intelligence (IICAI-2007) (pp. 1814–1823).
Benajiba, Y., & Rosso, P. (2008). Arabic named entity recognition using conditional random fields. In Proceedings of workshop on HLT & NLP within the Arabic World (LREC 2008).
Benajiba, Y., Rosso, P., & Benedí, J. M. (2007). ANERsys: An Arabic named entity recognition system based on maximum entropy. In Proceedings of the 8th international conference on computational linguistics and intelligent text processing (CICLing-2007) (pp. 143–153). Berlin: Springer.
Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing (pp. 1–8).
Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., Aswani, N., Roberts, I. et al. (2011). Text processing with GATE (Version 6), University of Sheffield Department of Computer Science.
Elsebai, A., Meziane, F., & BelKredim, F. Z. (2009). A rule based Persons names Arabic extraction system. Communications of the IBIMA, 11(6), 53–59.
Farber, B., Freitag, D., Habash, N., & Rambow, O. (2008). Improving NER in Arabic using a morphological tagger. In Proceedings of workshop on HLT & NLP within the Arabic world (LREC 2008) (pp. 2509–2514).
Farghaly, A., & Shaalan, K. (2009). Arabic natural language processing: Challenges and solutions. ACM Transactions on Asian Language Information Processing (TALIP), 8, 1–22.
Finkel, J., & Manning, C. (2009). Nested named entity recognition. In Proceedings of the 2009 conference on empirical methods in natural language processing (pp. 141–150).
Habash, N., Rambow, O., & Roth, R. (2009). MADA + TOKAN: A toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In Proceedings of the 2nd international conference on Arabic language resources and tools (MEDAR), Cairo, Egypt.
Habash, N., & Rambow, O. (2005). Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd annual meeting of the association for computational linguistics (ACL’05) (pp. 573–580).
Habash, N., Rambow, O., & Roth, R. (2010). MADA + TOKAN Manual. Technical Report CCLS-10-01, Center for Computational Learning Systems (CCLS), Columbia University.
Habash, N., Soudi, A., & Buckwalter, T. (2007). On Arabic transliteration. Arabic Computational Morphology: Knowledge-based and Empirical Methods, 38, 15–22.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 10–18.
Hamadene, A., Shaheen, M., & Badawy, O. (2011). ARQA: An intelligent Arabic question answering system. In Proceedings of Arabic language technology international conference (ALTIC 2011).
Küçük, D., & Yazıcı, A. (2012). A hybrid named entity recognizer for Turkish. Expert Systems with Applications, 39, 2733–2742.
Maloney, J., & Niv, M. (1998). TAGARAB: A fast, accurate Arabic name recognizer using high-precision morphological analysis. In Proceedings of the workshop on computational approaches to Semitic languages (Semitic 1998) (pp. 8–15).
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press.
Mayfield, J., McNamee, P., & Piatko, C. (2003). Named entity recognition using hundreds of thousands of features. In Proceedings of the 7th conference on natural language learning at HLT-NAACL 2003 (CONLL 2003) (pp. 184–187).
Maynard, D., Tablan, V., Ursu, C., Cunningham, H., & Wilks, Y. (2001). Named entity recognition from diverse text types. In Proceedings of recent advances in natural language processing 2001 conference.
Mesfar, S. (2007). Named entity recognition for Arabic using syntactic grammars. In Proceedings of the 12th international conference on application of natural language to information systems (pp. 305–316). Berlin: Springer.
Mitchell, A., Strassel, S., Huang, S., & Zakhary, R. (2005). ACE 2004 Multilingual Training Corpus, LDC2005T09. Linguistic Data Consortium.
Mitchell, A., Strassel, S., Przybocki, M., Davis, J., Doddington, G., Grishman, R. et al. (2003). TIDES extraction (ACE) 2003 Multilingual Training Data, LDC2004T09. Linguistic Data Consortium.
Mohammed, N. F., & Omar, N. (2012). Arabic named entity recognition using artificial neural network. Journal of Computer Science, 8, 1285–1293.
Nadeau, D., & Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1), 3–26.
Oudah, M. M., & Shaalan, K. (2012). A pipeline Arabic named entity recognition using a hybrid approach. In Proceedings of the 24th international conference on computational linguistics (COLING 2012) (pp. 2159–2176).
Oudah, M., & Shaalan, K. (2013). Person name recognition using the hybrid approach. In Lecture Notes in Computer Science, Natural language processing and information systems (Vol. 7934, pp. 237–248). Springer, Berlin.
Petasis, G., Vichot, F., Wolinski, F., Paliouras, G., Karkaletsis, V., & Spyropoulos, C. D. (2001). Using machine learning to maintain rule-based named-entity recognition and classification systems. In Proceeding conference of association for computational linguistics (pp. 426–433).
Riaz, K. (2010). Rule-based named entity recognition in Urdu. In Proceedings of the 2010 named entities workshop (ACL 2010) (pp. 126–135).
Salloum, W., & Habash, N. (2012). Elissa: A dialectal to standard Arabic machine translation system. In Proceedings of the international conference on computational linguistics (pp. 385–392).
Seon, C., Ko, Y., Kim, J., & Seo, J. (2001). Named entity recognition using machine learning methods and pattern-selection rules. In Proceedings of the 6th natural language processing Pacific Rim symposium (pp. 229–236).
Shaalan, K. (2010). Rule-based approach in Arabic natural language processing. The International Journal on Information and Communication Technologies (IJICT), 3(3), 11–19.
Shaalan, K. (2014). A survey of Arabic named entity recognition and classification. Computational Linguistics, 40(2), 469–510.
Shaalan, K., & Oudah, M. (2014). A hybrid approach to Arabic named entity recognition. Journal of Information Science (JIS), 40, 67–87.
Shaalan, K., Rafea, A., Abdel Monem, A., & Baraka, H. (2004). Machine translation of English noun phrases into Arabic. The International Journal of Computer Processing of Oriental Languages (IJCPOL), 17(2), 121–134.
Shaalan, K., & Raza, H. (2007). Person name entity recognition for Arabic. In Proceedings of the 5th workshop on important unresolved matters (pp. 17–24).
Shaalan, K., & Raza, H. (2008). Arabic named entity recognition from diverse text types. In Proceedings of the 6th international conference on natural language processing (GoTAL 2008) (pp. 440–451). Berlin: Springer.
Shaalan, K., & Raza, H. (2009). NERA: Named entity recognition for Arabic. Journal of the American Society for Information Science and Technology, 60(8), 1652–1663.
Srihari, R., Niu, C., & Li, W. (2000). A hybrid approach for named entity and sub-type tagging. In Proceedings of the 6th conference on applied natural language processing (ANLC 2000) (pp. 247–254).
Toral, A., Noguera, E., Llopis, F., & Muñoz, R. (2005). Improving question answering using named entity recognition. In Proceedings of the 10th international conference on Natural Language Processing and Information Systems (NLDB’05) (pp. 181–191). Berlin: Springer.
Tsai, T., Wu, S., Lee, C., Shih, C., & Hsu, W. (2004). Mencius: A Chinese named entity recognizer using the maximum entropy-based hybrid model. Computational Linguistics and Chinese Language Processing, 9, 65–82.
Zaghouani, W. (2012). RENAR: A rule-based Arabic named entity recognition system. ACM Transactions on Asian Language Information Processing, 11, 1–13.
Zhou, G., & Su, J. (2002). Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (ACL) (pp. 473–480).
Zirikly, A., & Diab, M. (2015). Named entity recognition for Arabic social media. In Proceedings of NAACL-HLT 2015 (pp. 176–185).
Acknowledgements
This research was funded by the British University in Dubai (Grant No. INF004-Using machine learning to improve Arabic named entity recognition).
Cite this article
Oudah, M., Shaalan, K. Studying the impact of language-independent and language-specific features on hybrid Arabic Person name recognition. Lang Resources & Evaluation 51, 351–378 (2017). https://doi.org/10.1007/s10579-016-9376-1
DOI: https://doi.org/10.1007/s10579-016-9376-1