Abstract
A number of natural language processing tools for Urdu have been developed in the past few years for word segmentation, part of speech tagging, chunking, named entity recognition and parsing. Corpora, especially treebanks, are essential data resources for language processing. This work presents the development and evaluation of an Urdu treebank, the CLE-UTB, and a statistical parser. The treebank has been annotated with phrase structure. Part of speech tagging was performed semi-automatically: an existing tagger assigned the tags, and annotators manually corrected the incorrect ones. The syntactic annotation follows the Penn Treebank style for marking phrases, and the annotation scheme additionally provides functional labels for grammatical roles. Currently, the treebank contains 7854 annotated sentences and 148,575 tokens. Completeness and correctness of the syntactic labels were checked automatically after manual annotation. To ensure the annotation consistency of the resource, a grammar-based evaluation and an automatic consistency checking tool were used to detect linguistically implausible constituents. The inter-annotator agreement is greater than 90%. We have developed a bidirectional long short-term memory (BiLSTM) based parser and a POS tagger, both trained on the final version of the treebank. We improved our results by training the word embeddings on a large Urdu text corpus. Our parser achieved an F-score of 88.1% and the POS tagger an accuracy of 96.3%.
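Parser F-scores on phrase structure treebanks are commonly computed over labeled brackets (PARSEVAL-style). The abstract does not state which scorer was used, so the following is only a minimal sketch under that assumption. The function names `labeled_spans` and `bracket_prf`, the toy trees, the placeholder tokens `w1`..`w3` and the labels `S`, `NP`, `VP`, `NN`, `VB` are illustrative and are not taken from the CLE-UTB.

```python
from collections import Counter


def labeled_spans(tree: str) -> Counter:
    """Collect (label, start, end) spans from a Penn-style bracketed tree.

    Preterminals (a POS tag directly over a word) are skipped, as is usual
    in labeled-bracket scoring.
    """
    tokens = tree.replace("(", " ( ").replace(")", " ) ").split()
    spans = Counter()
    stack = []  # each entry: [label, start position, directly dominates a word?]
    pos = 0     # number of words consumed so far
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            # The token after an opening bracket is the node label.
            stack.append([tokens[i + 1], pos, False])
            i += 2
        elif tok == ")":
            label, start, is_preterminal = stack.pop()
            if not is_preterminal:
                spans[(label, start, pos)] += 1
            i += 1
        else:
            # A word: mark its parent as a preterminal and advance.
            stack[-1][2] = True
            pos += 1
            i += 1
    return spans


def bracket_prf(gold: str, predicted: str) -> tuple:
    """Return labeled-bracket (precision, recall, F1) for one sentence."""
    gold_spans = labeled_spans(gold)
    pred_spans = labeled_spans(predicted)
    matched = sum((gold_spans & pred_spans).values())
    precision = matched / sum(pred_spans.values()) if pred_spans else 0.0
    recall = matched / sum(gold_spans.values()) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy trees; tokens and labels are placeholders.
gold_tree = "(S (NP (NN w1)) (VP (NP (NN w2)) (VB w3)))"
pred_tree = "(S (NP (NN w1) (NN w2)) (VP (VB w3)))"
p, r, f1 = bracket_prf(gold_tree, pred_tree)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy pair only the `S` bracket matches, so the score is low. A corpus-level figure would normally sum matched, gold and predicted bracket counts over all sentences before computing the ratios, rather than averaging per-sentence scores.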
Change history
17 November 2020
A Correction to this paper has been published: https://doi.org/10.1007/s10579-020-09518-0
Acknowledgements
We are thankful to Dr. Tafseer Ahmad Khan, DHA Suffa University, for preparing the annotation guidelines, which paved the way for the development of the CLE-UTB. We are also indebted to Professor Miriam Butt, University of Konstanz, for her valuable comments and feedback on this work.
Additional information
The original version of this article was revised: In the original publication of the article the column headers of the Tables 17 and 18 were incorrectly displayed.
Appendix A: Tag sets
Cite this article
Ehsan, T., Hussain, S. Development and evaluation of an Urdu treebank (CLE-UTB) and a statistical parser. Lang Resources & Evaluation 55, 287–326 (2021). https://doi.org/10.1007/s10579-020-09492-7
DOI: https://doi.org/10.1007/s10579-020-09492-7