
Development and evaluation of an Urdu treebank (CLE-UTB) and a statistical parser

  • Original Paper
  • Language Resources and Evaluation

A Correction to this article was published on 17 November 2020

This article has been updated

Abstract

Several natural language processing tools for Urdu have been developed in the past few years, covering word segmentation, part-of-speech (POS) tagging, chunking, named entity recognition and parsing. Corpora, especially treebanks, are essential data resources for such tools. This work presents the development and evaluation of an Urdu treebank, the CLE-UTB, and a statistical parser. The treebank is annotated with phrase structure. POS tagging was performed semi-automatically: an existing tagger assigned the tags, and annotators manually corrected the incorrect ones. The syntactic annotation follows the Penn Treebank style for marking phrases, and the annotation scheme also adds functional labels for grammatical roles. The treebank currently contains 7854 annotated sentences and 148,575 tokens. The completeness and correctness of the syntactic labels were checked automatically after manual annotation. To ensure annotation consistency, a grammar-based evaluation and an automatic consistency checking tool were used to detect linguistically implausible constituents. The inter-annotator agreement is greater than 90%. We have developed a parser and a POS tagger based on bidirectional long short-term memory (BiLSTM) networks, both trained on the final version of the treebank. Training the word embeddings on a large Urdu text corpus improved the results. The parser achieves an F-score of 88.1% and the POS tagger an accuracy of 96.3%.
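To make the parser's F-score concrete, the following is a minimal sketch of labeled bracket scoring over Penn Treebank-style bracketed trees. It is an illustration under stated assumptions, not the evaluation procedure used for the CLE-UTB: the function names (`parse_brackets`, `labeled_spans`, `bracket_f1`) are hypothetical, POS-level nodes are excluded from scoring by assumption, and none of the extra normalization that standard tools such as EVALB typically apply is performed.

```python
from collections import Counter


def parse_brackets(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def labeled_spans(tree):
    """Count (label, start, end) for every phrase node.

    Nodes whose children are all words (POS preterminals) are skipped,
    which is an assumption of this sketch.
    """
    spans = Counter()

    def walk(node, start):
        label, children = node
        end = start
        for child in children:
            end = end + 1 if isinstance(child, str) else walk(child, end)
        if any(isinstance(child, tuple) for child in children):
            spans[(label, start, end)] += 1
        return end

    walk(tree, 0)
    return spans


def bracket_f1(gold, predicted):
    """Return (precision, recall, F1) over labeled brackets of two trees."""
    gold_spans = labeled_spans(parse_brackets(gold))
    pred_spans = labeled_spans(parse_brackets(predicted))
    matched = sum((gold_spans & pred_spans).values())  # multiset intersection
    precision = matched / sum(pred_spans.values()) if pred_spans else 0.0
    recall = matched / sum(gold_spans.values()) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For example, calling `bracket_f1` on a gold tree `(S (NP (NN w1)) (VP (NP (NN w2)) (VB w3)))` and a predicted tree `(S (NP (NN w1)) (VP (NN w2) (VB w3)))`, where `w1` to `w3` are placeholder tokens, scores the prediction against four gold brackets, of which it recovers three. Labels carrying functional tags would be compared as whole strings here; whether functional labels are stripped before scoring is a design choice this sketch leaves open.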



Notes

  1. http://www.cle.org.pk.

  2. http://www.cle.org.pk/clestore/postagger.htm.

  3. https://github.com/Kaljurand/treebank-consistency-checking.

  4. http://www.cle.org.pk.

  5. https://www.urdudigest.pk.


Acknowledgements

We are thankful to Dr. Tafseer Ahmad Khan, DHA Suffa University, for preparing the annotation guidelines, which paved the way for the development of the CLE-UTB. We are also indebted to Professor Miriam Butt, University of Konstanz, for her valuable comments and feedback on this work.

Author information


Corresponding author

Correspondence to Toqeer Ehsan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original version of this article was revised: in the original publication, the column headers of Tables 17 and 18 were displayed incorrectly.

Appendix A: Tag sets


Table 19 The CLE Urdu POS tag set
Table 20 Phrase label set
Table 21 Functional label set
Table 22 Comparison of the CLE-UTB functional labels with the HUTB dependency labels


Cite this article

Ehsan, T., Hussain, S. Development and evaluation of an Urdu treebank (CLE-UTB) and a statistical parser. Lang Resources & Evaluation 55, 287–326 (2021). https://doi.org/10.1007/s10579-020-09492-7


  • DOI: https://doi.org/10.1007/s10579-020-09492-7
