Development and evaluation of an Urdu treebank (CLE-UTB) and a statistical parser

Ehsan, Toqeer; Hussain, Sarmad

doi:10.1007/s10579-020-09492-7

Original Paper
Published: 18 July 2020

Volume 55, pages 287–326, (2021)
Language Resources and Evaluation

A Correction to this article was published on 17 November 2020

This article has been updated

Abstract

In the past few years, a number of natural language processing tools have been developed for Urdu, covering word segmentation, part-of-speech (POS) tagging, chunking, named entity recognition and parsing. Corpora, especially treebanks, are essential data resources for language processing. This work presents the development and evaluation of an Urdu treebank, the CLE-UTB, and a statistical parser. The treebank carries phrase structure annotation. POS tagging was performed semi-automatically: an existing tagger assigned the tags, and annotators manually corrected the incorrect ones. The syntactic annotation marks phrases in the Penn Treebank style, and the annotation scheme also adds functional labels for grammatical roles. The treebank currently contains 7854 annotated sentences and 148,575 tokens. After manual annotation, the completeness and correctness of the syntactic labels were checked automatically. To ensure the annotation consistency of the resource, a grammar-based evaluation and an automatic consistency checking tool were used to detect linguistically implausible constituents. The inter-annotator agreement is greater than 90%. We developed a parser and a POS tagger based on bidirectional long short-term memory (BiLSTM) networks and trained both on the final version of the treebank. Training the word embeddings on a large Urdu text corpus improved our results. Our parser achieved an F-score of 88.1%, and the POS tagger an accuracy of 96.3%.
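The parser result above is reported as an F-score over phrase structure trees. As a minimal sketch of how such a score is commonly computed for Penn Treebank style bracketings, the Python below compares the labelled spans of a gold tree and a predicted tree. This preview does not say which scoring tool or settings the authors used, so the function names (`labelled_spans`, `bracket_f1`), the phrase labels and the placeholder tokens `w1` to `w3` are illustrative assumptions and are not taken from the CLE-UTB.

```python
from collections import Counter


def labelled_spans(tree_str):
    """Collect (label, start, end) for every phrase in a bracketed tree.

    Preterminals (a POS tag directly over one word) are skipped, so only
    phrase-level brackets are counted. Hypothetical helper for illustration.
    """
    tokens = tree_str.replace("(", " ( ").replace(")", " ) ").split()
    spans = Counter()
    stack = []  # each entry: [label, start word index, number of child brackets]
    pos = 0     # number of words consumed so far
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            if stack:
                stack[-1][2] += 1            # parent gains a child bracket
            stack.append([tokens[i + 1], pos, 0])
            i += 2                           # skip "(" and the label
        elif tok == ")":
            label, start, n_children = stack.pop()
            if n_children > 0:               # a phrase, not a preterminal
                spans[(label, start, pos)] += 1
            i += 1
        else:
            pos += 1                         # a word
            i += 1
    return spans


def bracket_f1(gold_str, pred_str):
    """Return labelled bracket precision, recall and F1 for one sentence."""
    gold = labelled_spans(gold_str)
    pred = labelled_spans(pred_str)
    matched = sum((gold & pred).values())    # multiset intersection
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Illustrative trees only; labels and tokens are placeholders.
gold = "(S (NP (N w1) (N w2)) (VP (V w3)))"
pred = "(S (NP (N w1)) (VP (N w2) (V w3)))"
precision, recall, f1 = bracket_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy pair only the `S` bracket matches, so one of three brackets is correct on each side. A corpus-level score would sum matched, gold and predicted bracket counts over all sentences before computing the ratios, and established scorers apply further conventions (for example around punctuation) that this sketch leaves out.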

Change history

17 November 2020
A Correction to this paper has been published: https://doi.org/10.1007/s10579-020-09518-0

Acknowledgements

We are thankful to Dr. Tafseer Ahmad Khan, DHA Suffa University, for preparing the annotation guidelines that paved the way for the development of the CLE-UTB. We are also indebted to Professor Miriam Butt, University of Konstanz, for her valuable comments and feedback on this work.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Engineering and Technology, Lahore, Pakistan
Toqeer Ehsan
Center for Language Engineering (CLE), KICS/CS&E, University of Engineering and Technology, Lahore, Pakistan
Sarmad Hussain

Corresponding author

Correspondence to Toqeer Ehsan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original version of this article was revised: in the original publication, the column headers of Tables 17 and 18 were displayed incorrectly.

Appendix A: Tag sets

Table 19 The CLE Urdu POS tag set

Table 20 Phrase label set

Table 21 Functional label set

Table 22 Comparison of the CLE-UTB functional labels with the HUTB dependency labels

About this article

Cite this article

Ehsan, T., Hussain, S. Development and evaluation of an Urdu treebank (CLE-UTB) and a statistical parser. Lang Resources & Evaluation 55, 287–326 (2021). https://doi.org/10.1007/s10579-020-09492-7

Published: 18 July 2020
Issue Date: June 2021
DOI: https://doi.org/10.1007/s10579-020-09492-7
