Building a Hierarchical Annotated Corpus of Urdu: The URDU.KON-TB Treebank

Abbas, Qaiser

doi:10.1007/978-3-642-28604-9_6

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7181)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

This work aims to develop a representative treebank for Urdu, a South Asian language. Urdu is comparatively under-resourced, and a reliable treebank is expected to have a significant impact on the state of the art in Urdu language processing. For the URDU.KON-TB treebank described here, a POS tagset, a syntactic tagset, and a functional tagset are proposed. The treebank is built on an existing corpus of 19 million Urdu words. Part-of-speech (POS) tagging and annotation of a selected set of sentences from different sub-domains of this corpus are being carried out manually, and the work completed to date is presented here. The hierarchical annotation scheme adopted combines a phrase structure (PS) with a hybrid dependency structure (HDS).
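As a rough illustration of how the three annotation layers named in the abstract (POS, syntactic, functional) could sit on a single hierarchical tree, the following minimal sketch may help. It is not the URDU.KON-TB format: the class `Node`, its fields, the labels `S`, `NP`, `VP`, `N`, `V`, `SUBJ`, `OBJ`, and the placeholder tokens `w1` to `w3` are all hypothetical, and the sketch assumes that dependency-like information is carried as a functional label attached to a phrase-structure node, which may differ from the encoding used in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a hierarchical annotation tree (illustrative only).

    Every label value used in this sketch is a hypothetical placeholder,
    not a tag from the tagsets proposed for URDU.KON-TB.
    """
    label: str                      # syntactic label (phrase) or POS tag (leaf)
    function: Optional[str] = None  # functional label, e.g. a grammatical role
    token: Optional[str] = None     # surface word; set only on leaf nodes
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return self.token is not None

    def to_brackets(self) -> str:
        """Render the subtree as a bracketed string, using LABEL-FUNCTION heads."""
        head = self.label if self.function is None else f"{self.label}-{self.function}"
        if self.is_leaf():
            return f"({head} {self.token})"
        inner = " ".join(child.to_brackets() for child in self.children)
        return f"({head} {inner})"

    def leaves(self) -> List["Node"]:
        """Return leaf nodes in left-to-right order."""
        if self.is_leaf():
            return [self]
        result: List["Node"] = []
        for child in self.children:
            result.extend(child.leaves())
        return result


# Hypothetical verb-final clause with placeholder tokens instead of real Urdu words.
tree = Node("S", children=[
    Node("NP", function="SUBJ", children=[Node("N", token="w1")]),
    Node("VP", children=[
        Node("NP", function="OBJ", children=[Node("N", token="w2")]),
        Node("V", token="w3"),
    ]),
])

print(tree.to_brackets())
print([(leaf.token, leaf.label) for leaf in tree.leaves()])
```

Keeping the functional label in its own field, separate from the syntactic label, lets the phrase-structure layer be read on its own while the role information remains queryable per node.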

Author information

Authors and Affiliations

Department of Linguistics, University of Konstanz, Box D 185, 78457, Konstanz, Germany
Qaiser Abbas

Editor information

Editors and Affiliations

Center for Computing Research (CIC), National Polytechnic Institute (IPN), Mexico City, Mexico
Alexander Gelbukh

Cite this paper

Abbas, Q. (2012). Building a Hierarchical Annotated Corpus of Urdu: The URDU.KON-TB Treebank. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28604-9_6

DOI: https://doi.org/10.1007/978-3-642-28604-9_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28603-2
Online ISBN: 978-3-642-28604-9
eBook Packages: Computer Science, Computer Science (R0)
