Abstract
This work aims to develop a representative treebank for Urdu, a South Asian language. Urdu is comparatively under-resourced, and a reliable treebank would have a significant impact on the state of the art in Urdu language processing. For the URDU.KON-TB treebank described here, a POS tagset, a syntactic tagset and a functional tagset are proposed. The treebank is built on an existing 19-million-word corpus of Urdu. Part-of-speech (POS) tagging and annotation of a selected set of sentences from different sub-domains of this corpus are being carried out manually, and the work completed to date is presented here. The hierarchical annotation scheme we adopted combines a phrase structure (PS) with a hybrid dependency structure (HDS).
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Abbas, Q. (2012). Building a Hierarchical Annotated Corpus of Urdu: The URDU.KON-TB Treebank. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28604-9_6
DOI: https://doi.org/10.1007/978-3-642-28604-9_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28603-2
Online ISBN: 978-3-642-28604-9
eBook Packages: Computer Science, Computer Science (R0)