Abstract
We describe an approach to creating a small but diverse corpus in English that can be used to elicit information about any target language. The focus of the corpus is on structural information. The resulting bilingual corpus can then be used for natural language processing tasks such as inferring transfer mappings for Machine Translation. The corpus is sufficiently small that a bilingual user can translate and word-align it within a matter of hours. We describe how the corpus is created and how its structural diversity is ensured. We then argue that it is not necessary to introduce a large amount of redundancy into the corpus. This is shown by creating an increasingly redundant corpus and observing that the information gained converges as redundancy increases.
This research was funded in part by NSF grant number IIS-0121631.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Probst, K., Lavie, A. (2004). A Structurally Diverse Minimal Corpus for Eliciting Structural Mappings Between Languages. In: Frederking, R.E., Taylor, K.B. (eds) Machine Translation: From Real Users to Research. AMTA 2004. Lecture Notes in Computer Science, vol 3265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30194-3_24
DOI: https://doi.org/10.1007/978-3-540-30194-3_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23300-8
Online ISBN: 978-3-540-30194-3
eBook Packages: Springer Book Archive