Abstract
The cumulative effort that has gone into developing linguistic resources over the past few decades, for tasks ranging from machine-readable dictionaries to translation systems, is enormous. Such effort is prohibitively expensive for languages outside the (largely) European family. The possibility of building such resources automatically by accessing electronic corpora of such languages is therefore of great interest to those involved in studying these ‘new’ or ‘lesser known’ languages. The main stumbling block to applying these data-driven techniques directly is that most of them require large corpora, which are rarely available for such ‘new’ languages. This paper describes an attempt at setting up a bootstrapping agenda to exploit the scarce corpus resources that may be available at the outset to a researcher concerned with such languages. In particular, it reports on the results of an experiment in using state-of-the-art data-driven techniques to build linguistic resources for Sinhala, a non-European language with virtually no electronic resources.
Work reported herein was carried out at INRIA, France, supported by the European Research Consortium on Informatics and Mathematics (ERCIM).
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Weerasinghe, R. (2002). Bootstrapping the Lexicon Building Process for Machine Translation between ‘New’ Languages. In: Richardson, S.D. (eds) Machine Translation: From Research to Real Users. AMTA 2002. Lecture Notes in Computer Science, vol 2499. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45820-4_18
DOI: https://doi.org/10.1007/3-540-45820-4_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44282-0
Online ISBN: 978-3-540-45820-3
eBook Packages: Springer Book Archive