A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-Parallel Corpora

Fung, Pascale

doi:10.1007/3-540-49478-2_1

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1529)

Included in the following conference series:

Conference of the Association for Machine Translation in the Americas

Abstract

We address two problems in the statistical extraction of bilingual lexicons: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work and contributions in relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons from noisy parallel corpora, based on the arrival distances of words in those corpora. Using DKvec on noisy parallel corpora in English/Japanese and English/Chinese, our evaluations show 55.35% precision from a small corpus and 89.93% precision from a larger corpus. Our major contribution is the extraction of bilingual lexicons from non-parallel corpora. We present a first such result in this area, from a new method, Convec. Convec is based on the context information of a word to be translated. We show 30% to 76% precision when the top-one to top-20 translation candidates are considered. Most of the top-20 candidates are either collocations or words related to the correct translation. Since non-parallel corpora contain many more polysemous words, many-to-many translations, and different lexical items in the two languages, we conclude that the output from Convec is reasonable and useful.
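As a rough illustration of the two ideas named in the abstract, the sketch below computes arrival distances (the gaps between successive occurrences of a word) and scores candidate translations by top-k precision. It is a minimal sketch under stated assumptions, not the chapter's implementation: the abstract does not say how arrival-distance vectors are compared, so dynamic time warping with an absolute-difference cost is assumed here, and the function names `arrival_distances`, `dtw_cost`, and `top_k_precision` are hypothetical.

```python
from typing import Dict, List, Sequence


def arrival_distances(tokens: Sequence[str], word: str) -> List[int]:
    """Gaps between successive occurrences of `word` in a token stream."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


def dtw_cost(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping cost between two vectors.

    Assumption: the local cost is the absolute difference; a lower total
    suggests the two words recur at similar intervals in their texts.
    """
    if not a or not b:
        return float("inf")
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            curr[j] = abs(x - y) + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[len(b)]


def top_k_precision(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of source words whose gold translation is among the first k candidates."""
    if not ranked:
        return 0.0
    hits = sum(1 for word, cands in ranked.items() if gold.get(word) in cands[:k])
    return hits / len(ranked)
```

For example, a word occurring at positions 0, 2, and 5 has arrival distances `[2, 3]`, and two words with identical arrival-distance vectors have a warping cost of zero. Widening k from 1 to 20 can only keep or raise `top_k_precision`, which matches the way the abstract reports a range of precision figures over top-one to top-20 candidates.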

Author information

Authors and Affiliations

Human Language Technology Center, Department of Electrical and Electronic Engineering, Hong Kong University of Science and Technology (HKUST), Clear Water Bay, Hong Kong
Pascale Fung

Editor information

Editors and Affiliations

Computing Research Lab, New Mexico State University, Box 30001 / 3CRL, Las Cruces, NM, 88003, USA
David Farwell
SYSTRAN Inc., 7855 Fay Avenue, Suite 300, P.O. Box 907, La Jolla, CA, 92037, USA
Laurie Gerber
Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Marina del Rey, CA, 90292-6695, USA
Eduard Hovy

About this paper

Cite this paper

Fung, P. (1998). A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-Parallel Corpora. In: Farwell, D., Gerber, L., Hovy, E. (eds) Machine Translation and the Information Soup. AMTA 1998. Lecture Notes in Computer Science, vol 1529. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49478-2_1

DOI: https://doi.org/10.1007/3-540-49478-2_1
Published: 24 September 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65259-5
Online ISBN: 978-3-540-49478-2
eBook Packages: Springer Book Archive
