Abstract
We present two problems in the statistical extraction of bilingual lexicons: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work and contributions in relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons from noisy parallel corpora, based on the arrival distances of words in those corpora. Using DKvec on noisy parallel corpora in English/Japanese and English/Chinese, our evaluations show 55.35% precision from a small corpus and 89.93% precision from a larger corpus. Our major contribution is the extraction of bilingual lexicons from non-parallel corpora. We present a first such result in this area, from a new method, Convec. Convec is based on the context information of the word to be translated. We show 30% to 76% precision when the top-one to top-20 translation candidates are considered. Most of the top-20 candidates are either collocations or words related to the correct translation. Since non-parallel corpora contain many more polysemous words, many-to-many translations, and different lexical items in the two languages, we conclude that the output from Convec is reasonable and useful.
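The top-one to top-20 precision figures above can be read as follows: a source word counts as correctly translated at rank N if any of its N highest-ranked candidates is an accepted translation. The sketch below illustrates one way such a score might be computed. It is not the paper's evaluation code; the function name `top_n_precision`, the data layout, and the toy word lists are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def top_n_precision(
    ranked: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    n: int,
) -> float:
    """Fraction of source words with a correct translation among their top-n candidates."""
    if not ranked:
        return 0.0
    hits = sum(
        1
        for word, candidates in ranked.items()
        # A hit: the first n candidates overlap the accepted translations.
        if gold.get(word, set()) & set(candidates[:n])
    )
    return hits / len(ranked)


# Invented toy data, for illustration only (not taken from the paper).
ranked = {
    "bank": ["banque", "rive", "argent"],
    "flu": ["toux", "grippe", "rhume"],
    "law": ["juge", "tribunal", "avocat"],
}
gold = {"bank": {"banque"}, "flu": {"grippe"}, "law": {"loi"}}

for n in (1, 2, 3):
    print(n, round(top_n_precision(ranked, gold, n), 3))
```

In this toy data, "law" only retrieves related words ("juge", "tribunal"), never the accepted translation, which mirrors the abstract's observation that many candidates are collocations or words related to the correct translation.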
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fung, P. (1998). A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-Parallel Corpora. In: Farwell, D., Gerber, L., Hovy, E. (eds) Machine Translation and the Information Soup. AMTA 1998. Lecture Notes in Computer Science, vol 1529. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49478-2_1
DOI: https://doi.org/10.1007/3-540-49478-2_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65259-5
Online ISBN: 978-3-540-49478-2
eBook Packages: Springer Book Archive