Knowledge Extraction from Bilingual Corpora

Somers, Harold

doi:10.1007/3-540-48089-7_7

Conference paper
First Online: 01 January 2002

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1714)

Abstract

The use of corpora has become an important issue in information extraction (IE). In this chapter we consider a specific type of corpus, the bilingual parallel corpus, and ways of automatically extracting information from such corpora. This information, “linguistic metaknowledge”, is essential for techniques used in IE such as tokenization, POS-tagging, and morphological analysis. Where we wish to extract information from multilingual texts, we must rely on these linguistic resources being available in several languages. This chapter discusses locating and storing parallel texts, alignment at various levels (sentence, word, phrase), and extraction of bilingual vocabulary and terminology.

Author information

Authors and Affiliations

Department of Language Engineering, UMIST, Manchester, England
Harold Somers

Editor information

Editors and Affiliations

Department of Computer Science, Systems and Production, University of Roma, Tor Vergata, Via di Tor Vergata, I-00133, Roma, Italy
Maria Teresa Pazienza

Cite this paper

Somers, H. (1999). Knowledge Extraction from Bilingual Corpora. In: Pazienza, M.T. (ed.) Information Extraction. SCIE 1999. Lecture Notes in Computer Science, vol 1714. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48089-7_7

DOI: https://doi.org/10.1007/3-540-48089-7_7
Published: 28 March 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66625-7
Online ISBN: 978-3-540-48089-1
eBook Packages: Springer Book Archive
