Abstract
The use of corpora has become an important issue in information extraction (IE). In this chapter we consider a specific type of corpus, the bilingual parallel corpus, and ways of automatically extracting information from such corpora. This information, “linguistic metaknowledge”, is essential for techniques used in IE such as tokenization, POS-tagging and morphological analysis. Where we wish to extract information from multilingual texts, we must rely on these linguistic resources being available in several languages. This chapter discusses locating and storing parallel texts, alignment at various levels (sentence, word, phrase), and extraction of bilingual vocabulary and terminology.
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Somers, H. (1999). Knowledge Extraction from Bilingual Corpora. In: Pazienza, M.T. (eds) Information Extraction. SCIE 1999. Lecture Notes in Computer Science, vol 1714. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48089-7_7
DOI: https://doi.org/10.1007/3-540-48089-7_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66625-7
Online ISBN: 978-3-540-48089-1
eBook Packages: Springer Book Archive