Knowledge Extraction from Bilingual Corpora

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1714)

Abstract

The use of corpora has become an important issue in information extraction (IE). In this chapter we consider a specific type of corpus, the bilingual parallel corpus, and ways of automatically extracting information from such corpora. This information, “linguistic metaknowledge”, is essential for techniques used in IE such as tokenization, POS tagging, and morphological analysis. To extract information from multilingual texts, we must rely on these linguistic resources being available in several languages. This chapter discusses locating and storing parallel texts, alignment at various levels (sentence, word, phrase), and extraction of bilingual vocabulary and terminology.
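
As a rough illustration of the last of these steps, bilingual vocabulary extraction, the sketch below scores candidate word pairs from sentence-aligned text with the Dice coefficient over sentence-level co-occurrence. It is only a minimal sketch: it assumes the corpus is already aligned at sentence level and tokenizes on whitespace, and the function name `dice_lexicon`, the `min_score` threshold, and the toy sentence pairs are hypothetical, not taken from the chapter.

```python
from collections import Counter
from itertools import product


def dice_lexicon(pairs, min_score=0.5):
    """Score candidate translations in sentence-aligned text.

    pairs: iterable of (source_sentence, target_sentence) strings,
    assumed to be already aligned at sentence level.
    Returns {(source_word, target_word): dice_score} for scores >= min_score.
    """
    src_count = Counter()  # number of aligned pairs containing each source word
    tgt_count = Counter()  # number of aligned pairs containing each target word
    joint = Counter()      # number of aligned pairs containing both words

    for src, tgt in pairs:
        # Naive whitespace tokenization (an assumption of this sketch).
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update(product(src_words, tgt_words))

    scores = {}
    for (s, t), both in joint.items():
        # Dice coefficient: 2 * |A and B| / (|A| + |B|)
        score = 2 * both / (src_count[s] + tgt_count[t])
        if score >= min_score:
            scores[(s, t)] = score
    return scores


# Toy data, invented for illustration only.
pairs = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]
lexicon = dice_lexicon(pairs, min_score=0.9)
for (s, t), score in sorted(lexicon.items()):
    print(f"{s} -> {t}: {score:.2f}")
```

On a corpus this small the scores are not meaningful; the point is only the shape of the computation, where words that occur in the same aligned sentence pairs score highly.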

Copyright information

© 1999 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Somers, H. (1999). Knowledge Extraction from Bilingual Corpora. In: Pazienza, M.T. (ed.) Information Extraction. SCIE 1999. Lecture Notes in Computer Science, vol 1714. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48089-7_7

  • DOI: https://doi.org/10.1007/3-540-48089-7_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-66625-7

  • Online ISBN: 978-3-540-48089-1

  • eBook Packages: Springer Book Archive
