Compositionality and lexical alignment of multi-word terms

Language Resources and Evaluation

Abstract

The automatic compilation of bilingual term lists from specialized comparable corpora using lexical alignment has been successful for single-word terms (SWTs), but remains disappointing for multi-word terms (MWTs). The main reported problems are the low frequency of MWTs and the variability of their syntactic structures in the source and target languages. This paper defines a general framework for the lexical alignment of MWTs from comparable corpora that combines a compositional translation process with the standard lexical context analysis. Because the compositional method, which relies on translating the individual lexical items of a term, is restrictive, we introduce an extended compositional method that bridges the gap between MWTs of different syntactic structures through morphological links. We evaluated both compositional methods on a French–Japanese alignment task. The results show a significant improvement in the translation of MWTs and advocate further morphological analysis in lexical alignment.
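The compositional translation idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `compositional_translations`, the toy dictionary, and the toy target term list are all hypothetical, and French–English pairs are used instead of French–Japanese purely to keep the example readable. The sketch translates each word of a source MWT with a bilingual dictionary, generates every combination and word order, and keeps only the combinations attested in a list of target candidate terms.

```python
from itertools import permutations, product


def compositional_translations(source_term, dictionary, target_terms):
    """Return the target terms that can be built word by word from source_term.

    source_term  -- tuple of source-language lemmas (hypothetical input format)
    dictionary   -- dict mapping a source lemma to a list of target lemmas
    target_terms -- set of tuples, the candidate terms found in the target corpus
    """
    # Look up every component; the method fails if any component is unknown.
    options = [dictionary.get(word, []) for word in source_term]
    if not all(options):
        return []

    found = []
    # Try every combination of component translations, in every word order,
    # and keep only those attested among the target candidate terms.
    for combo in product(*options):
        for ordering in permutations(combo):
            if ordering in target_terms and ordering not in found:
                found.append(ordering)
    return found


# Toy data, assumed for illustration only.
toy_dictionary = {
    "fatigue": ["fatigue", "tiredness"],
    "chronique": ["chronic"],
}
toy_target_terms = {("chronic", "fatigue"), ("muscle", "fatigue")}

result = compositional_translations(
    ("fatigue", "chronique"), toy_dictionary, toy_target_terms
)
print(result)
```

The sketch also shows why the abstract calls the method restrictive: it returns nothing as soon as one component is missing from the dictionary, or when the target term does not have the same number of components as the source term.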

Notes

  1. http://www.kanji.free.fr/.

  2. http://www.quebec-japon.com/lexique/index.php?a=index&d=25.

  3. http://www.dico.fj.free.fr/index.php.

  4. http://www.quebec-japon.com/lexique/index.php?a=index&d=3.

  5. http://www.sciences.univ-nantes.fr/info/perso/permanents/daille/ and release for Mandriva Linux.

  6. http://www.cl.cs.okayama-u.ac.jp/rsc/jacabit/.

  7. The symbols for part-of-speech tags are Adj (Adjective), N (Noun), Pref (Prefix), Prep (Preposition), and Suff (Suffix).

  8. http://www.atilf.fr/winbrill/.

  9. http://www.univ-nancy2.fr/pers/namer/.

  10. http://www.chasen-legacy.sourceforge.jp/.

  11. Precision is the number of correct Japanese translations divided by the number of Japanese translations.

Acknowledgement

This work was supported by the French National Research Agency grant ANR-08-CORD-013.

Author information

Corresponding author

Correspondence to Béatrice Daille.

About this article

Cite this article

Morin, E., Daille, B. Compositionality and lexical alignment of multi-word terms. Lang Resources & Evaluation 44, 79–95 (2010). https://doi.org/10.1007/s10579-009-9098-8

  • DOI: https://doi.org/10.1007/s10579-009-9098-8
