Abstract
The automatic compilation of bilingual lists of terms from specialized comparable corpora using lexical alignment has been successful for single-word terms (SWTs), but remains disappointing for multi-word terms (MWTs). The main reported problems are the low frequency of MWTs and the variability of their syntactic structures in the source and target languages. This paper defines a general framework for the lexical alignment of MWTs from comparable corpora that combines a compositional translation process with the standard lexical context analysis. Because the compositional method, which is based on the translation of lexical items, is restrictive, we introduce an extended compositional method that bridges the gap between MWTs of different syntactic structures through morphological links. We experimented with the two compositional methods on a French–Japanese alignment task. The results show a significant improvement in the translation of MWTs and argue for further morphological analysis in lexical alignment.
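As an illustration of the compositional translation idea summarized above, here is a minimal sketch, assuming a toy setting: each component of a source MWT is looked up in a bilingual dictionary, the translations are recombined in every order, and only combinations attested in a list of target-language terms are kept. The function name `compositional_candidates`, the toy dictionary, and the toy target term list are hypothetical and are not taken from the paper; the extended method based on morphological links is not covered.

```python
from itertools import permutations, product


def compositional_candidates(components, dictionary, target_terms):
    """Return target terms that can be built from component translations.

    components   -- list of source-language words forming the MWT
    dictionary   -- maps a source word to a list of target translations
    target_terms -- set of terms attested on the target side
    """
    # Look up each component; give up if any has no dictionary entry.
    options = []
    for word in components:
        translations = dictionary.get(word)
        if not translations:
            return []
        options.append(translations)

    found = []
    # Try every choice of translations and every ordering of the parts,
    # since word order can differ between source and target languages.
    for choice in product(*options):
        for order in permutations(choice):
            candidate = "".join(order)
            if candidate in target_terms and candidate not in found:
                found.append(candidate)
    return found


# Toy data for illustration only (hypothetical, not from the paper).
toy_dictionary = {"fatigue": ["疲労"], "chronique": ["慢性"]}
toy_target_terms = {"慢性疲労", "糖尿病"}

print(compositional_candidates(["fatigue", "chronique"],
                               toy_dictionary, toy_target_terms))
```

The sketch returns nothing as soon as one component is missing from the dictionary, which is one way to see why a purely word-for-word compositional method is restrictive.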
Notes
http://www.sciences.univ-nantes.fr/info/perso/permanents/daille/ and release for Mandriva Linux.
The symbols for part-of-speech tags are Adj (Adjective), N (Noun), Pref (Prefix), Prep (Preposition), and Suff (Suffix).
Precision is the number of correct Japanese translations (# correct JP trans.) divided by the number of Japanese translations (# JP trans.).
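Written as a formula, using the column names from the note above:

```latex
\[
\text{Precision} = \frac{\#\,\text{correct JP trans.}}{\#\,\text{JP trans.}}
\]
```

For instance, with purely hypothetical counts of 30 correct translations out of 40 Japanese translations, the precision would be 30/40 = 0.75.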
Acknowledgement
This work was supported by the French National Research Agency grant ANR-08-CORD-013.
Cite this article
Morin, E., Daille, B. Compositionality and lexical alignment of multi-word terms. Lang Resources & Evaluation 44, 79–95 (2010). https://doi.org/10.1007/s10579-009-9098-8
DOI: https://doi.org/10.1007/s10579-009-9098-8