Brains, not brawn: The use of “smart” comparable corpora in bilingual terminology mining

Published: 04 October 2008

Abstract

Current research in text mining favors the quantity of texts over their representativeness. For bilingual terminology mining, however, large comparable corpora are not available for many language pairs. More importantly, because terms are defined relative to a specific domain with a restricted register, the representativeness of the corpus can be expected to matter more than its size. Our hypothesis, therefore, is that corpus representativeness outweighs corpus size and ensures the quality of the acquired terminological resources. This article tests this hypothesis on a French-Japanese bilingual term extraction task. To demonstrate the importance of discourse type as a characteristic of comparable corpora, we used a state-of-the-art multilingual terminology mining chain composed of two extraction programs, one for each language, and an alignment program. We evaluated the candidate translations against a reference list and found that taking discourse type into account yielded better-quality candidate translations even when the corpus size was halved.

Cited By

  • (2022) Identification d’occurrences de candidats termes dans des articles scientifiques. Corela. DOI: 10.4000/corela.14874. Online publication date: 13-Jun-2022
  • (2019) A Summary of Studies on Bilingual Comparable Corpus. 2019 International Conference on Smart Grid and Electrical Automation (ICSGEA), 595-599. DOI: 10.1109/ICSGEA.2019.00138. Online publication date: Aug-2019
  • (2019) Reproduction, replication, analysis and adaptation of a term alignment approach. Language Resources and Evaluation. DOI: 10.1007/s10579-019-09477-1. Online publication date: 18-Nov-2019

    Published In

    ACM Transactions on Speech and Language Processing, Volume 7, Issue 1
    August 2010
    23 pages
    ISSN:1550-4875
    EISSN:1550-4883
    DOI:10.1145/1839478
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Accepted: 01 June 2010
    Revised: 01 February 2010
    Received: 01 January 2009
    Published: 04 October 2008
    Published in TSLP Volume 7, Issue 1

    Author Tags

    1. Terminology mining
    2. comparable corpora
    3. lexical alignment

    Qualifiers

    • Research-article
    • Research
    • Refereed
