research-article

A Constraint Approach to Pivot-Based Bilingual Dictionary Induction

Published: 21 November 2015

Abstract

High-quality bilingual dictionaries are very useful, but such resources are rarely available for lower-density language pairs, especially for those that are closely related. Using a third (pivot) language to link two other languages is a well-known solution that usually requires only two input bilingual dictionaries, A-B and B-C, to automatically induce a new one, A-C. This approach, however, has never been demonstrated to utilize the complete structures of the input bilingual dictionaries, which is a key failing because the dropped meanings negatively influence the result. This article proposes a constraint approach to pivot-based dictionary induction for the case where languages A and C are closely related. We create constraints from language similarity and model the structures of the input dictionaries as a Boolean optimization problem, which is then formulated within the Weighted Partial Max-SAT framework, an extension of Boolean Satisfiability (SAT). The encoded formulas are expressed in Conjunctive Normal Form (CNF), the predominant input language of modern SAT/Max-SAT solvers, and are evaluated by a solver to produce the target (output) bilingual dictionary. We also discuss alternative formalizations for comparison. We designed a tool that implements our method with the Sat4j library as the default solver, and conducted an experiment in which the output bilingual dictionary achieved better quality than that of the baseline method.
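
The general idea of the abstract can be illustrated with a small sketch. It is not the article's encoding: the one-to-one hard constraint, the use of shared-pivot counts as soft-clause weights, the toy words, and all function names (`pivot_candidates`, `solve_wpmaxsat`, `induce`) are assumptions made here for illustration, and a brute-force search stands in for a real solver such as Sat4j. Clauses use DIMACS-style integer literals.

```python
from itertools import product


def pivot_candidates(dict_ab, dict_bc):
    """Map each (a, c) pair reachable through B to its set of shared pivot words."""
    candidates = {}
    for a, pivots in dict_ab.items():
        for b in pivots:
            for c in dict_bc.get(b, ()):
                candidates.setdefault((a, c), set()).add(b)
    return candidates


def solve_wpmaxsat(num_vars, hard, soft):
    """Brute-force Weighted Partial Max-SAT (only practical for tiny inputs).

    A clause is a list of nonzero ints: k means variable k is true,
    -k means variable k is false. Every hard clause must hold; `soft` is a
    list of (weight, clause) and the total weight of violated soft clauses
    is minimized. Returns (cost, assignment), or None if the hard clauses
    cannot all be satisfied.
    """
    def holds(clause, assign):
        return any(assign[abs(lit) - 1] == (lit > 0) for lit in clause)

    best = None
    for assign in product([False, True], repeat=num_vars):
        if not all(holds(clause, assign) for clause in hard):
            continue
        cost = sum(w for w, clause in soft if not holds(clause, assign))
        if best is None or cost < best[0]:
            best = (cost, assign)
    return best


def induce(dict_ab, dict_bc):
    """Select A-C pairs under an assumed one-to-one constraint."""
    cands = pivot_candidates(dict_ab, dict_bc)
    pairs = sorted(cands)
    var = {pair: i + 1 for i, pair in enumerate(pairs)}

    # Hard clauses (assumption): two selected pairs may not share a word.
    hard = []
    for i, p in enumerate(pairs):
        for q in pairs[i + 1:]:
            if p[0] == q[0] or p[1] == q[1]:
                hard.append([-var[p], -var[q]])

    # Soft unit clauses (assumption): weight = number of shared pivot words.
    soft = [(len(cands[p]), [var[p]]) for p in pairs]

    # All-false satisfies every hard clause here, so a solution always exists.
    _, assign = solve_wpmaxsat(len(pairs), hard, soft)
    return {p for p in pairs if assign[var[p] - 1]}


# Toy input dictionaries (hypothetical words).
dict_ab = {"a1": {"b1", "b2"}, "a2": {"b2"}}
dict_bc = {"b1": {"c1"}, "b2": {"c1", "c2"}}
induced = induce(dict_ab, dict_bc)
```

In this toy input, the pair (a1, c1) is supported by two pivot words while every other candidate is supported by one, so the optimum keeps (a1, c1) and (a2, c2). A naive transitive join would keep all four candidate pairs.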


    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 15, Issue 1
      January 2016
      89 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/2847552
      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 21 November 2015
      • Accepted: 1 January 2015
      • Revised: 1 October 2014
      • Received: 1 August 2013

      Qualifiers

      • research-article
      • Research
      • Refereed
