A Constraint Approach to Pivot-Based Bilingual Dictionary Induction

Authors:
Mairidan Wushouer, Department of Social Informatics, Kyoto University
Donghui Lin, Department of Social Informatics, Kyoto University
Toru Ishida, Department of Social Informatics, Kyoto University
Katsutoshi Hirayama, Graduate School of Maritime Sciences, Kobe University

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 15, Issue 1, Article No. 4, pp. 1–26. https://doi.org/10.1145/2723144

Published: 21 November 2015

Abstract

High-quality bilingual dictionaries are very useful, but such resources are rarely available for lower-density language pairs, especially for those that are closely related. Using a third (pivot) language to link two other languages is a well-known solution and usually requires only two input bilingual dictionaries, A-B and B-C, to automatically induce a new one, A-C. This approach, however, has never been shown to exploit the complete structures of the input bilingual dictionaries, and this is a key failing because the dropped meanings negatively influence the result. This article proposes a constraint approach to pivot-based dictionary induction for the case where languages A and C are closely related. We create constraints from language similarity and model the structures of the input dictionaries as a Boolean optimization problem, which is then formulated within the Weighted Partial Max-SAT framework, an extension of Boolean Satisfiability (SAT). The encoded formulas are in CNF (Conjunctive Normal Form), the predominant input language of modern SAT/Max-SAT solvers, and are evaluated by a solver to produce the target (output) bilingual dictionary. Moreover, we discuss alternative formalizations as a comparison study. We designed a tool that implements our method, using the Sat4j library as the default solver, and conducted an experiment in which the output bilingual dictionary achieved better quality than that of the baseline method.
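
To make the idea of a Weighted Partial Max-SAT formulation more concrete, the following is a minimal sketch and not the article's actual encoding. Everything in it is an assumption made for illustration: the toy dictionaries and word names (`a1`, `b1`, `c1`, ...), the single hard constraint (each A word and each C word takes part in at most one pair), and the soft-clause weight (the number of shared pivot words). A brute-force search over truth assignments stands in for a real solver such as Sat4j.

```python
from itertools import product

# Hypothetical toy input dictionaries (word -> set of pivot-language translations).
dict_ab = {"a1": {"b1", "b2"}, "a2": {"b2"}}
dict_cb = {"c1": {"b1", "b2"}, "c2": {"b2", "b3"}}


def candidate_pairs(dict_ab, dict_cb):
    """Naive pivot join: pair A and C words that share at least one pivot word.

    The weight of a pair is the number of shared pivot words (an assumed score).
    """
    pairs = {}
    for a, a_pivots in dict_ab.items():
        for c, c_pivots in dict_cb.items():
            shared = len(a_pivots & c_pivots)
            if shared > 0:
                pairs[(a, c)] = shared
    return pairs


def violates_one_to_one(selected):
    """Hard constraint (assumed): no A word or C word appears in two selected pairs."""
    a_words = [a for a, _ in selected]
    c_words = [c for _, c in selected]
    return len(set(a_words)) < len(a_words) or len(set(c_words)) < len(c_words)


def solve_brute_force(pairs):
    """Exhaustive search with one Boolean variable per candidate pair.

    Hard clauses must hold; among feasible assignments, the total weight of
    satisfied soft unit clauses (one per candidate pair) is maximized.
    """
    keys = sorted(pairs)
    best, best_weight = [], -1
    for assignment in product([False, True], repeat=len(keys)):
        selected = [k for k, on in zip(keys, assignment) if on]
        if violates_one_to_one(selected):
            continue
        weight = sum(pairs[k] for k in selected)
        if weight > best_weight:
            best, best_weight = selected, weight
    return best, best_weight


pairs = candidate_pairs(dict_ab, dict_cb)
induced, total = solve_brute_force(pairs)
print(induced, total)
```

On this toy input the naive pivot join proposes four candidate pairs, and the hard constraint prunes them to the two pairs with the highest combined weight. Exhaustive search is exponential in the number of candidates, so a realistic setting would hand the same hard and soft clauses to a Max-SAT solver instead.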

Index Terms

1. Computing methodologies
  1. Machine learning

Published in
ACM Transactions on Asian and Low-Resource Language Information Processing Volume 15, Issue 1
January 2016
89 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/2847552
Editor:
Richard Sproat
Google, Inc., USA
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 21 November 2015
- Accepted: 1 January 2015
- Revised: 1 October 2014
- Received: 1 August 2013
Author Tags
Bilingual dictionary induction
Weighted Partial Max-SAT
constraint satisfaction problem
low-resource languages
pivot language
Qualifiers
- research-article
- Research
- Refereed