Abstract
High-quality bilingual dictionaries are very useful, but such resources are rarely available for lower-density language pairs, especially closely related ones. Using a third language to link two other languages is a well-known solution that usually requires only two input bilingual dictionaries, A-B and B-C, to automatically induce a new one, A-C. However, this approach has never been shown to exploit the complete structures of the input bilingual dictionaries, which is a key failing because the dropped meanings degrade the result. This article proposes a constraint approach to pivot-based dictionary induction for the case where languages A and C are closely related. We create constraints from language similarity and model the structures of the input dictionaries as a Boolean optimization problem, which is then formulated within the Weighted Partial Max-SAT framework, an extension of Boolean Satisfiability (SAT). The encoded formulas, all in Conjunctive Normal Form (CNF), the predominant input language of modern SAT/Max-SAT solvers, are evaluated by a solver to produce the target (output) bilingual dictionary. Moreover, we discuss alternative formalizations as a comparison study. We designed a tool that implements our method with the Sat4j library as its default solver, and we conducted an experiment in which the output bilingual dictionary achieved better quality than the one produced by the baseline method.
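As a minimal sketch of the general idea, and not the encoding used in this article, the Python code below composes two toy dictionaries A-B and B-C into candidate A-C pairs and selects among them by solving a tiny Weighted Partial Max-SAT instance. Several parts are assumptions made purely for illustration: the toy words (`a1`, `b1`, `c1`, and so on), the one-to-one hard constraint standing in for a "language similarity" constraint, the soft-clause weights (the number of shared pivot words), and the function names. A brute-force search replaces a real solver such as Sat4j, so it is only practical for very small inputs.

```python
from itertools import product


def pivot_candidates(dict_ab, dict_bc):
    """Compose A-B and B-C: map each (a, c) pair to the pivot words linking it."""
    candidates = {}
    for a, pivots in dict_ab.items():
        for b in pivots:
            for c in dict_bc.get(b, ()):
                candidates.setdefault((a, c), set()).add(b)
    return candidates


def solve_weighted_partial_maxsat(variables, hard, soft):
    """Brute-force Weighted Partial Max-SAT for tiny instances.

    A clause is a list of (variable, polarity) literals. Every hard clause
    must be satisfied; among such assignments, the total weight of the
    satisfied soft clauses is maximized.
    """
    def satisfied(clause, assignment):
        return any(assignment[var] == polarity for var, polarity in clause)

    best, best_weight = None, -1
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if not all(satisfied(clause, assignment) for clause in hard):
            continue
        weight = sum(w for w, clause in soft if satisfied(clause, assignment))
        if weight > best_weight:
            best, best_weight = assignment, weight
    return best, best_weight


def induce(dict_ab, dict_bc):
    """Select A-C pairs from pivot candidates (illustrative constraints only)."""
    candidates = pivot_candidates(dict_ab, dict_bc)
    variables = sorted(candidates)
    # Assumed hard constraint: a word takes part in at most one selected pair,
    # written as the CNF clause (NOT p OR NOT q) for every conflicting p, q.
    hard = [
        [(p, False), (q, False)]
        for i, p in enumerate(variables)
        for q in variables[i + 1:]
        if p[0] == q[0] or p[1] == q[1]
    ]
    # Assumed soft clauses: prefer selecting pairs that share more pivot words.
    soft = [(len(candidates[p]), [(p, True)]) for p in variables]
    assignment, weight = solve_weighted_partial_maxsat(variables, hard, soft)
    return sorted(p for p in variables if assignment[p]), weight


# Hypothetical toy dictionaries: keys are source words, values are translations.
DICT_AB = {"a1": ["b1", "b2"], "a2": ["b2"]}
DICT_BC = {"b1": ["c1"], "b2": ["c1", "c2"]}

pairs, total_weight = induce(DICT_AB, DICT_BC)
print(pairs, total_weight)
```

In this toy instance the pair (`a1`, `c1`) is supported by two pivot words, so the optimum keeps it and pairs `a2` with `c2`, even though naive composition would also propose (`a1`, `c2`) and (`a2`, `c1`).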
Index Terms
- A Constraint Approach to Pivot-Based Bilingual Dictionary Induction