Exploiting Representations from Statistical Machine Translation for Cross-Language Information Retrieval

Published: 28 October 2014

Abstract

This work explores how internal representations of modern statistical machine translation systems can be exploited for cross-language information retrieval. We tackle two core issues that are central to query translation: how to exploit context to generate more accurate translations and how to preserve ambiguity that may be present in the original query, thereby retaining a diverse set of translation alternatives. These two considerations are often in tension since ambiguity in natural language is typically resolved by exploiting context, but effective retrieval requires striking the right balance. We propose two novel query translation approaches: the grammar-based approach extracts translation probabilities from translation grammars, while the decoder-based approach takes advantage of n-best translation hypotheses. Both are context-sensitive, in contrast to a baseline context-insensitive approach that uses bilingual dictionaries for word-by-word translation. Experimental results show that by “opening up” modern statistical machine translation systems, we can access intermediate representations that yield high retrieval effectiveness. By combining evidence from multiple sources, we demonstrate significant improvements over competitive baselines on standard cross-language information retrieval test collections. In addition to effectiveness, the efficiency of our techniques is explored as well.
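
The abstract gives no formulas, so the following is only an illustrative sketch of the general idea of a decoder-based approach combined with a dictionary baseline. It assumes each n-best hypothesis comes with a log score and word-level alignments, and that evidence from two sources is combined by linear interpolation. The function names, the input layout, the example words, and the weight `lam` are all hypothetical and are not taken from the article.

```python
import math
from collections import defaultdict


def nbest_translation_probs(hypotheses):
    """Estimate Pr(target | source) per query token from n-best hypotheses.

    `hypotheses` is a list of (log_score, alignments) pairs, where
    `alignments` is a list of (source_token, target_token) tuples.
    This input layout is an assumption made for illustration.
    """
    # Turn log scores into normalized hypothesis weights (softmax),
    # subtracting the maximum for numerical stability.
    max_score = max(score for score, _ in hypotheses)
    weights = [math.exp(score - max_score) for score, _ in hypotheses]
    total = sum(weights)

    # Accumulate weighted mass for every aligned (source, target) pair.
    mass = defaultdict(lambda: defaultdict(float))
    for weight, (_, alignments) in zip(weights, hypotheses):
        for src, tgt in alignments:
            mass[src][tgt] += weight / total

    # Renormalize so each source token has a proper distribution.
    probs = {}
    for src, targets in mass.items():
        z = sum(targets.values())
        probs[src] = {tgt: m / z for tgt, m in targets.items()}
    return probs


def interpolate(dist_a, dist_b, lam):
    """Linearly interpolate two translation distributions (assumed scheme)."""
    keys = set(dist_a) | set(dist_b)
    return {
        k: lam * dist_a.get(k, 0.0) + (1.0 - lam) * dist_b.get(k, 0.0)
        for k in keys
    }


# Two hypothetical hypotheses for the query "river bank".
hypotheses = [
    (math.log(3.0), [("river", "rivière"), ("bank", "rive")]),
    (math.log(1.0), [("river", "rivière"), ("bank", "banque")]),
]
probs = nbest_translation_probs(hypotheses)

# Hypothetical context-insensitive dictionary distribution for "bank".
dictionary_bank = {"banque": 0.5, "rive": 0.3, "banc": 0.2}
combined = interpolate(probs["bank"], dictionary_bank, lam=0.5)
print(probs["bank"])
print(combined)
```

In this toy setting the n-best list lets context favor one sense of "bank", while the interpolation step keeps the alternatives from the dictionary alive with reduced weight, which mirrors the tension between context and ambiguity described above.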

    Published In

    ACM Transactions on Information Systems, Volume 32, Issue 4
    October 2014
    198 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/2684820

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 October 2014
    Accepted: 01 July 2014
    Revised: 01 May 2014
    Received: 01 December 2013
    Published in TOIS Volume 32, Issue 4

    Qualifiers

    • Research-article
    • Research
    • Refereed
