ABSTRACT
Data collection and mining is a crucial bottleneck for cross-lingual information retrieval (CLIR). While previous work relied on machine translation and iterative training, we present a novel approach to cross-lingual pretraining called LAPCA (language-agnostic pretraining with cross-lingual alignment). We train the LAPCA-LM model, based on XLM-RoBERTa and LaBSE, which significantly improves cross-lingual knowledge transfer for question answering and sentence retrieval on, e.g., the XOR-TyDi and Mr. TyDi datasets, and in the zero-shot cross-lingual scenario performs on par with supervised methods, outperforming many of them on MKQA.
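The abstract does not spell out the alignment objective, but pretraining of this kind is commonly formulated as an InfoNCE-style contrastive loss over parallel sentence pairs with in-batch negatives: each sentence embedding is pulled toward its translation and pushed away from the other targets in the batch. A minimal NumPy sketch under that assumption (the function name, batch layout, and temperature value are illustrative, not taken from the paper):

```python
import numpy as np

def contrastive_alignment_loss(src, tgt, temperature=0.05):
    """InfoNCE-style loss over a batch of parallel sentence embeddings.

    src, tgt: (batch, dim) arrays; row i of src is parallel to row i of tgt.
    The true pair sits on the diagonal of the similarity matrix; all other
    rows of tgt act as in-batch negatives.
    """
    # L2-normalize so dot products are cosine similarities
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    logits = src @ tgt.T / temperature            # (batch, batch) similarities
    # Cross-entropy with the diagonal (true pairs) as the labels
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

With well-aligned encoders the loss approaches zero, while unrelated pairs yield roughly log(batch_size); driving this gap is what makes the resulting representations language-agnostic enough for zero-shot retrieval.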
- Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the Cross-lingual Transferability of Monolingual Representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 4623--4637.
- Akari Asai, Jungo Kasai, Jonathan H. Clark, Kenton Lee, Eunsol Choi, and Hannaneh Hajishirzi. 2021a. XOR QA: Cross-lingual Open-Retrieval Question Answering. In Proceedings of NAACL-HLT 2021.
- Akari Asai, Xinyan Yu, Jungo Kasai, and Hannaneh Hajishirzi. 2021b. One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval. In Proceedings of NeurIPS 2021.
- Luiz Henrique Bonifacio, Israel Campiotti, Roberto de Alencar Lotufo, and Rodrigo Nogueira. 2021. mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset. CoRR, Vol. abs/2108.13897. https://arxiv.org/abs/2108.13897
- Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages. Transactions of the Association for Computational Linguistics, Vol. 8 (2020), 454--470.
- Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Édouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 8440--8451.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://doi.org/10.48550/ARXIV.1810.04805
- Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, and Armand Joulin. 2020. Beyond English-Centric Multilingual Machine Translation. https://doi.org/10.48550/ARXIV.2010.11125
- Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, and Armand Joulin. 2022. Beyond English-Centric Multilingual Machine Translation. J. Mach. Learn. Res., Vol. 22, 1, Article 107 (July 2022), 48 pages.
- Luyu Gao and Jamie Callan. 2021. Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval. https://arxiv.org/abs/2108.05540
- Daniel Gillick, Sayali Kulkarni, Larry Lansing, Alessandro Presta, Jason Baldridge, Eugene Ie, and Diego García-Olano. 2019. Learning Dense Representations for Entity Retrieval. CoRR, Vol. abs/1909.10506. http://arxiv.org/abs/1909.10506
- Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization. https://doi.org/10.48550/ARXIV.2003.11080
- Gautier Izacard and Edouard Grave. 2020. Distilling Knowledge from Reader to Retriever for Question Answering. arXiv:2012.04584 [cs.CL].
- Gautier Izacard and Édouard Grave. 2021. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 874--880.
- Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1601--1611.
- Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 6769--6781.
- Phillip Keung, Julian Salazar, Yichao Lu, and Noah A. Smith. 2020. Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings. https://doi.org/10.48550/ARXIV.2010.07761
- Jamie Kiros. 2020. Contextual Lensing of Universal Sentence Representations. https://doi.org/10.48550/ARXIV.2002.08866
- Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics, Vol. 7 (2019), 453--466.
- Chia-Hsuan Lee and Hung-Yi Lee. 2019. Cross-lingual Transfer Learning for Question Answering. arXiv preprint arXiv:1907.06042 (2019).
- Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021a. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401 [cs.CL].
- Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich Küttler, Aleksandra Piktus, Pontus Stenetorp, and Sebastian Riedel. 2021b. PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them. arXiv:2102.07033 [cs.CL].
- Yulong Li, Martin Franz, Md Arafat Sultan, Bhavani Iyer, Young-Suk Lee, and Avirup Sil. 2021. Learning Cross-Lingual IR from an English Retriever. https://doi.org/10.48550/ARXIV.2112.08185
- Shayne Longpre, Yi Lu, and Joachim Daiber. 2020. MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering. Transactions of the Association for Computational Linguistics (2020).
- Meryem M'hamdi, Doo Soon Kim, Franck Dernoncourt, Trung Bui, Xiang Ren, and Jonathan May. 2021. X-METRA-ADA: Cross-lingual Meta-Transfer Learning Adaptation to Natural Language Understanding and Question Answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 3617--3632.
- Barlas Oğuz, Kushal Lakhotia, Anchit Gupta, Patrick Lewis, Vladimir Karpukhin, Aleksandra Piktus, Xilun Chen, Sebastian Riedel, Wen-tau Yih, Sonal Gupta, and Yashar Mehdad. 2021. Domain-matched Pre-training Tasks for Dense Retrieval. https://doi.org/10.48550/ARXIV.2107.13602
- Xiao Pan, Mingxuan Wang, Liwei Wu, and Lei Li. 2021. Contrastive Learning for Many-to-many Multilingual Neural Machine Translation. ArXiv, Vol. abs/2105.09501 (2021).
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311--318.
- Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. 2021. RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. ArXiv, Vol. abs/2010.08191 (2021).
- Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv:1606.05250 [cs.CL].
- Nikita Sorokin, Dmitry Abulkhanov, Irina Piontkovskaya, and Valentin Malykh. 2022. Ask Me Anything in Your Native Language. In Proceedings of NAACL 2022.
- Chih-chan Tien and Shane Steinert-Threlkeld. 2021. Bilingual Alignment Transfers to Multilingual Alignment for Unsupervised Parallel Text Mining. https://doi.org/10.48550/ARXIV.2104.07642
- Chih-chan Tien and Shane Steinert-Threlkeld. 2022. Bilingual Alignment Transfers to Multilingual Alignment for Unsupervised Parallel Text Mining. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 8696--8706. https://doi.org/10.18653/v1/2022.acl-long.595
- Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Adversarial Training for Unsupervised Bilingual Lexicon Induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, 1959--1970. https://doi.org/10.18653/v1/P17-1179
- Xinyu Zhang, Xueguang Ma, Peng Shi, and Jimmy Lin. 2021. Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval. https://doi.org/10.48550/ARXIV.2108.08787