DOI: 10.1145/3539618.3592006
Short paper

LAPCA: Language-Agnostic Pretraining with Cross-Lingual Alignment

Published: 18 July 2023

ABSTRACT

Data collection and mining is a crucial bottleneck for cross-lingual information retrieval (CLIR). While previous works used machine translation and iterative training, we present a novel approach to cross-lingual pretraining called LAPCA (language-agnostic pretraining with cross-lingual alignment). We train the LAPCA-LM model based on XLM-RoBERTa, which significantly improves cross-lingual knowledge transfer for question answering and sentence retrieval on, e.g., the XOR-TyDi and Mr. TyDi datasets; in the zero-shot cross-lingual scenario it performs on par with supervised methods, outperforming many of them on MKQA.
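The abstract does not spell out LAPCA's training objective, so as illustration only, here is a generic in-batch contrastive (InfoNCE) alignment loss of the kind commonly used to pull parallel sentences from different languages toward shared representations. All names, embeddings, and numbers below are hypothetical, not taken from the paper:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense sentence embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_contrastive_loss(src_embs, tgt_embs, temperature=0.05):
    """Mean InfoNCE loss over a batch of parallel sentence pairs:
    each source embedding must identify its own translation among
    all target embeddings in the batch (in-batch negatives)."""
    losses = []
    for i, src in enumerate(src_embs):
        logits = [cosine(src, tgt) / temperature for tgt in tgt_embs]
        # -log softmax probability at the positive (aligned) index i
        log_z = math.log(sum(math.exp(l) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy check: correctly aligned pairs should incur a lower loss
# than the same pairs shuffled out of alignment.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt_aligned = [[0.9, 0.1], [0.1, 0.9]]
tgt_shuffled = [tgt_aligned[1], tgt_aligned[0]]
loss_aligned = in_batch_contrastive_loss(src, tgt_aligned)
loss_shuffled = in_batch_contrastive_loss(src, tgt_shuffled)
```

Minimizing such a loss rewards embeddings in which a sentence and its translation are nearest neighbors across languages, which is the general property a cross-lingually aligned retriever needs; LAPCA's actual objective and data pipeline are described in the full paper, not here.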


Published in

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023, 3567 pages
ISBN: 9781450394086
DOI: 10.1145/3539618
Copyright © 2023 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance rate: 792 of 3,983 submissions, 20%