C$^2$LIR: Continual Cross-Lingual Transfer for Low-Resource Information Retrieval

Lee, Jaeseong; Lee, Dohyeon; Kim, Jongho; Hwang, Seung-won

doi:10.1007/978-3-031-28238-6_37

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13981)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

This paper proposes a method to train an information retrieval (IR) model for a low-resource language (LRL) with a small corpus and no parallel sentences. Although neural IR models based on pretrained language models (PLMs) have shown high performance in high-resource languages (HRLs), building a PLM for an LRL is challenging. We propose C$^2$LIR, a method to build a high-performing neural IR model for an LRL, with dictionary-based pretraining objectives for cross-lingual transfer from an HRL. Experiments on monolingual and cross-lingual IR in diverse low-resource scenarios show the effectiveness and data efficiency of C$^2$LIR.

J. Lee and D. Lee—Equal Contribution.

Notes

1. The XOR-Retrieve train set contains just 2.5k LRL queries, with an average query length of less than 10 words. Mr. TyDi contains LRL documents aligned with LRL queries, which are unlikely to exist in a low-resource setting. Thus we discard the train dataset of Mr. TyDi.
2. Although C$^2$LIR can also be applied to other PLMs, such as mBERT, we experiment with an English PLM. A comparison can be found in Table 4.
3. We allow 10 times more English sentences than LRL sentences, based on preliminary experiments to select the upsampling ratio of the LRL corpus.
4. https://github.com/castorini/mr.tydi/tree/4281b6515a.

References

Ansell, A., et al.: MAD-G: Multilingual adapter generation for efficient cross-lingual transfer. In: Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 4762–4781. Association for Computational Linguistics, Punta Cana, Dominican Republic (Nov 2021). https://doi.org/10.18653/v1/2021.findings-emnlp.410
Artetxe, M., Ruder, S., Yogatama, D.: On the cross-lingual transferability of monolingual representations. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4623–4637. Association for Computational Linguistics, Online (Jul 2020). https://doi.org/10.18653/v1/2020.acl-main.421
Asai, A., Kasai, J., Clark, J., Lee, K., Choi, E., Hajishirzi, H.: XOR QA: cross-lingual open-retrieval question answering. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 547–564. Association for Computational Linguistics, Online (Jun 2021). https://doi.org/10.18653/v1/2021.naacl-main.46
Chau, E.C., Lin, L.H., Smith, N.A.: Parsing with multilingual BERT, a small corpus, and a small treebank. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1324–1334. Association for Computational Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020.findings-emnlp.118
Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451. Association for Computational Linguistics, Online (Jul 2020). https://doi.org/10.18653/v1/2020.acl-main.747
Conneau, A., Lample, G., Ranzato, M., Denoyer, L., Jégou, H.: Word translation without parallel data. arXiv preprint arXiv:1710.04087 (2017)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019). https://doi.org/10.18653/v1/N19-1423
Karpukhin, V., et al.: Dense passage retrieval for open-domain question answering. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781. Association for Computational Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020.emnlp-main.550
Kwiatkowski, T., et al.: Natural Questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguist. 7, 452–466 (2019)
Lample, G., Conneau, A., Ranzato, M., Denoyer, L., Jégou, H.: Word translation without parallel data. In: International Conference on Learning Representations (Feb 2018)
Lee, D., Lee, J., Lee, G., Chun, B.g., Hwang, S.w.: SCOPA: Soft code-switching and pairwise alignment for zero-shot cross-lingual transfer. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, CIKM 2021, pp. 3176–3180. Association for Computing Machinery, New York (Oct 2021). https://doi.org/10.1145/3459637.3482176
Liu, Z., Winata, G.I., Fung, P.: Continual mixed-language pre-training for extremely low-resource neural machine translation. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 2706–2718. Association for Computational Linguistics, Online (Aug 2021). https://doi.org/10.18653/v1/2021.findings-acl.239
Muller, B., Anastasopoulos, A., Sagot, B., Seddah, D.: When being unseen from mBERT is just the beginning: handling new languages with multilingual language models. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 448–462. Association for Computational Linguistics, Online (Jun 2021). https://doi.org/10.18653/v1/2021.naacl-main.38
Google Research: BERT (2019). https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/multilingual.md
Ushio, A., Espinosa-Anke, L., Schockaert, S., Camacho-Collados, J.: BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies? In: Proceedings of the ACL-IJCNLP 2021 Main Conference. Association for Computational Linguistics (2021)
Wang, Z., K, K., Mayhew, S., Roth, D.: Extending multilingual BERT to low-resource languages. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2649–2656. Association for Computational Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020.findings-emnlp.240
Wu, S., Dredze, M.: Are all languages created equal in multilingual BERT? In: Proceedings of the 5th Workshop on Representation Learning for NLP, pp. 120–130. Association for Computational Linguistics, Online (Jul 2020). https://doi.org/10.18653/v1/2020.repl4nlp-1.16
Wu, Y., et al.: Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144 [cs] (Oct 2016)
Zhang, X., Ma, X., Shi, P., Lin, J.: Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval. arXiv:2108.08787 [cs] (Aug 2021)
Zhu, Y., et al.: Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 19–27. IEEE, Santiago, Chile (Dec 2015). https://doi.org/10.1109/ICCV.2015.11

Acknowledgement

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [No. 2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)]. This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2023-2020-0-01789) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation). We would like to thank Google’s TPU Research Cloud (TRC) program for providing Cloud TPUs.

Author information

Authors and Affiliations

Seoul National University, Seoul, South Korea
Jaeseong Lee, Dohyeon Lee, Jongho Kim & Seung-won Hwang

Authors

Jaeseong Lee
Dohyeon Lee
Jongho Kim
Seung-won Hwang

Corresponding author

Correspondence to Seung-won Hwang.

Editor information

Editors and Affiliations

University of Amsterdam, Amsterdam, The Netherlands
Jaap Kamps
Université Grenoble-Alpes, Saint-Martin-d’Hères, France
Lorraine Goeuriot
Università della Svizzera Italiana, Lugano, Switzerland
Fabio Crestani
University of Copenhagen, Copenhagen, Denmark
Maria Maistro
University of Tsukuba, Ibaraki, Japan
Hideo Joho
Dublin City University, Dublin, Ireland
Brian Davis
Dublin City University, Dublin, Ireland
Cathal Gurrin
Universität Regensburg, Regensburg, Germany
Udo Kruschwitz
Dublin City University, Dublin, Ireland
Annalina Caputo

About this paper

Cite this paper

Lee, J., Lee, D., Kim, J., Hwang, Sw. (2023). C$^2$LIR: Continual Cross-Lingual Transfer for Low-Resource Information Retrieval. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13981. Springer, Cham. https://doi.org/10.1007/978-3-031-28238-6_37

DOI: https://doi.org/10.1007/978-3-031-28238-6_37
Published: 17 March 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28237-9
Online ISBN: 978-3-031-28238-6
eBook Packages: Computer Science, Computer Science (R0)
