Abstract
Fine-tuning pre-trained cross-lingual language models alleviates the need for annotated data in different languages, as it allows the models to transfer task-specific supervision between languages, especially from high- to low-resource languages. In this work, we propose to improve cross-lingual language understanding with consistency regularization-based fine-tuning. Specifically, we use example consistency regularization to penalize the prediction sensitivity to four types of data augmentations, i.e., subword sampling, Gaussian noise, code-switch substitution, and machine translation. In addition, we employ model consistency to regularize the models trained with two augmented versions of the same training set. Experimental results on the XTREME benchmark show that our method (the code is available at https://github.com/bozheng-hit/xTune) achieves significant improvements across various cross-lingual language understanding tasks, including text classification, question answering, and sequence labeling. Furthermore, we extend our method to few-shot cross-lingual transfer, particularly considering the more realistic scenario where machine translation systems are available. Machine translation as data augmentation also combines well with our consistency regularization method. Experimental results demonstrate that our method benefits the few-shot scenario as well.




Data availability statement
All datasets in the XTREME benchmark are available at http://github.com/google-research/xtreme. The bilingual dictionaries used in code-switch substitution data augmentation are available at http://github.com/facebookresearch/MUSE. The machine translation data augmentation is available in the repository https://github.com/bozheng-hit/xTune.
Notes
We define conventional cross-lingual fine-tuning as fine-tuning the pre-trained cross-lingual model with the labeled training set in the source language only (typically English) or with labeled training sets in all languages.
Implemented with .detach() in PyTorch; a minimal sketch is given after these notes.
X-STILTs [39] uses additional SQuAD v1.1 English training data for the TyDiQA-GoldP dataset, while we prefer a cleaner setting here.
FILTER directly selects the best model on the test set of XQuAD and TyDiQA-GoldP. Under this setting, we obtain 83.1/69.7 on XQuAD and 75.5/61.1 on TyDiQA-GoldP.
For span extraction datasets, to align the labels, the answers are enclosed in quotes before translating, which makes it easy to extract the answers from the translated context [30]. This method can also be applied to NER tasks. However, aligning the label information requires complex post-processing, and there can be alignment errors.
Paragraphs in XQuAD contain more question-answer pairs than those in MLQA.
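For reference, the following minimal PyTorch sketch illustrates how a consistency term can stop gradients through one branch with .detach(), as mentioned in the note above; the function and variable names are illustrative and not part of the released implementation.

```python
import torch
import torch.nn.functional as F

def model_consistency(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL-divergence consistency term in which gradients flow only through
    the student branch: the teacher distribution is detached (stop-gradient)."""
    teacher_probs = F.softmax(teacher_logits.detach(), dim=-1)   # .detach() blocks backprop
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    # KL(teacher || student), averaged over the batch
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```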
References
Aghajanyan A, Shrivastava A, Gupta A, et al (2020) Better fine-tuning by reducing representational collapse. CoRR. arXiv:2008.03156
Artetxe M, Ruder S, Yogatama D (2020) On the cross-lingual transferability of monolingual representations. In: Jurafsky D, Chai J, Schluter N, et al (eds) Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, July 5–10, 2020. Association for Computational Linguistics, pp 4623–4637. https://www.aclweb.org/anthology/2020.acl-main.421/
Athiwaratkun B, Finzi M, Izmailov P, et al (2019) There are many consistent explanations of unlabeled data: why you should average. In: 7th international conference on learning representations, ICLR 2019, New Orleans, LA, USA, May 6–9. OpenReview.net, https://openreview.net/forum?id=rkgKBhA5Y7
Carmon Y, Raghunathan A, Schmidt L, et al (2019) Unlabeled data improves adversarial robustness. In: Wallach HM, Larochelle H, Beygelzimer A, et al (eds) Advances in neural information processing systems 32: annual conference on neural information processing systems 2019, NeurIPS 2019, 8–14 December 2019, Vancouver, BC, Canada, pp 11190–11201. http://papers.nips.cc/paper/9298-unlabeled-data-improves-adversarial-robustness
Chi Z, Dong L, Wei F, et al (2020) InfoXLM: an information-theoretic framework for cross-lingual language model pre-training. CoRR. arXiv:2007.07834
Chi Z, Dong L, Zheng B, et al (2021) Improving pretrained cross-lingual language models via self-labeled word alignment. In: Zong C, Xia F, Li W, et al (eds) Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing, ACL/IJCNLP 2021, (vol 1: Long Papers), Virtual Event, August 1–6, 2021. Association for Computational Linguistics, pp 3418–3430. https://doi.org/10.18653/v1/2021.acl-long.265
Chi Z, Huang S, Dong L, et al (2022) XLM-E: cross-lingual language model pre-training via ELECTRA. In: Muresan S, Nakov P, Villavicencio A (eds) Proceedings of the 60th annual meeting of the association for computational linguistics (vol 1: Long Papers), ACL 2022, Dublin, Ireland, May 22–27, 2022. Association for Computational Linguistics, pp 6170–6182. https://doi.org/10.18653/v1/2022.acl-long.427
Chung HW, Garrette D, Tan KC, et al (2020) Improving multilingual models with language-clustered vocabularies. In: Webber B, Cohn T, He Y, et al (eds) Proceedings of the 2020 conference on empirical methods in natural language processing, EMNLP 2020, Online, November 16–20, 2020. Association for Computational Linguistics, pp 4536–4546. https://doi.org/10.18653/v1/2020.emnlp-main.367
Clark JH, Palomaki J, Nikolaev V, et al (2020) Tydi QA: a benchmark for information-seeking question answering in typologically diverse languages. Trans Assoc Comput Linguist 8:454–470. https://transacl.org/ojs/index.php/tacl/article/view/1929
Conneau A, Lample G (2019) Cross-lingual language model pretraining. In: Wallach HM, Larochelle H, Beygelzimer A, et al (eds) Advances in neural information processing systems 32: annual conference on neural information processing systems 2019, NeurIPS 2019, 8–14 December 2019, Vancouver, BC, Canada, pp 7057–7067. http://papers.nips.cc/paper/8928-cross-lingual-language-model-pretraining
Conneau A, Rinott R, Lample G, et al (2018) XNLI: evaluating cross-lingual sentence representations. In: Riloff E, Chiang D, Hockenmaier J, et al (eds) Proceedings of the 2018 conference on empirical methods in natural language processing, Brussels, Belgium, October 31–November 4, 2018. Association for Computational Linguistics, pp 2475–2485. https://doi.org/10.18653/v1/d18-1269
Conneau A, Khandelwal K, Goyal N, et al (2020a) Unsupervised cross-lingual representation learning at scale. In: Jurafsky D, Chai J, Schluter N, et al (eds) Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, July 5–10, 2020. Association for Computational Linguistics, pp 8440–8451. http://www.aclweb.org/anthology/2020.acl-main.747/
Conneau A, Wu S, Li H, et al (2020b) Emerging cross-lingual structure in pretrained language models. In: Jurafsky D, Chai J, Schluter N, et al (eds) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5–10, 2020. Association for Computational Linguistics, pp 6022–6034. https://www.aclweb.org/anthology/2020.acl-main.536/
Devlin J, Chang M, Lee K, et al (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein J, Doran C, Solorio T (eds) Proceedings of the 2019 conference of the North American chapter of the Association for computational linguistics: human language technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, vol 1 (Long and Short Papers). Association for Computational Linguistics, pp 4171–4186. https://doi.org/10.18653/v1/n19-1423
Fang Y, Wang S, Gan Z, et al (2020) FILTER: an enhanced fusion method for cross-lingual language understanding. CoRR. arXiv:2009.05166
Faruqui M, Dyer C (2014) Improving vector space word representations using multilingual correlation. In: Bouma G, Parmentier Y (eds) Proceedings of the 14th conference of the European chapter of the association for computational linguistics, EACL 2014, April 26–30, 2014, Gothenburg, Sweden. The Association for Computer Linguistics, pp 462–471. https://doi.org/10.3115/v1/e14-1049
Fei H, Zhang M, Ji D (2020) Cross-lingual semantic role labeling with high-quality translated training corpus. In: Jurafsky D, Chai J, Schluter N, et al (eds) Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, July 5–10, 2020. Association for Computational Linguistics, pp 7014–7026. http://www.aclweb.org/anthology/2020.acl-main.627/
Gao T, Han X, Xie R, et al (2020) Neural snowball for few-shot relation learning. In: The thirty-fourth AAAI conference on artificial intelligence, AAAI 2020, the thirty-second innovative applications of artificial intelligence conference, IAAI 2020, the tenth AAAI symposium on educational advances in artificial intelligence, EAAI 2020, New York, NY, USA, February 7–12, 2020. AAAI Press, pp 7772–7779. http://ojs.aaai.org/index.php/AAAI/article/view/6281
Guo J, Che W, Yarowsky D, et al (2015) Cross-lingual dependency parsing based on distributed representations. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing of the Asian federation of natural language processing, ACL 2015, July 26–31, 2015, Beijing, China, vol 1: Long Papers. The Association for Computer Linguistics, pp 1234–1244. https://doi.org/10.3115/v1/p15-1119
Hou Y, Che W, Lai Y, et al (2020) Few-shot slot tagging with collapsed dependency transfer and label-enhanced task-adaptive projection network. In: Jurafsky D, Chai J, Schluter N, et al (eds) Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, July 5-10, 2020. Association for Computational Linguistics, pp 1381–1393. https://doi.org/10.18653/v1/2020.acl-main.128
Hou Y, Mao J, Lai Y, et al (2020) Fewjoint: a few-shot learning benchmark for joint language understanding. CoRR. arXiv:2009.08138
Hu J, Ruder S, Siddhant A, et al (2020) XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In: Proceedings of the 37th international conference on machine learning, ICML 2020, 13–18 July 2020, virtual event, proceedings of machine learning research, vol 119. PMLR, pp 4411–4421. http://proceedings.mlr.press/v119/hu20b.html
Hu J, Johnson M, Firat O, et al (2021) Explicit alignment objectives for multilingual bidirectional encoders. In: Toutanova K, Rumshisky A, Zettlemoyer L, et al (eds) Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies, NAACL-HLT 2021, Online, June 6–11, 2021. Association for Computational Linguistics, pp 3633–3643. https://doi.org/10.18653/v1/2021.naacl-main.284
Hu W, Miyato T, Tokui S, et al (2017) Learning discrete representations via information maximizing self-augmented training. In: Precup D, Teh YW (eds) Proceedings of the 34th international conference on machine learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, proceedings of machine learning research, vol 70. PMLR, pp 1558–1567. http://proceedings.mlr.press/v70/hu17b.html
Jiang H, He P, Chen W, et al (2020) SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. In: Jurafsky D, Chai J, Schluter N, et al (eds) Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, July 5–10, 2020. Association for Computational Linguistics, pp 2177–2190. https://www.aclweb.org/anthology/2020.acl-main.197/
Kudo T (2018) Subword regularization: Improving neural network translation models with multiple subword candidates. In: Gurevych I, Miyao Y (eds) Proceedings of the 56th annual meeting of the association for computational linguistics, ACL 2018, Melbourne, Australia, July 15–20, 2018, vol 1: long papers. Association for Computational Linguistics, pp 66–75. https://doi.org/10.18653/v1/P18-1007. https://www.aclweb.org/anthology/P18-1007/
Kudo T, Richardson J (2018) Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In: Blanco E, Lu W (eds) Proceedings of the 2018 conference on empirical methods in natural language processing, EMNLP 2018: system demonstrations, Brussels, Belgium, October 31–November 4, 2018. Association for Computational Linguistics, pp 66–71. https://doi.org/10.18653/v1/d18-2012
Lample G, Conneau A, Denoyer L, et al (2018) Unsupervised machine translation using monolingual corpora only. In: 6th international conference on learning representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, conference track proceedings. OpenReview.net, http://openreview.net/forum?id=rkYTTf-AZ
Lauscher A, Ravishankar V, Vulic I, et al (2020) From zero to hero: On the limitations of zero-shot language transfer with multilingual transformers. In: Webber B, Cohn T, He Y, et al (eds) Proceedings of the 2020 conference on empirical methods in natural language processing, EMNLP 2020, Online, November 16–20, 2020. Association for Computational Linguistics, pp 4483–4499. https://doi.org/10.18653/v1/2020.emnlp-main.363
Lewis PSH, Oguz B, Rinott R, et al (2020) MLQA: evaluating cross-lingual extractive question answering. In: Jurafsky D, Chai J, Schluter N, et al (eds) Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, July 5–10, 2020. Association for Computational Linguistics, pp 7315–7330. http://www.aclweb.org/anthology/2020.acl-main.653/
Li H, Yan H, Li Y, et al (2023) Distinguishability calibration to in-context learning. CoRR. https://doi.org/10.48550/arXiv.2302.06198. arXiv:2302.06198
Liu X, Cheng H, He P, et al (2020) Adversarial training for large neural language models. CoRR. arXiv:2004.08994
Luo F, Wang W, Liu J, et al (2020) VECO: Variable encoder-decoder pre-training for cross-lingual understanding and generation. CoRR. arXiv:2010.16046
Lv X, Gu Y, Han X, et al (2019) Adapting meta knowledge graph information for multi-hop reasoning over few-shot relations. In: Inui K, Jiang J, Ng V, et al (eds) Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019. Association for Computational Linguistics, pp 3374–3379. https://doi.org/10.18653/v1/D19-1334
Mikolov T, Le QV, Sutskever I (2013) Exploiting similarities among languages for machine translation. CoRR. arXiv:1309.4168
Miyato T, Maeda S, Koyama M et al (2019) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Trans Pattern Anal Mach Intell 41(8):1979–1993. https://doi.org/10.1109/TPAMI.2018.2858821
Nivre J, Blokland R, Partanen N, et al (2018) Universal dependencies 2.2
Pan X, Zhang B, May J, et al (2017) Cross-lingual name tagging and linking for 282 languages. In: Barzilay R, Kan M (eds) Proceedings of the 55th annual meeting of the association for computational linguistics, ACL 2017, Vancouver, Canada, July 30–August 4, volume 1: long papers. Association for Computational Linguistics, pp 1946–1958. https://doi.org/10.18653/v1/P17-1178
Phang J, Htut PM, Pruksachatkun Y, et al (2020) English intermediate-task training improves zero-shot cross-lingual transfer too. CoRR. arXiv:2005.13013
Provilkov I, Emelianenko D, Voita E (2020) BPE-dropout: simple and effective subword regularization. In: Jurafsky D, Chai J, Schluter N, et al (eds) Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, July 5–10, 2020. Association for Computational Linguistics, pp 1882–1892. https://www.aclweb.org/anthology/2020.acl-main.170/
Qin L, Ni M, Zhang Y, et al (2020) CoSDA-ML: multi-lingual code-switching data augmentation for zero-shot cross-lingual NLP. In: Bessiere C (eds) Proceedings of the twenty-ninth international joint conference on artificial intelligence, IJCAI 2020. ijcai.org, pp 3853–3860. https://doi.org/10.24963/ijcai.2020/533
Shah DJ, Gupta R, Fayazi AA, et al (2019) Robust zero-shot cross-domain slot filling with example values. In: Korhonen A, Traum DR, Màrquez L (eds) Proceedings of the 57th conference of the association for computational linguistics, ACL 2019, Florence, Italy, July 28–August 2, 2019, vol 1: long papers. Association for Computational Linguistics, pp 5484–5490. https://doi.org/10.18653/v1/p19-1547
Singh J, McCann B, Keskar NS, et al (2019) XLDA: cross-lingual data augmentation for natural language inference and question answering. CoRR. arXiv:1905.11471
Sun S, Sun Q, Zhou K, et al (2019) Hierarchical attention prototypical networks for few-shot text classification. In: Inui K, Jiang J, Ng V, et al (eds) Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019. Association for Computational Linguistics, pp 476–485. https://doi.org/10.18653/v1/D19-1045
Tarvainen A, Valpola H (2017) Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: 5th international conference on learning representations, ICLR 2017, Toulon, France, April 24–26, 2017, Workshop Track Proceedings. OpenReview.net, http://openreview.net/forum?id=ry8u21rtl
Wang Y, Che W, Guo J, et al (2019) Cross-lingual BERT transformation for zero-shot dependency parsing. In: Inui K, Jiang J, Ng V, et al (eds) Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019. Association for Computational Linguistics, pp 5720–5726. https://doi.org/10.18653/v1/D19-1575
Xie Q, Dai Z, Hovy EH, et al (2020) Unsupervised data augmentation for consistency training. In: Larochelle H, Ranzato M, Hadsell R, et al (eds) Advances in neural information processing systems 33: annual conference on neural information processing systems 2020, NeurIPS 2020, December 6–12, 2020, virtual. http://proceedings.neurips.cc/paper/2020/hash/44feb0096faa8326192570788b38c1d1-Abstract.html
Xu H, Murray K (2022) Por qué não utiliser alla språk? mixed training with gradient optimization in few-shot cross-lingual transfer. CoRR. arXiv:2204.13869
Xu R, Yang Y, Otani N, et al (2018) Unsupervised cross-lingual transfer of word embedding spaces. In: Riloff E, Chiang D, Hockenmaier J, et al (eds) Proceedings of the 2018 conference on empirical methods in natural language processing, Brussels, Belgium, October 31–November 4, 2018. Association for Computational Linguistics, pp 2465–2474. https://doi.org/10.18653/v1/d18-1268
Yan H, Gui L, He Y (2022) Hierarchical interpretation of neural text classification. Comput Linguist 48(4):987–1020. https://doi.org/10.1162/coli_a_00459
Yan H, Gui L, Li W, et al (2022b) Addressing token uniformity in transformers via singular value transformation. In: Cussens J, Zhang K (eds) Uncertainty in artificial intelligence, proceedings of the thirty-eighth conference on uncertainty in artificial intelligence, UAI 2022, 1–5 August 2022, Eindhoven, The Netherlands, proceedings of machine learning research, vol 180. PMLR, pp 2181–2191. http://proceedings.mlr.press/v180/yan22b.html
Yan L, Zheng Y, Cao J (2018) Few-shot learning for short text classification. Multimed Tools Appl 77(22):29799–29810. https://doi.org/10.1007/s11042-018-5772-4
Yang H, Chen H, Zhou H et al (2022) Enhancing cross-lingual transfer by manifold mixup. In: The 10th International conference on learning representations, ICLR 2022. Virtual Event. April 25-29, 2022
Yang Y, Zhang Y, Tar C, et al (2019) PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In: Inui K, Jiang J, Ng V, et al (eds) Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019. Association for Computational Linguistics, pp 3685–3690. https://doi.org/10.18653/v1/D19-1382
Ye M, Zhang X, Yuen PC, et al (2019) Unsupervised embedding learning via invariant and spreading instance feature. In: IEEE conference on computer vision and pattern recognition, CVPR 2019, Long Beach, CA, USA, June 16–20, 2019. Computer Vision Foundation/IEEE, pp 6210–6219. https://doi.org/10.1109/CVPR.2019.00637. http://openaccess.thecvf.com/content_CVPR_2019/html/Ye_Unsupervised_Embedding_Learning_via_Invariant_and_Spreading_Instance_Feature_CVPR_2019_paper.html
Yu M, Guo X, Yi J, et al (2018) Diverse few-shot text classification with multiple metrics. In: Walker MA, Ji H, Stent A (eds) Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: human language technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1–6, 2018, vol 1 (long papers). Association for Computational Linguistics, pp 1206–1215. https://doi.org/10.18653/v1/n18-1109
Zhang M, Zhang Y, Fu G (2019) Cross-lingual dependency parsing using code-mixed treebank. In: Inui K, Jiang J, Ng V, et al (eds) Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019. Association for Computational Linguistics, pp 997–1006. https://doi.org/10.18653/v1/D19-1092
Zhao M, Zhu Y, Shareghi E, et al (2021) A closer look at few-shot crosslingual transfer: the choice of shots matters. In: Zong C, Xia F, Li W, et al (eds) Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing, ACL/IJCNLP 2021, (vol 1: long papers), virtual event, August 1–6, 2021. Association for Computational Linguistics, pp 5751–5767. https://doi.org/10.18653/v1/2021.acl-long.447
Zhao W, Eger S, Bjerva J, et al (2021) Inducing language-agnostic multilingual representations. In: Nastase V, Vulic I (eds) Proceedings of *SEM 2021: the tenth joint conference on lexical and computational semantics, *SEM 2021, Online, August 5–6, 2021. Association for Computational Linguistics, pp 229–240. https://doi.org/10.18653/v1/2021.starsem-1.22
Zheng B, Dong L, Huang S, et al (2021) Allocating large vocabulary capacity for cross-lingual language model pre-training. In: Moens M, Huang X, Specia L, et al (eds) Proceedings of the 2021 conference on empirical methods in natural language processing, EMNLP 2021, virtual event/Punta Cana, Dominican Republic, 7–11 November, 2021. Association for Computational Linguistics, pp 3203–3215. https://doi.org/10.18653/v1/2021.emnlp-main.257
Zheng S, Song Y, Leung T, et al (2016) Improving the robustness of deep neural networks via stability training. In: 2016 IEEE conference on computer vision and pattern recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016. IEEE Computer Society, pp 4480–4488. https://doi.org/10.1109/CVPR.2016.485
Zhu C, Cheng Y, Gan Z, et al (2020) FreeLB: enhanced adversarial training for natural language understanding. In: 8th international conference on learning representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net, https://openreview.net/forum?id=BygzbyHFvB
Acknowledgements
This work was supported by the National Key R&D Program of China via Grant 2020AAA0106501 and the National Natural Science Foundation of China (NSFC) via Grants 62236004 and 61976072.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: Statistics of XTREME datasets
See Table 9.
Appendix 2: Hyper-parameters
See Table 10.
See Table 11.
1.1 Conventional cross-lingual fine-tuning
For XNLI, PAWS-X, POS, and NER, we fine-tune for 10 epochs. For XQuAD and MLQA, we fine-tune for 4 epochs. For TyDiQA-GoldP, we fine-tune for 20 and 10 epochs for the base and large models, respectively. We select \(\lambda _{1}\) from [1.0, 2.0, 5.0] and \(\lambda _{2}\) from [0.3, 0.5, 1.0, 2.0, 5.0]. For the learning rate, we select from [5e-6, 7e-6, 1e-5, 1.5e-5] for large models and from [7e-6, 1e-5, 2e-5, 3e-5] for base models. We use a batch size of 32 for all datasets and 10% of the total training steps for warmup with a linear learning rate schedule. Our experiments are conducted on a single 32GB Nvidia V100 GPU, and we use gradient accumulation for large-size models. The other hyper-parameters for the two-stage xTune training are shown in Tables 10 and 11.
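For reference, the search space above can be restated as a small configuration sketch; the dictionary keys and structure below are our own and only mirror the values listed in this subsection.

```python
# Hyper-parameter search space for conventional cross-lingual fine-tuning,
# restated from the text above (keys and structure are illustrative only).
SEARCH_SPACE = {
    "epochs": {
        "XNLI": 10, "PAWS-X": 10, "POS": 10, "NER": 10,
        "XQuAD": 4, "MLQA": 4,
        "TyDiQA-GoldP": {"base": 20, "large": 10},
    },
    "lambda_1": [1.0, 2.0, 5.0],
    "lambda_2": [0.3, 0.5, 1.0, 2.0, 5.0],
    "learning_rate": {
        "large": [5e-6, 7e-6, 1e-5, 1.5e-5],
        "base": [7e-6, 1e-5, 2e-5, 3e-5],
    },
    "batch_size": 32,
    "warmup_ratio": 0.1,  # linear schedule with 10% of total steps for warmup
}
```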
1.2 Few-shot cross-lingual fine-tuning
During the source-training stage, we use the same hyper-parameters as in conventional cross-lingual fine-tuning. During the target-adapting stage, for the POS and NER tasks, we fine-tune the model for 100 epochs, selecting the model on the development set after every epoch with an early-stopping patience of 10 epochs. For the MLQA and XQuAD tasks, we fine-tune the model for 2 or 3 epochs and use the model from the last epoch. For the learning rate, we select from [2e-5, 1e-5]. We use a batch size of 32 for the POS and NER tasks and a batch size of 8 for the XQuAD and MLQA tasks. We set \(\lambda _1\) to 5.0 in UDA since it performed best in the previous experiments.
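The early-stopping schedule for POS and NER target adapting can be sketched as follows; train_one_epoch and evaluate are hypothetical placeholders for the task-specific training step and development-set evaluation, not part of the released code.

```python
def target_adapt(model, train_one_epoch, evaluate,
                 max_epochs: int = 100, patience: int = 10):
    """Early-stopping schedule for target adapting: evaluate on the
    development set after every epoch and stop after `patience` epochs
    without improvement, then restore the best checkpoint."""
    best_score, best_state, epochs_without_gain = float("-inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        score = evaluate(model)                   # development-set metric
        if score > best_score:
            best_score, epochs_without_gain = score, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:   # early stopping
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return model, best_score
```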
Appendix 3: Results for each dataset and language
We provide detailed results for each dataset and language below. We compare our method against \(\text {XLM-R}_\text {large}\) in the cross-lingual transfer setting and against FILTER [15] in the translate-train-all setting.
Appendix 4: How to select data augmentation strategies in xTune
We give instructions for selecting a proper data augmentation strategy depending on the task.
1.1 Classification
For classification, the two distributions in example consistency \(\mathcal {R}_{1}\) can always be aligned, since the predictions are made over the same label set regardless of the input language. Therefore, we recommend using machine translation as data augmentation if machine translation systems are available. Otherwise, the priority among our data augmentation strategies is code-switch substitution, then subword sampling, then Gaussian noise.
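As a minimal sketch, one common way to align the two predictive distributions for classification is a symmetrized KL term between the predictions for an example and for its machine translation; the code below is illustrative, assumes logits over a shared label space, and is not the released implementation.

```python
import torch
import torch.nn.functional as F

def example_consistency(logits_orig: torch.Tensor,
                        logits_aug: torch.Tensor) -> torch.Tensor:
    """Symmetrized KL between the class distributions predicted for an
    example and for its augmentation (e.g., its machine translation).
    For classification the two predictions share the same label space,
    so the distributions can always be aligned."""
    p = F.softmax(logits_orig, dim=-1)
    q = F.softmax(logits_aug, dim=-1)
    kl_pq = F.kl_div(F.log_softmax(logits_aug, dim=-1), p, reduction="batchmean")   # KL(p || q)
    kl_qp = F.kl_div(F.log_softmax(logits_orig, dim=-1), q, reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)
```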
1.2 Span extraction
For span extraction, the two distributions in example consistency \(\mathcal {R}_{1}\) cannot be aligned for translation pairs, so it is impossible to use machine translation as data augmentation in example consistency \(\mathcal {R}_{1}\). We prefer code-switch substitution when applying example consistency \(\mathcal {R}_{1}\) alone. However, when the training corpus is augmented with translations, bilingual dictionaries between arbitrary language pairs may not be available, so we recommend using subword sampling in example consistency \(\mathcal {R}_1\).
1.3 Sequence labeling
Similar to span extraction, the two distributions in example consistency \(\mathcal {R}_{1}\) cannot be aligned for translation pairs, so we do not use machine translation in example consistency \(\mathcal {R}_{1}\). Unlike classification and span extraction, sequence labeling requires finer-grained information and is more sensitive to noise. We found that code-switch substitution is worse than subword sampling as data augmentation in both example consistency \(\mathcal {R}_{1}\) and model consistency \(\mathcal {R}_{2}\); it can even degrade performance for certain hyper-parameters. Thus, we recommend using subword sampling in example consistency \(\mathcal {R}_{1}\), and using machine translation to augment the English training corpus if machine translation systems are available, otherwise subword sampling.
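The recommendations of this appendix can be summarized as a small decision helper; the function below merely restates the guidance above, and its name and interface are hypothetical.

```python
def pick_augmentation(task: str, has_mt: bool = False,
                      has_bilingual_dict: bool = False,
                      translate_train: bool = False) -> str:
    """Return the data augmentation recommended for example consistency R1,
    summarizing Appendix 4 (names and interface are illustrative)."""
    if task == "classification":
        if has_mt:
            return "machine_translation"
        if has_bilingual_dict:
            return "code_switch"
        return "subword_sampling"   # then Gaussian noise as a further fallback
    if task == "span_extraction":
        # Translation pairs cannot be label-aligned, so MT is excluded from R1.
        if translate_train or not has_bilingual_dict:
            return "subword_sampling"
        return "code_switch"
    if task == "sequence_labeling":
        # Code-switch is noisier than subword sampling for token-level labels.
        return "subword_sampling"
    raise ValueError(f"unknown task: {task}")

# Example: classification with an available MT system -> "machine_translation"
print(pick_augmentation("classification", has_mt=True))
```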
Appendix 5: Results for each language
See Tables 12, 13, 14, 15, 16 and 17.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zheng, B., Che, W. Improving cross-lingual language understanding with consistency regularization-based fine-tuning. Int. J. Mach. Learn. & Cyber. 14, 3621–3639 (2023). https://doi.org/10.1007/s13042-023-01854-1