Assessing the Efficacy of Synthetic Data for Enhancing Machine Translation Models in Low Resource Domains

Yadav, Shweta

doi:10.1007/978-3-031-49601-1_9

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14418)

Included in the following conference series:

International Conference on Big Data Analytics

Abstract

A synthetic dataset is artificially generated to mimic the statistical properties of real-world data without containing any real information. Rare events such as the Covid-19 pandemic are difficult to capture in real-world data because they occur so infrequently. In addition, gathering real-world data is costly and time-consuming. In such cases, synthetic data can help create more balanced datasets for model training. This project investigates the effectiveness of synthetic data for fine-tuning machine translation models when training data is limited. The Covid-19 domain was chosen because of the urgency and importance of making pandemic-related information globally accessible. TICO-19, a publicly available dataset, was created to meet this need. The medical terminology was extracted and passed to the OpenAI API to generate language-pair training data. The fine-tuned davinci model was then evaluated on the blind test data provided with TICO-19 for translation from English to French. Translation quality was measured with SacreBLEU: the fine-tuned model achieved a BLEU score of 19.54, substantially higher than the base model's score of 0.44. The adapted model's score is also comparable to that of the next-generation version of davinci, which achieved a BLEU score of 22.29.
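To make the evaluation metric concrete, the following is a minimal sketch of corpus-level BLEU, assuming whitespace tokenization, one reference per sentence, and no smoothing. It is not the SacreBLEU implementation used in the paper (SacreBLEU applies its own tokenization and smoothing, so its scores will differ), and the function names `ngrams` and `corpus_bleu` and the example sentences are illustrative assumptions, not taken from the chapter.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU on a 0-100 scale (assumption: whitespace
    tokens, a single reference per hypothesis, no smoothing)."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses shorter than the references.
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)


# Hypothetical English-to-French outputs scored against one reference.
reference = ["le virus se propage rapidement dans la population"]
exact = ["le virus se propage rapidement dans la population"]
partial = ["le virus se propage rapidement dans la ville"]
print(corpus_bleu(exact, reference))
print(corpus_bleu(partial, reference))
```

A score near 0, like the base model's 0.44, means almost no higher-order n-grams overlap with the reference, while scores around 20 indicate partial overlap. Because tokenization and smoothing affect the result, numbers from this sketch should not be compared directly with the scores reported in the abstract.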

National College of Ireland.

Author information

Authors and Affiliations

National College of Ireland, Dublin, Ireland
Shweta Yadav

Corresponding author

Correspondence to Shweta Yadav.

Editor information

Editors and Affiliations

Indraprastha Institute of Information Technology, Delhi, India
Vikram Goyal
University of Delhi, Delhi, India
Naveen Kumar
Nanyang Technological University, Singapore, Singapore
Sourav S. Bhowmick
Indian Institute of Technology, Kharagpur, India
Pawan Goyal
Birla Institute of Technology and Science, Pilani, India
Navneet Goyal
Indraprastha Institute of Information Technology, Delhi, India
Dhruv Kumar

About this paper

Cite this paper

Yadav, S. (2023). Assessing the Efficacy of Synthetic Data for Enhancing Machine Translation Models in Low Resource Domains. In: Goyal, V., Kumar, N., Bhowmick, S.S., Goyal, P., Goyal, N., Kumar, D. (eds) Big Data and Artificial Intelligence. BDA 2023. Lecture Notes in Computer Science, vol 14418. Springer, Cham. https://doi.org/10.1007/978-3-031-49601-1_9

DOI: https://doi.org/10.1007/978-3-031-49601-1_9
Published: 04 December 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-49600-4
Online ISBN: 978-3-031-49601-1
eBook Packages: Computer Science, Computer Science (R0)
