Assessing the Efficacy of Synthetic Data for Enhancing Machine Translation Models in Low Resource Domains

  • Conference paper
Big Data and Artificial Intelligence (BDA 2023)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14418))

Abstract

A synthetic dataset is artificially generated to mimic the statistical properties of real-world data while containing no real information. Data on rare events such as the Covid-19 pandemic are difficult to capture in real-world datasets because such events occur infrequently. In addition, gathering real-world data is costly and time-consuming. In such cases, synthetic data can help create more balanced datasets for model training. This project investigates the effectiveness of synthetic data for tuning machine translation models when training data is limited. The Covid-19 domain was chosen because of the urgency and importance of making pandemic-related information globally accessible. TICO-19, a publicly available dataset, was created to meet this need. Medical terminology was extracted and passed to the OpenAI API to generate language-pair training data. The fine-tuned davinci model was then evaluated on the blind test data provided with TICO-19 for English-to-French translation. Translation quality was measured with SacreBLEU: the fine-tuned model achieves a BLEU score of 19.54, compared with 0.44 for the base model. The adapted model also scores comparably to the next-generation version of davinci, which achieves a BLEU score of 22.29.
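As a rough illustration of the data-preparation step described in the abstract, the sketch below shows one way synthetic English-French pairs could be written out as prompt/completion records in JSON Lines form, the layout used by the legacy OpenAI fine-tuning guide cited in the notes. It is a minimal sketch under stated assumptions, not the paper's actual pipeline: the separator, the stop marker, the function names, and the example sentences are all hypothetical.

```python
import json

# Hypothetical separator and stop marker; the paper does not state its choices.
PROMPT_SEPARATOR = "\n\n###\n\n"
COMPLETION_STOP = " END"


def to_finetune_record(english: str, french: str) -> dict:
    """Build one prompt/completion pair in the legacy fine-tuning layout."""
    return {
        # The prompt ends with a fixed separator so the model can tell
        # where the source sentence stops.
        "prompt": english.strip() + PROMPT_SEPARATOR,
        # The completion starts with a space and ends with a stop marker,
        # following the legacy dataset-preparation guidance.
        "completion": " " + french.strip() + COMPLETION_STOP,
    }


def to_jsonl(pairs) -> str:
    """Serialise (english, french) pairs as JSON Lines, one record per line."""
    return "\n".join(
        json.dumps(to_finetune_record(en, fr), ensure_ascii=False)
        for en, fr in pairs
    )


# Illustrative sentence pairs only; not taken from TICO-19 or from the paper.
example_pairs = [
    ("Wash your hands regularly.", "Lavez-vous les mains régulièrement."),
    ("Wear a mask in public places.", "Portez un masque dans les lieux publics."),
]

jsonl_text = to_jsonl(example_pairs)
print(jsonl_text)
```

Each output line is a standalone JSON object, so the resulting file can be inspected or filtered line by line before it is used for fine-tuning. At inference time the same separator would be appended to the source sentence and the stop marker used as the stop sequence.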

National College of Ireland.

Notes

  1. https://phrase.com/blog/posts/machine-translation/

  2. https://platform.openai.com/docs/introduction/key-concepts

  3. https://platform.openai.com/docs/models/

  4. https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset

  5. https://github.com/shweta-0511/fineTuningDavinci/tree/master/thesis

Corresponding author

Correspondence to Shweta Yadav.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Yadav, S. (2023). Assessing the Efficacy of Synthetic Data for Enhancing Machine Translation Models in Low Resource Domains. In: Goyal, V., Kumar, N., Bhowmick, S.S., Goyal, P., Goyal, N., Kumar, D. (eds) Big Data and Artificial Intelligence. BDA 2023. Lecture Notes in Computer Science, vol 14418. Springer, Cham. https://doi.org/10.1007/978-3-031-49601-1_9

  • DOI: https://doi.org/10.1007/978-3-031-49601-1_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-49600-4

  • Online ISBN: 978-3-031-49601-1

  • eBook Packages: Computer Science, Computer Science (R0)
