
Cross-Lingual Summarization of Speech-to-Speech Translation: A Baseline

  • Conference paper
Speech and Computer (SPECOM 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 15299)


Abstract

Cross-lingual speech-to-speech translation, which enables spoken language conversion from one language to another, plays a pivotal role in overcoming language barriers and promoting cross-cultural communication. The proliferation of multimedia content poses challenges for audiences to efficiently consume extended audio, such as news broadcasts, academic lectures, and political speeches. To address this, we propose a novel investigation of summarization in the context of cross-lingual speech-to-speech translation (S2S-Summ), with a focus on low-resource Indic languages. To the best of our knowledge, this task has not been explored in prior research. We develop and present a semi-synthetic dataset of translated summaries in Hindi (Hi), Bengali (Bn), Gujarati (Gu), and Tamil (Ta), alongside baseline models for this task. The performance of our models is evaluated using metrics such as BERTScore, ROUGE, and UniEval. Our study aims to catalyze further exploration in this area, facilitating streamlined access to multilingual audio content and enhancing information dissemination across linguistic boundaries. Code and data are available at https://github.com/pranavkarande/S2S-Summ.

P. Karande and B. Sarkar contributed equally.


Notes

  1. https://www.ted.com/
  2. https://huggingface.co/facebook/bart-large-cnn
  3. https://doctranslator.com/
  4. https://github.com/AI4Bharat/Indic-TTS
  5. https://huggingface.co/spaces/evaluate-metric/rouge
  6. https://github.com/Tiiiger/bert_score
  7. https://github.com/maszhongming/UniEval
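The ROUGE and BERTScore toolkits linked above are what the paper reports summary quality with. As a rough illustration of what ROUGE measures, here is a minimal pure-Python sketch of ROUGE-1 and ROUGE-L F1 over whitespace-tokenized text. This is only for intuition and is not the paper's evaluation code, which uses the HuggingFace `rouge` metric and the `bert_score` package linked above.

```python
from collections import Counter

def rouge1_f(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap between reference and candidate."""
    ref, cand = reference.split(), candidate.split()
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)

def _lcs_len(a, b):
    # Dynamic-programming longest common subsequence, the core of ROUGE-L.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rougeL_f(reference: str, candidate: str) -> float:
    """ROUGE-L F1: precision/recall over the longest common subsequence."""
    ref, cand = reference.split(), candidate.split()
    lcs = _lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)

if __name__ == "__main__":
    print(round(rouge1_f("the cat sat on the mat", "the cat sat"), 3))  # 0.667
```

Production toolkits add stemming, multi-reference support, and bootstrap confidence intervals on top of this core computation, which is why the linked packages should be preferred for reported numbers.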



Corresponding author

Correspondence to Pranav Karande.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Karande, P., Sarkar, B., Kumar Maurya, C. (2025). Cross-Lingual Summarization of Speech-to-Speech Translation: A Baseline. In: Karpov, A., Delić, V. (eds) Speech and Computer. SPECOM 2024. Lecture Notes in Computer Science, vol 15299. Springer, Cham. https://doi.org/10.1007/978-3-031-77961-9_9


  • DOI: https://doi.org/10.1007/978-3-031-77961-9_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-77960-2

  • Online ISBN: 978-3-031-77961-9

  • eBook Packages: Computer Science; Computer Science (R0)
