Abstract
Cross-lingual speech-to-speech translation, which enables spoken language conversion from one language to another, plays a pivotal role in overcoming language barriers and promoting cross-cultural communication. The proliferation of multimedia content makes it challenging for audiences to efficiently consume extended audio, such as news broadcasts, academic lectures, and political speeches. To address this, we propose a novel investigation of summarization in the context of cross-lingual speech-to-speech translation (S2S-Summ), with a focus on low-resource Indic languages. To the best of our knowledge, this task has not been explored in prior research. We develop and present a semi-synthetic dataset of translated summaries in Hindi (Hi), Bengali (Bn), Gujarati (Gu), and Tamil (Ta), alongside baseline models for this task. The performance of our models is evaluated using metrics such as BERTScore, ROUGE, and UniEval. Our study aims to catalyze further exploration in this area, facilitating streamlined access to multilingual audio content and enhancing information dissemination across linguistic boundaries. Code and data are available at https://github.com/pranavkarande/S2S-Summ.
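ROUGE, one of the metrics named above, scores a candidate summary by its n-gram overlap with a reference summary. The paper presumably uses a standard toolkit implementation; the following is only an illustrative pure-Python sketch of ROUGE-N precision, recall, and F1 on whitespace-tokenized text:

```python
from collections import Counter


def rouge_n(candidate: str, reference: str, n: int = 1):
    """Illustrative ROUGE-N: n-gram overlap between candidate and reference.

    Returns (precision, recall, f1). Real evaluations typically use a
    standard package with stemming and tokenization options.
    """
    def ngrams(text: str, n: int) -> Counter:
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Clipped overlap: each n-gram counts at most min(candidate, reference) times.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, `rouge_n("the cat sat", "the cat sat on the mat")` yields perfect precision but recall of 0.5, since the candidate covers only half of the reference's unigrams.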
P. Karande and B. Sarkar contributed equally.
References
Fitch, W.T.: The evolution of language: a comparative review. Biol. Philos. 20, 193–203 (2005)
Popuri, S., et al.: Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation. In: Interspeech (2022)
Wang, Y., Bai, J., Huang, R., Li, R., Hong, Z., Zhao, Z.: Speech-to-speech translation with discrete-unit-based style transfer. arXiv:2309.07566 (2023)
Inaguma, H., et al.: UnitY: two-pass direct speech-to-speech translation with discrete units. arXiv:2212.08055 (2022)
Zhou, G., Lam, T., Birch, A., Haddow, B.: Prosody in cascade and direct speech-to-text translation: a case study on Korean wh-phrases. In: Findings of the Association for Computational Linguistics (2024)
Sarkar, B., Maurya, C.K., Agrahri, A.: Direct speech to text translation: bridging the modality gap using SimSiam. In: Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023), pp. 250–255 (2023)
Zhou, Q., Yang, N., Wei, F., Huang, S., Zhou, M., Zhao, T.: Neural document summarization by jointly learning to score and select sentences. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 654–663 (2018)
Zhong, M., Liu, P., Wang, D., Qiu, X., Huang, X.: Searching for effective neural extractive summarization: what works and what’s next. In: Annual Meeting of the Association for Computational Linguistics (2019)
Wang, D., Liu, P., Zhong, M., Fu, J., Qiu, X., Huang, X.: Exploring domain shift in extractive text summarization. arXiv:1908.11664 (2019)
Wang, D., Liu, P., Zheng, Y., Qiu, X., Huang, X.: Heterogeneous graph neural networks for extractive document summarization. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6209–6219 (2020)
Narayan, S., Cohen, S.B., Lapata, M.: Ranking sentences for extractive summarization with reinforcement learning. In: North American Chapter of the Association for Computational Linguistics (2018)
Arumae, K., Liu, F.: Reinforced extractive summarization with question-focused rewards. arXiv:1805.10392 (2018)
Jadhav, A., Rajan, V.: Extractive summarization with SWAP-NET: sentences and words from alternating pointer networks. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 142–151 (2018)
Liu, Y., Lapata, M.: Text summarization with pretrained encoders. arXiv:1908.08345 (2019)
Givchi, A., Ramezani, R., Baraani-Dastjerdi, A.: Graph-based abstractive biomedical text summarization. J. Biomed. Inform. 132, 104099 (2022). https://doi.org/10.1016/j.jbi.2022.104099
Rothe, S., Narayan, S., Severyn, A.: Leveraging pre-trained checkpoints for sequence generation tasks. Trans. Assoc. Comput. Linguist. 8, 264–280 (2020)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692 (2019)
Sharma, R., et al.: End-to-end speech summarization using restricted self-attention. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8072–8076 (2021)
Matsuura, K., et al.: Leveraging large text corpora for end-to-end speech summarization. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2023)
Monteiro, R., Pernes, D.: Towards end-to-end speech-to-text summarization. arXiv:2306.05432 (2023)
Gangi, M.A.D., et al.: MuST-C: a multilingual speech translation corpus. In: North American Chapter of the Association for Computational Linguistics (2019)
Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Annual Meeting of the Association for Computational Linguistics (2019)
Chen, D., Bolton, J., Manning, C.D.: A thorough examination of the CNN/daily mail reading comprehension task. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2358–2367 (2016)
Kumar, G.K., et al.: Towards building text-to-speech systems for the next billion users. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2022)
Xue, L., et al.: mT5: a massively multilingual pre-trained text-to-text transformer. In: North American Chapter of the Association for Computational Linguistics (2020)
Grusky, M., Naaman, M., Artzi, Y.: Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 708–719 (2018)
Vaswani, A., et al.: Attention is all you need. In: Neural Information Processing Systems (2017)
Ott, M., et al.: fairseq: a fast, extensible toolkit for sequence modeling. In: North American Chapter of the Association for Computational Linguistics (2019)
Tang, Y., et al.: Multilingual translation from denoising pre-training. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 3450–3466 (2021)
Kudugunta, S., et al.: MADLAD-400: a multilingual and document-level large audited dataset. arXiv:2309.04662 (2023)
Seamless Communication, et al.: Seamless: multilingual expressive and streaming speech translation. arXiv:2312.05187 (2023)
NLLB Team, et al.: No language left behind: scaling human-centered machine translation. arXiv:2207.04672 (2022)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020)
Ali, A., Renals, S.: Word error rate estimation for speech recognition: e-WER. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 20–24 (2018)
Post, M.: A call for clarity in reporting BLEU scores. In: Conference on Machine Translation (2018)
Lin, C.-Y.: ROUGE: a package for automatic evaluation of summaries. In: Annual Meeting of the Association for Computational Linguistics (2004)
Zhang, T., et al.: BERTScore: evaluating text generation with BERT. arXiv:1904.09675 (2019)
Zhong, M., et al.: Towards a unified multi-dimensional evaluator for text generation. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 2023–2038 (2022)
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Karande, P., Sarkar, B., Maurya, C.K. (2025). Cross-lingual summarization of speech-to-speech translation: a baseline. In: Karpov, A., Delić, V. (eds.) Speech and Computer. SPECOM 2024. Lecture Notes in Computer Science, vol. 15299. Springer, Cham. https://doi.org/10.1007/978-3-031-77961-9_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-77960-2
Online ISBN: 978-3-031-77961-9
eBook Packages: Computer Science (R0)