Abstract
Cross-lingual speech-to-speech translation, which enables spoken language conversion from one language to another, plays a pivotal role in overcoming language barriers and promoting cross-cultural communication. The proliferation of multimedia content makes it challenging for audiences to efficiently consume extended audio, such as news broadcasts, academic lectures, and political speeches. To address this, we propose a novel investigation of summarization in the context of cross-lingual speech-to-speech translation (S2S-Summ), with a focus on low-resource Indic languages. To the best of our knowledge, this task has not been explored in prior research. We develop and present a semi-synthetic dataset of translated summaries in Hindi (Hi), Bengali (Bn), Gujarati (Gu), and Tamil (Ta), alongside baseline models for this task. The performance of our models is evaluated using metrics such as BERTScore, ROUGE, and UniEval. Our study aims to catalyze further exploration in this area, facilitating streamlined access to multilingual audio content and enhancing information dissemination across linguistic boundaries. Code and data are available at https://github.com/pranavkarande/S2S-Summ.
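ROUGE, one of the metrics named above, scores a candidate summary by its n-gram overlap with a reference summary. The paper presumably uses a standard toolkit implementation; the following is only an illustrative pure-Python sketch of ROUGE-N precision, recall, and F1 on whitespace-tokenized text:

```python
from collections import Counter


def rouge_n(candidate: str, reference: str, n: int = 1):
    """Illustrative ROUGE-N: n-gram overlap between candidate and reference.

    Returns (precision, recall, f1). Real evaluations typically use a
    standard package with stemming and tokenization options.
    """
    def ngrams(text: str, n: int) -> Counter:
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Clipped overlap: each n-gram counts at most min(candidate, reference) times.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, `rouge_n("the cat sat", "the cat sat on the mat")` yields perfect precision but recall of 0.5, since the candidate covers only half of the reference's unigrams.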
P. Karande and B. Sarkar contributed equally.
References
Fitch, W.T.: The evolution of language: a comparative review. Biol. Philos. 20, 193–203 (2005)
Popuri, S., et al.: Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation. In: Interspeech (2022)
Wang, Y., Bai, J., Huang, R., Li, R., Hong, Z., Zhao, Z.: Speech-to-speech translation with discrete-unit-based style transfer. arXiv:2309.07566 (2023)
Inaguma, H., et al.: UnitY: two-pass direct speech-to-speech translation with discrete units. arXiv:2212.08055 (2022)
Zhou, G., Lam, T., Birch, A., Haddow, B.: Prosody in cascade and direct speech-to-text translation: a case study on Korean wh-phrases. In: Findings of the Association for Computational Linguistics (2024)
Sarkar, B., Maurya, C.K., Agrahri, A.: Direct speech to text translation: bridging the modality gap using SimSiam. In: Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023), pp. 250–255 (2023)
Zhou, Q., Yang, N., Wei, F., Huang, S., Zhou, M., Zhao, T.: Neural document summarization by jointly learning to score and select sentences. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 654–663 (2018)
Zhong, M., Liu, P., Wang, D., Qiu, X., Huang, X.: Searching for effective neural extractive summarization: what works and what’s next. In: Annual Meeting of the Association for Computational Linguistics (2019)
Wang, D., Liu, P., Zhong, M., Fu, J., Qiu, X., Huang, X.: Exploring domain shift in extractive text summarization. arXiv:1908.11664 (2019)
Wang, D., Liu, P., Zheng, Y., Qiu, X., Huang, X.: Heterogeneous graph neural networks for extractive document summarization. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6209–6219 (2020)
Narayan, S., Cohen, S.B., Lapata, M.: Ranking sentences for extractive summarization with reinforcement learning. In: North American Chapter of the Association for Computational Linguistics (2018)
Arumae, K., Liu, F.: Reinforced extractive summarization with question-focused rewards. arXiv:1805.10392 (2018)
Jadhav, A., Rajan, V.: Extractive summarization with SWAP-NET: sentences and words from alternating pointer networks. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 142–151 (2018)
Liu, Y., Lapata, M.: Text summarization with pretrained encoders. arXiv:1908.08345 (2019)
Givchi, A., Ramezani, R., Baraani-Dastjerdi, A.: Graph-based abstractive biomedical text summarization. J. Biomed. Inform. 132, 104099 (2022). https://doi.org/10.1016/j.jbi.2022.104099
Rothe, S., Narayan, S., Severyn, A.: Leveraging pre-trained checkpoints for sequence generation tasks. Trans. Assoc. Comput. Linguist. 8, 264–280 (2020)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692 (2019)
Sharma, R., et al.: End-to-end speech summarization using restricted self-attention. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8072–8076 (2021)
Matsuura, K., et al.: Leveraging large text corpora for end-to-end speech summarization. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2023)
Monteiro, R., Pernes, D.: Towards end-to-end speech-to-text summarization. arXiv:2306.05432 (2023)
Gangi, M.A.D., et al.: MuST-C: a multilingual speech translation corpus. In: North American Chapter of the Association for Computational Linguistics (2019)
Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Annual Meeting of the Association for Computational Linguistics (2019)
Chen, D., Bolton, J., Manning, C.D.: A thorough examination of the CNN/daily mail reading comprehension task. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2358–2367 (2016)
Kumar, G.K., et al.: Towards building text-to-speech systems for the next billion users. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2022)
Xue, L., et al.: mT5: a massively multilingual pre-trained text-to-text transformer. In: North American Chapter of the Association for Computational Linguistics (2020)
Grusky, M., Naaman, M., Artzi, Y.: Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 708–719 (2018)
Vaswani, A., et al.: Attention is all you need. In: Neural Information Processing Systems (2017)
Ott, M., et al.: fairseq: a fast, extensible toolkit for sequence modeling. In: North American Chapter of the Association for Computational Linguistics (2019)
Tang, Y., et al.: Multilingual translation from denoising pre-training. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 3450–3466 (2021)
Kudugunta, S., et al.: MADLAD-400: a multilingual and document-level large audited dataset. arXiv:2309.04662 (2023)
Seamless Communication, et al.: Seamless: multilingual expressive and streaming speech translation. arXiv:2312.05187 (2023)
NLLB Team, et al.: No language left behind: scaling human-centered machine translation. arXiv:2207.04672 (2022)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020)
Ali, A., Renals, S.: Word error rate estimation for speech recognition: e-WER. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 20–24 (2018)
Post, M.: A call for clarity in reporting BLEU scores. In: Conference on Machine Translation (2018)
Lin, C.-Y.: ROUGE: a package for automatic evaluation of summaries. In: Annual Meeting of the Association for Computational Linguistics (2004)
Zhang, T., et al.: BERTScore: evaluating text generation with BERT. arXiv:1904.09675 (2019)
Zhong, M., et al.: Towards a unified multi-dimensional evaluator for text generation. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 2023–2038 (2022)
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Karande, P., Sarkar, B., Maurya, C.K. (2025). Cross-lingual summarization of speech-to-speech translation: a baseline. In: Karpov, A., Delić, V. (eds.) Speech and Computer. SPECOM 2024. Lecture Notes in Computer Science, vol. 15299. Springer, Cham. https://doi.org/10.1007/978-3-031-77961-9_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-77960-2
Online ISBN: 978-3-031-77961-9
eBook Packages: Computer Science (R0)