ISCA Archive Interspeech 2021

Multimodal Speech Summarization Through Semantic Concept Learning

Shruti Palaskar, Ruslan Salakhutdinov, Alan W. Black, Florian Metze

We propose a cascaded multimodal abstractive speech summarization model that generates semantic concepts as an intermediate step towards summarization. We describe a method to leverage existing multimodal dataset annotations to curate ground-truth labels for such intermediate concept modeling. In addition to enabling cascaded training, the concept labels provide an interpretable intermediate output level that helps improve performance on the downstream summarization task. On the open-domain How2 data, we conduct utterance-level and video-level experiments for two granularities of concepts: Specific and Abstract. We compare various multimodal fusion models for concept generation based on the respective input modalities. We observe consistent improvements in concept modeling by using multimodal adaptation models over unimodal models. Using the cascaded multimodal speech summarization model, we see a significant improvement of 7.5 METEOR points and 5.1 ROUGE-L points compared to previous methods of speech summarization. Finally, we show the benefits of scalability of the proposed approaches on 2000 h of video data.


doi: 10.21437/Interspeech.2021-1923

Cite as: Palaskar, S., Salakhutdinov, R., Black, A.W., Metze, F. (2021) Multimodal Speech Summarization Through Semantic Concept Learning. Proc. Interspeech 2021, 791-795, doi: 10.21437/Interspeech.2021-1923

@inproceedings{palaskar21_interspeech,
  author={Shruti Palaskar and Ruslan Salakhutdinov and Alan W. Black and Florian Metze},
  title={{Multimodal Speech Summarization Through Semantic Concept Learning}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={791--795},
  doi={10.21437/Interspeech.2021-1923}
}