DOI: 10.1145/3678957.3685705
ICMI '24 Conference Proceedings
Research Article | Open Access

DoubleDistillation: Enhancing LLMs for Informal Text Analysis using Multistage Knowledge Distillation from Speech and Text

Published: 04 November 2024

Abstract

Traditional large language models (LLMs) leverage extensive text corpora but lack access to the acoustic and paralinguistic cues present in speech. There is growing interest in enhancing text-based models with audio information. However, current models typically require an aligned audio-text dataset, which is often much smaller than typical language model training corpora, and many also require both text and audio streams during inference/testing. In this study, we introduce a novel two-stage knowledge distillation (KD) approach that enables language models to (a) incorporate rich acoustic and paralinguistic information from speech, (b) utilize text corpora comparable in size to typical language model training data, and (c) support text-only analysis without requiring an audio stream during inference/testing. Specifically, in the first stage we employ a pre-trained speech embedding teacher model (OpenAI Whisper) to train a Teacher Assistant (TA) model on an aligned audio-text dataset. In the second stage, the TA's knowledge is transferred to a student language model trained on a conventional text dataset. Thus, our two-stage KD method leverages both the acoustic and paralinguistic cues in the aligned audio-text data and the nuanced linguistic knowledge in a large text-only dataset. In our evaluation, this DoubleDistillation system consistently outperforms traditional LLMs on 15 informal text understanding tasks.
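To make the two-stage pipeline concrete, below is a minimal PyTorch sketch of the training loop structure the abstract describes. It is an illustrative approximation, not the authors' implementation: the wrapper classes, pooling strategy, and loss choices (cosine alignment in stage 1, representation matching in stage 2) are assumptions; the paper may use different objectives and model interfaces.

```python
# Minimal sketch of the two-stage knowledge distillation described in the abstract.
# All names (ProjectionHead, stage1_step, stage2_step, batch keys) are illustrative
# assumptions. Stage 1: align a text-based Teacher Assistant (TA) with frozen Whisper
# speech embeddings on paired audio-text data. Stage 2: distill the speech-informed TA
# into a text-only student LM on a large text corpus, so no audio is needed at inference.

import torch
import torch.nn.functional as F
from torch import nn


class ProjectionHead(nn.Module):
    """Maps one encoder's pooled output into another encoder's embedding space."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


def stage1_step(whisper_encoder, ta_model, ta_head, batch, optimizer):
    """Stage 1: pull the TA's text representation toward the frozen Whisper
    speech embedding of the same utterance (aligned audio-text pair)."""
    whisper_encoder.eval()  # speech teacher stays frozen
    with torch.no_grad():
        speech_emb = whisper_encoder(batch["audio"]).mean(dim=1)   # pooled acoustic embedding
    text_emb = ta_head(ta_model(batch["text_ids"]).mean(dim=1))    # pooled text embedding
    # One plausible alignment loss; the paper's exact objective may differ.
    loss = 1.0 - F.cosine_similarity(text_emb, speech_emb, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def stage2_step(ta_model, student_model, student_head, batch, optimizer):
    """Stage 2: distill the (now speech-informed) TA into a text-only student
    on conventional text data; only text is required at test time."""
    ta_model.eval()  # the TA now acts as the teacher and stays frozen
    with torch.no_grad():
        teacher_hidden = ta_model(batch["text_ids"]).mean(dim=1)
    student_hidden = student_head(student_model(batch["text_ids"]).mean(dim=1))
    # Representation-matching loss (assumed form).
    loss = F.mse_loss(student_hidden, teacher_hidden)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the two stages simply share a common pattern (frozen teacher, trainable student with a projection head); the key design point carried over from the abstract is that audio only enters training in stage 1, while stage 2 and inference operate on text alone.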

Supplemental Material

Appendix (PDF file)


Published In

ICMI '24: Proceedings of the 26th International Conference on Multimodal Interaction
November 2024, 725 pages
ISBN: 9798400704628
DOI: 10.1145/3678957
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 November 2024


Author Tags

  1. Informal Text Analysis
  2. Knowledge Distillation
  3. LLMs
  4. Multimodality
  5. Speech

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMI '24: International Conference on Multimodal Interaction
November 4 - 8, 2024
San Jose, Costa Rica

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%
