Abstract
Multi-modal DNNs have been shown to outperform the best uni-modal DNNs by fusing information from different modalities. However, this performance gain comes with a substantial increase in computational cost (e.g., network parameters and MAC operations) for handling the additional modalities, which ultimately makes multi-modal DNNs impractical for many real-world applications where computing capability is limited.
In this paper, we propose MMExit, a multi-modal exit architecture that selects the appropriate modalities and layers to compute when predicting results for different data samples. To this end, we define a novel metric called utility of exit (UoE), which measures the trade-off between performance and computational cost at each exit. We then use an equivalent modality serialization method to map the two-dimensional exit space onto an equivalent linear space and rank the exits by their UoE to achieve fast and adaptive inference. To train the MMExit network, we devise a joint loss function that synthesizes the features of different modalities and layers. Our results show that MMExit reduces MAC operations by up to 48.72% while achieving the best performance among SOTA multi-modal architectures.
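The abstract only names these mechanisms, so the following is a minimal Python sketch of how UoE-ranked exits could drive adaptive inference. The UoE ratio, the function names (utility_of_exit, serialize_exits, adaptive_inference), and the confidence-threshold exit policy are all illustrative assumptions, not the paper's exact definitions.

```python
# Illustrative sketch of UoE-ranked adaptive multi-modal inference.
# The UoE formula and exit policy below are assumptions; the paper's
# abstract does not give the exact definitions.
import torch

def utility_of_exit(accuracy: float, macs: float, alpha: float = 1.0) -> float:
    """Toy UoE proxy: reward accuracy, penalize computational cost.
    Stand-in for the paper's UoE metric, which couples performance
    with cost per exit."""
    return accuracy / (1.0 + alpha * macs)

def serialize_exits(exits):
    """Map the 2-D (modality, layer) exit grid onto one linear order,
    ranked by descending UoE, mimicking equivalent modality
    serialization. Each exit is a dict with 'acc' and 'macs' fields."""
    return sorted(exits,
                  key=lambda e: utility_of_exit(e["acc"], e["macs"]),
                  reverse=True)

@torch.no_grad()
def adaptive_inference(sample, exit_heads, threshold: float = 0.9):
    """Walk the serialized exits in order; stop at the first
    sufficiently confident prediction so easy samples consume fewer
    modalities and layers. Assumes a single sample (batch size 1) and
    at least one exit head; each head encapsulates the partial forward
    pass up to its exit point."""
    for head in exit_heads:
        logits = head(sample)                    # partial forward to this exit
        probs = torch.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:             # confident enough: exit early
            return pred.item(), head
    return pred.item(), head                     # fall back to the final exit
```

Under these assumptions, easy samples exit after cheap, high-UoE heads while hard samples fall through to exits that fuse more modalities; this data-dependent skipping is the behavior behind the reported MAC savings.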
Acknowledgements
This work is supported in part by the National Key R&D Program of China under grant No. 2021ZD0110104 and the National Natural Science Foundation of China under grant No. 62122053. It was also partially supported by ACCESS (AI Chip Center for Emerging Smart Systems), with InnoHK funding, Hong Kong SAR. We thank all the anonymous reviewers for their valuable feedback.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hou, X., et al. (2023). MMExit: Enabling Fast and Efficient Multi-modal DNN Inference with Adaptive Network Exits. In: Cano, J., Dikaiakos, M.D., Papadopoulos, G.A., Pericàs, M., Sakellariou, R. (eds.) Euro-Par 2023: Parallel Processing. Lecture Notes in Computer Science, vol. 14100. Springer, Cham. https://doi.org/10.1007/978-3-031-39698-4_29