Abstract
Methods of Explainable AI (XAI) are popular for understanding the features and decisions of neural networks. Transformers, applied to single modalities such as video, text, or signals as well as to multimodal data, are now state-of-the-art models for tasks such as classification, detection, and segmentation, as they generalize better than conventional CNNs. Feature selection driven by interpretability techniques is therefore an attractive way to train transformer models. This work proposes an interpretability method based on attention gradients that highlights important attention weights along the training iterations, guiding the transformer parameters toward a more optimal direction. We consider a multimodal transformer on multimodal data: video and sensors. The strategy, first studied on the video modality, is then applied to the sensor branch of the proposed multimodal transformer architecture before fusion. We show that late fusion via a combined loss over both modalities outperforms single-modality results. The target application is Multimedia in Health: detection of risk situations for frail adults in the @home environment from wearable video and sensor data (BIRDS dataset). We also benchmark our approach on the publicly available single-modality video dataset Kinetics-400, where it outperforms the state of the art.
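The sketch below is a rough, non-authoritative illustration of the two ingredients named in the abstract: (i) an attention-gradient relevance signal (attention times its gradient, clamped to positive contributions) used to re-weight attention during training, and (ii) late fusion of a video and a sensor branch trained with a combined loss. It is a minimal PyTorch sketch under our own assumptions; all class and function names (`GradWeightedSelfAttention`, `TwoStreamLateFusion`, `training_step`) are hypothetical and not taken from the paper.

```python
# Minimal sketch (assumed, not the paper's implementation): attention-gradient
# relevance re-weighting + late fusion of two modality branches with a combined loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradWeightedSelfAttention(nn.Module):
    """Self-attention that keeps its last attention map and gradient so that an
    attention-gradient relevance map can modulate the weights on a second pass."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.last_attn = None

    def forward(self, x, relevance=None):
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.n_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                  # (B, heads, N, head_dim) each
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        if relevance is not None:
            # Highlight important attention weights, then renormalize (one possible choice).
            attn = attn * relevance
            attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        attn.retain_grad()                                     # keep grad for relevance
        self.last_attn = attn
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)

    def relevance(self):
        # Attention x gradient, positive part only (Grad-CAM-like heuristic).
        if self.last_attn is None or self.last_attn.grad is None:
            return None
        return (self.last_attn * self.last_attn.grad).clamp(min=0).detach()


class TwoStreamLateFusion(nn.Module):
    """Hypothetical late-fusion model: one attention block and head per modality,
    fused at the logit level."""

    def __init__(self, video_dim, sensor_dim, n_classes):
        super().__init__()
        self.video_attn = GradWeightedSelfAttention(video_dim)
        self.sensor_attn = GradWeightedSelfAttention(sensor_dim)
        self.video_head = nn.Linear(video_dim, n_classes)
        self.sensor_head = nn.Linear(sensor_dim, n_classes)

    def forward(self, video_tokens, sensor_tokens, rel_v=None, rel_s=None):
        v = self.video_attn(video_tokens, rel_v).mean(dim=1)   # pooled video features
        s = self.sensor_attn(sensor_tokens, rel_s).mean(dim=1) # pooled sensor features
        logits_v, logits_s = self.video_head(v), self.sensor_head(s)
        return logits_v, logits_s, (logits_v + logits_s) / 2   # simple late fusion


def training_step(model, optimizer, video_tokens, sensor_tokens, labels):
    def combined_loss(lv, ls, lf):
        return (F.cross_entropy(lv, labels) + F.cross_entropy(ls, labels)
                + F.cross_entropy(lf, labels))

    # Pass 1: plain forward/backward to obtain attention gradients.
    optimizer.zero_grad()
    combined_loss(*model(video_tokens, sensor_tokens)).backward()
    rel_v, rel_s = model.video_attn.relevance(), model.sensor_attn.relevance()

    # Pass 2: forward with relevance-modulated attention, then update parameters.
    optimizer.zero_grad()
    loss = combined_loss(*model(video_tokens, sensor_tokens, rel_v, rel_s))
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the relevance map is recomputed every iteration from the first backward pass, and late fusion is a plain average of per-modality logits; both are simplifying assumptions made only to keep the example self-contained.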
Notes
1. BIRDS will be publicly available upon GDPR clearance.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Mallick, R., Benois-Pineau, J., Zemmari, A. (2024). IFI: Interpreting for Improving: A Multimodal Transformer with an Interpretability Technique for Recognition of Risk Events. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14557. Springer, Cham. https://doi.org/10.1007/978-3-031-53302-0_9
DOI: https://doi.org/10.1007/978-3-031-53302-0_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53301-3
Online ISBN: 978-3-031-53302-0
eBook Packages: Computer Science, Computer Science (R0)