ABSTRACT
In this paper we present our study on the use of attention for explaining video summarization. We build on a recent work that formulates the task, called XAI-SUM, and extend it by: a) taking into account two additional network architectures, and b) introducing two novel explanation signals that relate to the entropy and diversity of attention weights. In total, we examine the effectiveness of seven types of explanation, using three state-of-the-art attention-based network architectures (CA-SUM, VASNet, SUM-GDA) and two datasets (SumMe, TVSum) for video summarization. The conducted evaluations show that the inherent attention weights are more suitable for explaining network architectures that integrate mechanisms for estimating attentive diversity (SUM-GDA) and uniqueness (CA-SUM). The explanation of simpler architectures (VASNet) can benefit from taking into account estimates of the strength of the input vectors; another option is to consider the entropy of the attention weights.
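To make the two new explanation signals concrete, the following is a minimal illustrative sketch (not the authors' exact formulation) of how frame-level scores based on the entropy and the diversity of attention weights could be computed from a row-normalized attention matrix A, where A[i, j] denotes how strongly frame i attends to frame j:

```python
import numpy as np

def attention_entropy(A, eps=1e-12):
    """Per-frame signal: entropy of each frame's attention distribution
    over all frames (row-wise entropy of the attention matrix).

    A: (N, N) array with rows summing to 1 (softmax output).
    Returns an (N,) array; lower entropy means more focused attention.
    """
    A = np.clip(A, eps, 1.0)
    return -(A * np.log(A)).sum(axis=1)

def attention_diversity(A):
    """Per-frame signal: average cosine dissimilarity between a frame's
    attention-weight vector and the attention vectors of all other frames.

    Returns an (N,) array; higher values mean a more distinctive pattern.
    """
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    A_unit = A / np.maximum(norms, 1e-12)
    cos_sim = A_unit @ A_unit.T            # (N, N) pairwise cosine similarities
    N = A.shape[0]
    # exclude self-similarity (the diagonal, which equals 1) from the average
    return 1.0 - (cos_sim.sum(axis=1) - 1.0) / (N - 1)

# Toy usage: a random, row-normalized attention matrix for a 5-frame video
rng = np.random.default_rng(0)
A = rng.random((5, 5))
A = A / A.sum(axis=1, keepdims=True)
print(attention_entropy(A))
print(attention_diversity(A))
```

Under this reading, low row entropy indicates sharply focused attention, while a high diversity score indicates that a frame's attention pattern differs markedly from those of the remaining frames; both are candidate per-frame explanation signals derived solely from the attention weights.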
REFERENCES
- Sathyanarayanan N. Aakur, Fillipe D. M. de Souza, and Sudeep Sarkar. 2018. An Inherently Explainable Model for Video Activity Interpretation. In The Workshops of the 32nd AAAI Conf. on Artificial Intelligence.
- Evlampios Apostolidis, Eleni Adamantidou, Alexandros I. Metsai, Vasileios Mezaris, and Ioannis Patras. 2021a. Video Summarization Using Deep Neural Networks: A Survey. Proc. IEEE, Vol. 109, 11 (2021), 1838--1863. https://doi.org/10.1109/JPROC.2021.3117472
- Evlampios Apostolidis, Georgios Balaouras, Vasileios Mezaris, and Ioannis Patras. 2021b. Combining Global and Local Attention with Positional Encoding for Video Summarization. In 2021 IEEE Int. Symposium on Multimedia (ISM). 226--234. https://doi.org/10.1109/ISM52913.2021.00045
- Evlampios Apostolidis, Georgios Balaouras, Vasileios Mezaris, and Ioannis Patras. 2022a. Explaining video summarization based on the focus of attention. In 2022 IEEE Int. Symposium on Multimedia (ISM). 146--150. https://doi.org/10.1109/ISM55400.2022.00029
- Evlampios Apostolidis, Georgios Balaouras, Vasileios Mezaris, and Ioannis Patras. 2022b. Summarizing Videos Using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames. In Proc. of the 2022 Int. Conf. on Multimedia Retrieval (Newark, NJ, USA) (ICMR '22). Association for Computing Machinery, New York, NY, USA, 407--415. https://doi.org/10.1145/3512527.3531404
- Sarah Adel Bargal, Andrea Zunino, Donghyun Kim, Jianming Zhang, Vittorio Murino, and Stan Sclaroff. 2018. Excitation Backprop for RNNs. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
- George Chrysostomou and Nikolaos Aletras. 2021. Improving the Faithfulness of Attention-based Explanations with Task-specific Information for Text Classification. In Proc. of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int. Joint Conf. on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 477--488. https://doi.org/10.18653/v1/2021.acl-long.40
- George Chrysostomou and Nikolaos Aletras. 2022. An Empirical Study on Explanations in Out-of-Domain Settings. In Proc. of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 6920--6938. https://doi.org/10.18653/v1/2022.acl-long.477
- Chrysa Collyda, Konstantinos Apostolidis, Evlampios Apostolidis, Eleni Adamantidou, Alexandros I. Metsai, and Vasileios Mezaris. 2020. A Web Service for Video Summarization. In ACM Int. Conf. on Interactive Media Experiences (Cornella, Barcelona, Spain) (IMX '20). Association for Computing Machinery, New York, NY, USA, 148--153. https://doi.org/10.1145/3391614.3399391
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 248--255. https://doi.org/10.1109/CVPR.2009.5206848
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proc. of the 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2--7, 2019, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, 4171--4186. https://doi.org/10.18653/v1/n19-1423
- Jiri Fajtl, Hajar Sadeghi Sokeh, Vasileios Argyriou, Dorothy Monekosso, and Paolo Remagnino. 2019. Summarizing Videos with Attention. In Asian Conf. on Computer Vision (ACCV) 2018 Workshops, Gustavo Carneiro and Shaodi You (Eds.). Springer International Publishing, Cham, 39--54.
- Nikolaos Gkalelis, Dimitrios Daskalakis, and Vasileios Mezaris. 2022. ViGAT: Bottom-Up Event Recognition and Explanation in Video Using Factorized Graph Attention Network. IEEE Access, Vol. 10 (2022), 108797--108816. https://doi.org/10.1109/ACCESS.2022.3213652
- Ioanna Gkartzonika, Nikolaos Gkalelis, and Vasileios Mezaris. 2023. Learning Visual Explanations for DCNN-Based Image Classifiers Using an Attention Mechanism. In Computer Vision -- ECCV 2022 Workshops, Leonid Karlinsky, Tomer Michaeli, and Ko Nishino (Eds.). Springer Nature Switzerland, Cham, 396--411.
- Michael Gygli, Helmut Grabner, Hayko Riemenschneider, and Luc Van Gool. 2014. Creating Summaries from User Videos. In Europ. Conf. on Computer Vision (ECCV) 2014, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer International Publishing, Cham, 505--520. https://gyglim.github.io/me/
- Yamin Han, Tao Zhuo, Peng Zhang, Wei Huang, Yufei Zha, Yanning Zhang, and Mohan Kankanhalli. 2022. One-shot Video Graph Generation for Explainable Action Reasoning. Neurocomputing, Vol. 488 (2022), 212--225. https://doi.org/10.1016/j.neucom.2022.02.069
- Yaru Hao, Li Dong, Furu Wei, and Ke Xu. 2021. Self-Attention Attribution: Interpreting Information Interactions Inside Transformer. Proc. of the AAAI Conf. on Artificial Intelligence, Vol. 35, 14 (May 2021), 12963--12971. https://doi.org/10.1609/aaai.v35i14.17533
- Sarthak Jain and Byron C. Wallace. 2019. Attention is not Explanation. In Proc. of the 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 3543--3556. https://doi.org/10.18653/v1/N19-1357
- Maurice G. Kendall. 1945. The treatment of ties in ranking problems. Biometrika, Vol. 33, 3 (1945), 239--251.
- Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. 2020. Attention is Not Only a Weight: Analyzing Transformers with Vector Norms. In Proc. of the 2020 Conf. on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 7057--7075. https://doi.org/10.18653/v1/2020.emnlp-main.574
- Stephen Kokoska and Daniel Zwillinger. 2000. CRC standard probability and statistics tables and formulae. CRC Press.
- Liangzhi Li, Bowen Wang, Manisha Verma, Yuta Nakashima, Ryo Kawasaki, and Hajime Nagahara. 2021b. SCOUTER: Slot Attention-based Classifier for Explainable Image Recognition. In 2021 IEEE/CVF Int. Conf. on Computer Vision (ICCV). 1026--1035. https://doi.org/10.1109/ICCV48922.2021.00108
- Ping Li, Qinghao Ye, Luming Zhang, Li Yuan, Xianghua Xu, and Ling Shao. 2021c. Exploring global diverse attention via pairwise temporal relation for video summarization. Pattern Recognition, Vol. 111 (2021), 107677. https://doi.org/10.1016/j.patcog.2020.107677
- Zhenqiang Li, Weimin Wang, Zuoyue Li, Yifei Huang, and Yoichi Sato. 2021a. Towards Visually Explaining Video Understanding Networks with Perturbation. In 2021 IEEE Winter Conf. on Applications of Computer Vision (WACV). 1119--1128.
- Yibing Liu, Haoliang Li, Yangyang Guo, Chenqi Kong, Jing Li, and Shiqi Wang. 2022. Rethinking Attention-Model Explainability through Faithfulness Violation Test. In Proc. of the 39th Int. Conf. on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.). PMLR, 13807--13824. https://proceedings.mlr.press/v162/liu22i.html
- Joonatan Mänttäri, Sofia Broomé, John Folkesson, and Hedvig Kjellström. 2020. Interpreting Video Features: A Comparison of 3D Convolutional Networks and Convolutional LSTM Networks. In Asian Conf. on Computer Vision (ACCV) 2020, Hiroshi Ishikawa, Cheng-Lin Liu, Tomas Pajdla, and Jianbo Shi (Eds.). Springer International Publishing, Cham, 411--426.
- Medhini Narasimhan, Anna Rohrbach, and Trevor Darrell. 2021. CLIP-It! Language-Guided Video Summarization. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 13988--14000. https://proceedings.neurips.cc/paper/2021/file/7503cfacd12053d309b6bed5c89de212-Paper.pdf
- Mariano Ntrougkas, Nikolaos Gkalelis, and Vasileios Mezaris. 2022. TAME: Attention Mechanism Based Feature Fusion for Generating Explanation Maps of Convolutional Neural Networks. In 2022 IEEE Int. Symposium on Multimedia (ISM). 58--65. https://doi.org/10.1109/ISM55400.2022.00014
- Konstantinos E. Papoutsakis and Antonis A. Argyros. 2019. Unsupervised and Explainable Assessment of Video Similarity. In British Machine Vision Conference (BMVC). https://api.semanticscholar.org/CorpusID:199525379
- Zhao Ren, Kun Qian, Fengquan Dong, Zhenyu Dai, Wolfgang Nejdl, Yoshiharu Yamamoto, and Björn W. Schuller. 2022. Deep attention-based neural networks for explainable heart sound classification. Machine Learning with Applications, Vol. 9 (2022), 100322. https://doi.org/10.1016/j.mlwa.2022.100322
- Chiradeep Roy, Mahesh Shanbhag, Mahsan Nourani, Tahrima Rahman, Samia Kabir, Vibhav Gogate, Nicholas Ruozzi, and Eric D. Ragan. 2019. Explainable Activity Recognition in Videos. In ACM Intelligent User Interfaces (IUI) Workshops.
- Sofia Serrano and Noah A. Smith. 2019. Is Attention Interpretable? In Proc. of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 2931--2951. https://doi.org/10.18653/v1/P19-1282
- Yale Song, Jordi Vallmitjana, Amanda Stent, and Alejandro Jaimes. 2015. TVSum: Summarizing web videos using titles. In 2015 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 5179--5187. https://doi.org/10.1109/CVPR.2015.7299154
- Alexandros Stergiou, Georgios Kapidis, Grigorios Kalliatakis, Christos Chrysoulas, Remco Veltkamp, and Ronald Poppe. 2019. Saliency Tubes: Visual Explanations for Spatio-Temporal Convolutions. In 2019 IEEE Int. Conf. on Image Processing (ICIP). 1830--1834. https://doi.org/10.1109/ICIP.2019.8803153
- Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In 2015 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 1--9. https://doi.org/10.1109/CVPR.2015.7298594
- Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not Explanation. In Proc. of the 2019 Conf. on Empirical Methods in Natural Language Processing and the 9th Int. Joint Conf. on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 11--20. https://doi.org/10.18653/v1/D19-1002
- Chongke Wu, Sicong Shao, Pratik Satam, and Salim Hariri. 2022. An explainable and efficient deep learning framework for video anomaly detection. Cluster Computing, Vol. 25, 4 (Aug. 2022), 2715--2737. https://doi.org/10.1007/s10586-021-03439-5
- Hongyuan Yu, Yan Huang, Lihong Pi, Chengquan Zhang, Xuan Li, and Liang Wang. 2021. End-to-end video text detection with online tracking. Pattern Recognition, Vol. 113 (2021), 107791. https://doi.org/10.1016/j.patcog.2020.107791
- Kunpeng Zhang and Li Li. 2022. Explainable multimodal trajectory prediction using attention models. Transportation Research Part C: Emerging Technologies, Vol. 143 (2022), 103829. https://doi.org/10.1016/j.trc.2022.103829
- Tao Zhuo, Zhiyong Cheng, Peng Zhang, Yongkang Wong, and Mohan Kankanhalli. 2019. Explainable Video Action Reasoning via Prior Knowledge and State Transitions. In Proc. of the 27th ACM Int. Conf. on Multimedia (Nice, France) (MM '19). Association for Computing Machinery, New York, NY, USA, 521--529. https://doi.org/10.1145/3343031.3351040