Abstract
Video summarization compresses videos while preserving the content that is most meaningful to users. Most existing methods are image-based and rely on visual cues alone to select keyframes. However, videos also carry useful audio information. In this paper, we propose a novel attention-based audio-visual fusion framework that integrates audio information with visual information. The framework has two key components: an asymmetrical self-attention mechanism and odd-even attention. The asymmetrical self-attention mechanism accounts for the fact that visual information is more strongly related to video summarization than audio information. The odd-even attention reduces the memory requirements. We also create ViAu-SumMe, an audio-visual dataset built on the SumMe dataset. Experimental results on this dataset show that the proposed method outperforms state-of-the-art methods.
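The abstract does not say how odd-even attention is computed. As an illustration only, the sketch below assumes one plausible reading: frames at even and odd positions attend only within their own group, so the number of pairwise attention scores drops from T^2 to roughly T^2/2. The function names (`self_attention`, `odd_even_attention`, `num_scores_full`, `num_scores_odd_even`) and the plain dot-product attention with queries, keys, and values all equal to the input are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention with queries = keys = values = frames.

    ASSUMPTION: no learned projections; this is only a stand-in attention.
    """
    if not frames:
        return []
    dim = len(frames[0])
    scale = math.sqrt(dim)
    out = []
    for q in frames:
        # One score per (query, key) pair: len(frames) ** 2 scores in total.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in frames]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, frames))
                    for d in range(dim)])
    return out


def odd_even_attention(frames):
    """ASSUMED reading of 'odd-even attention': attend within the even-indexed
    frames and within the odd-indexed frames separately, then interleave."""
    even = self_attention(frames[0::2])
    odd = self_attention(frames[1::2])
    merged = [None] * len(frames)
    merged[0::2] = even
    merged[1::2] = odd
    return merged


def num_scores_full(t):
    """Pairwise scores needed by full self-attention over t frames."""
    return t * t


def num_scores_odd_even(t):
    """Pairwise scores needed when the two groups are attended separately."""
    n_even = (t + 1) // 2
    n_odd = t // 2
    return n_even ** 2 + n_odd ** 2


if __name__ == "__main__":
    frames = [[float(i), 1.0] for i in range(6)]  # six toy 2-d frame features
    fused = odd_even_attention(frames)
    print(len(fused), num_scores_full(6), num_scores_odd_even(6))
```

Under this assumption, each group's score matrix is about a quarter the size of the full one, and the two groups together need about half the scores. Whether the paper's mechanism matches this split is not stated in the abstract.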
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Fang, Y., Zhang, J., Lu, C. (2019). Attention-Based Audio-Visual Fusion for Video Summarization. In: Gedeon, T., Wong, K., Lee, M. (eds) Neural Information Processing. ICONIP 2019. Lecture Notes in Computer Science, vol 11954. Springer, Cham. https://doi.org/10.1007/978-3-030-36711-4_28
DOI: https://doi.org/10.1007/978-3-030-36711-4_28
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-36710-7
Online ISBN: 978-3-030-36711-4
eBook Packages: Computer Science (R0)