Abstract
Micro-videos are currently a popular content form on many multimedia platforms. The venue information attached to micro-videos benefits venue-related applications such as personalized location recommendation and venue recognition. However, existing approaches to micro-video venue classification achieve limited performance because they ignore the global dependencies among features. To address this, an enhanced non-local (ENL) module is devised to improve the expressiveness of features. Furthermore, this paper proposes an attention-enhanced joint learning model that generates discriminative venue representations in an end-to-end manner. The unified model consists of normalized NeXtVLAD (NNeXtVLAD) modules, the ENL module, a CNN layer, and a context gate. Specifically, the sequential features extracted from multiple modalities are aggregated into compact vectors via parallel NNeXtVLAD modules. In the ENL module, the interactions between any two positions of the aggregated features are captured to reinforce the valuable information in each modality, and enhanced channel information is adaptively added for further feature enhancement. A CNN layer then fuses the enhanced features of the multiple modalities; an effective activation function within this layer is also explored to achieve better performance. Finally, the context gate dynamically models the relationships between features and venue categories for prediction. Experimental results on a public dataset show that the proposed micro-video venue classification scheme achieves state-of-the-art performance.
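To make the two attention mechanisms named above concrete, the sketch below shows (in plain Python, with no learned embedding projections and toy dimensions) the generic form of a non-local operation, where every position attends to every other position via softmax-normalized dot products, and of context gating, which re-weights features with a learned sigmoid gate. This is a minimal illustration of the general techniques, not the paper's actual ENL module or trained model; the identity weight matrix and zero bias in the toy run are assumptions for demonstration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def non_local(x):
    # Generic non-local (self-attention) op over a list of feature vectors:
    # for each position i, compute dot-product similarities to every position j,
    # normalize them with a softmax, and add the attention-weighted sum of all
    # positions back to x[i] as a residual. Embedding projections are omitted.
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        scores = [sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
        w = softmax(scores)
        attended = [sum(w[j] * x[j][k] for j in range(n)) for k in range(d)]
        out.append([x[i][k] + attended[k] for k in range(d)])
    return out

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def context_gate(x, W, b):
    # Context gating: y = x * sigmoid(W x + b), elementwise; the learned gate
    # re-weights each feature dimension before classification.
    g = [sigmoid(sum(W[i][j] * x[j] for j in range(len(x))) + b[i])
         for i in range(len(x))]
    return [xi * gi for xi, gi in zip(x, g)]

# Toy run: two 2-D "positions" pass through the non-local op; the first
# output is then gated with an identity projection and zero bias.
feats = non_local([[1.0, 0.0], [0.0, 1.0]])
gated = context_gate(feats[0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In a real model both operations act on high-dimensional modality features and the projections `W`, `b` (and the omitted attention embeddings) are learned end-to-end; the sigmoid gate always shrinks each feature toward zero by a factor in (0, 1).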
Acknowledgements
The authors would like to thank the reviewers for their valuable comments. This research is supported by the National Key Research and Development Program of China (No. 2019YFB1406201, No. 2020YFB1406800), the National Natural Science Foundation of China (Grant No. 62071434), and the Fundamental Research Funds for the Central Universities (Grant No. CUC21GZ010, CUC210B017, CUC22GZ065).
Ethics declarations
Conflicts of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, B., Huang, X., Cao, G. et al. Attention-enhanced joint learning network for micro-video venue classification. Multimed Tools Appl 83, 12425–12443 (2024). https://doi.org/10.1007/s11042-023-15699-x