
Attention-enhanced joint learning network for micro-video venue classification

  • Published in: Multimedia Tools and Applications

Abstract

Micro-videos are currently a popular content form on various multimedia platforms. The venue information of micro-videos benefits venue-related applications such as personalized location recommendation and venue recognition. However, the performance of existing micro-video venue classification methods is limited because they ignore the global dependencies among features. To this end, an enhanced non-local (ENL) module is devised to improve the expressiveness of features. Furthermore, this paper proposes an attention-enhanced joint learning model that generates discriminative venue representations in an end-to-end manner. The unified model consists of normalized NeXtVLAD (NNeXtVLAD) modules, the ENL module, a CNN layer, and a context gate. Specifically, the sequential features extracted from multiple modalities are aggregated into compact vectors via parallel NNeXtVLAD modules. In the ENL module, the interactions between any two positions of the aggregated features are captured to reinforce the valuable information in the multiple modalities, and the enhanced channel information is adaptively added for further feature enhancement. A CNN layer is then applied to fuse the enhanced features of the multiple modalities, and an effective activation function is explored in this layer to achieve better performance. Finally, the context gating method is used to dynamically model the relationships between features and venue categories for prediction. Experimental results on a public dataset show that the proposed micro-video venue classification scheme achieves state-of-the-art performance.
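To make the pipeline concrete, below is a minimal, self-contained PyTorch sketch of the four stages named in the abstract: per-modality NNeXtVLAD aggregation, ENL enhancement, CNN fusion, and context gating. All layer sizes, the simplified NNeXtVLAD and ENL internals, and the choice to treat the stacked per-modality vectors as the positions attended over are assumptions made for this illustration, not the authors' released implementation.

```python
# Illustrative sketch only; sizes and module internals are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NNeXtVLAD(nn.Module):
    """Simplified normalized NeXtVLAD: soft-assign frames to clusters,
    aggregate residuals against learned centers, then L2-normalize."""
    def __init__(self, dim, clusters=8):
        super().__init__()
        self.assign = nn.Linear(dim, clusters)              # soft assignments
        self.centers = nn.Parameter(torch.randn(clusters, dim))

    def forward(self, x):                                   # x: (B, T, D)
        a = F.softmax(self.assign(x), dim=-1)               # (B, T, K)
        vlad = torch.einsum('btk,btd->bkd', a, x) \
             - a.sum(dim=1).unsqueeze(-1) * self.centers    # residual aggregation
        vlad = F.normalize(vlad, dim=-1)                    # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)         # (B, K*D)

class ENL(nn.Module):
    """Enhanced non-local block: pairwise interactions between any two
    positions, plus an adaptively added channel-enhancement branch."""
    def __init__(self, c):
        super().__init__()
        self.theta, self.phi, self.g = (nn.Conv1d(c, c // 2, 1) for _ in range(3))
        self.out = nn.Conv1d(c // 2, c, 1)
        self.channel = nn.Sequential(nn.Linear(c, c // 4), nn.ReLU(),
                                     nn.Linear(c // 4, c), nn.Sigmoid())

    def forward(self, x):                                   # x: (B, C, N)
        attn = torch.softmax(
            self.theta(x).transpose(1, 2) @ self.phi(x), dim=-1)  # (B, N, N)
        y = x + self.out(self.g(x) @ attn.transpose(1, 2))  # non-local residual
        w = self.channel(y.mean(dim=-1))                    # global channel gate
        return y + y * w.unsqueeze(-1)                      # add channel info

class VenueNet(nn.Module):
    """Per-modality NNeXtVLAD -> ENL -> 1-D CNN fusion -> context gate."""
    def __init__(self, dims, num_venues, clusters=8, hidden=256):
        super().__init__()
        self.vlads = nn.ModuleList(NNeXtVLAD(d, clusters) for d in dims)
        self.projs = nn.ModuleList(nn.Linear(d * clusters, hidden) for d in dims)
        self.enl = ENL(hidden)
        self.fuse = nn.Conv1d(hidden, hidden, kernel_size=len(dims))
        self.gate = nn.Linear(hidden, hidden)               # context gating
        self.cls = nn.Linear(hidden, num_venues)

    def forward(self, modalities):                          # list of (B, T_m, D_m)
        h = torch.stack([p(v(x)) for v, p, x in
                         zip(self.vlads, self.projs, modalities)], dim=-1)  # (B, H, M)
        h = F.relu(self.fuse(self.enl(h))).squeeze(-1)      # fuse modalities -> (B, H)
        h = torch.sigmoid(self.gate(h)) * h                 # context gating
        return self.cls(h)                                  # venue logits

# Usage: visual/acoustic/textual streams with different lengths and dims
# (all shapes and the category count here are illustrative).
model = VenueNet(dims=[128, 64, 32], num_venues=100)
v, a, t = torch.randn(2, 30, 128), torch.randn(2, 20, 64), torch.randn(2, 10, 32)
print(model([v, a, t]).shape)                               # torch.Size([2, 100])
```

One design note on the sketch: the Conv1d with kernel size equal to the number of modalities collapses the modality axis in a single step, keeping the fusion stage to one CNN layer as the abstract describes; the ReLU there stands in for whichever activation function the paper found most effective.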




Acknowledgements

The authors would like to thank the reviewers for their valuable comments. This research is supported by the National Key Research and Development Program of China (No. 2019YFB1406201, No. 2020YFB1406800), the National Natural Science Foundation of China (Grant No. 62071434), and the Fundamental Research Funds for the Central Universities (Grant Nos. CUC21GZ010, CUC210B017, CUC22GZ065).

Author information


Corresponding author

Correspondence to Gang Cao.

Ethics declarations

Conflicts of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, B., Huang, X., Cao, G. et al. Attention-enhanced joint learning network for micro-video venue classification. Multimed Tools Appl 83, 12425–12443 (2024). https://doi.org/10.1007/s11042-023-15699-x

