
Multimodal Attentive Representation Learning for Micro-video Multi-label Classification

Published: 8 March 2024

Abstract

As one of the representative types of user-generated content (UGC) on social platforms, micro-videos have become increasingly popular in daily life. Although micro-videos naturally exhibit multimodal features rich enough to support representation learning, the complex correlations across modalities make valuable information difficult to integrate. In this paper, we introduce a multimodal attentive representation network (MARNET) to learn complete and robust representations that benefit micro-video multi-label classification. To address the common missing-modality issue, we present a multimodal information aggregation module that integrates multimodal information, obtaining latent common representations by modeling complementarity and consistency over visual-centered modality groupings rather than single modalities. For the label correlation issue, we design an attentive graph neural network module that adaptively learns the label correlation matrix and label representations for better compatibility with the training data. In addition, a cross-modal multi-head attention module makes the learned common representations label-aware for multi-label classification. Experiments on two micro-video datasets demonstrate the superior performance of MARNET compared with state-of-the-art methods.
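The label-aware cross-modal attention described in the abstract can be sketched in miniature: label embeddings act as queries that attend over modality features, producing one label-aware representation per label. The following is an illustrative single-head, pure-Python sketch under simplifying assumptions (no learned projection matrices; the function and variable names are ours, not from the paper):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_modal_attention(label_queries, modal_keys, modal_values):
    """Each label query attends over modality features (keys/values);
    returns one label-aware fused representation per label.
    Single head, no projections -- an illustration, not the paper's model."""
    d = len(modal_keys[0])
    outputs = []
    for q in label_queries:
        # Scaled dot-product attention weights over the modality features.
        scores = softmax([dot(q, k) / math.sqrt(d) for k in modal_keys])
        # Weighted sum of value vectors gives the label-aware representation.
        fused = [sum(w * v[i] for w, v in zip(scores, modal_values))
                 for i in range(len(modal_values[0]))]
        outputs.append(fused)
    return outputs
```

In the full model, multiple such heads run in parallel with learned query/key/value projections, and the queries come from the label representations produced by the attentive graph neural network module.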


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 6 (June 2024), 715 pages.
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3613638
Editor: Abdulmotaleb El Saddik


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 2 June 2023
• Revised: 4 January 2024
• Accepted: 27 January 2024
• Online AM: 6 February 2024
• Published: 8 March 2024
