Abstract
As one of the representative types of user-generated content (UGC) on social platforms, micro-videos have become popular in daily life. Although micro-videos naturally exhibit multimodal features that are rich enough to support representation learning, the complex correlations across modalities make the valuable information difficult to integrate. In this paper, we introduce a multimodal attentive representation network (MARNET) that learns complete and robust representations for micro-video multi-label classification. To address the common problem of missing modalities, we present a multimodal information aggregation mechanism that integrates multimodal information, in which latent common representations are obtained by modeling complementarity and consistency over visual-centered modality groupings rather than single modalities. To address label correlation, we design an attentive graph neural network module that adaptively learns the correlation matrix and the label representations so that they are better matched to the training data. In addition, a cross-modal multi-head attention module makes the learned common representations label-aware for multi-label classification. Experiments on two micro-video datasets show that MARNET outperforms state-of-the-art methods.
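The abstract names the modules but gives no equations, so the following is only a minimal NumPy sketch of how the last two ideas might fit together, not MARNET's actual formulation. It assumes that the label correlation matrix is a row-normalized attention map over label embeddings, that label embeddings act as queries over the common representations of the modality groups in scaled dot-product multi-head attention, and that each label is scored by an inner product. All names, dimensions, the number of modality groups, and the random weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def label_correlation(label_emb, w):
    """Assumed form: attention-style label correlation matrix (rows sum to 1)."""
    proj = label_emb @ w                                  # (L, d)
    scores = proj @ proj.T / np.sqrt(proj.shape[1])       # (L, L)
    return softmax(scores, axis=-1)

def label_aware_attention(label_emb, common, wq, wk, wv, num_heads):
    """Assumed form: labels (queries) attend over common representations (keys/values)."""
    num_labels, dim = label_emb.shape
    num_groups = common.shape[0]
    dh = dim // num_heads
    # Split projections into heads: (heads, items, dh)
    q = (label_emb @ wq).reshape(num_labels, num_heads, dh).transpose(1, 0, 2)
    k = (common @ wk).reshape(num_groups, num_heads, dh).transpose(1, 0, 2)
    v = (common @ wv).reshape(num_groups, num_heads, dh).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh), axis=-1)  # (heads, L, G)
    out = (attn @ v).transpose(1, 0, 2).reshape(num_labels, dim)     # (L, d)
    return out, attn

rng = np.random.default_rng(0)
num_labels, num_groups, dim, heads = 5, 3, 8, 2        # hypothetical sizes
label_emb = rng.normal(size=(num_labels, dim))         # stand-in label embeddings
common = rng.normal(size=(num_groups, dim))            # stand-in common representations
w_corr, wq, wk, wv = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(4))

corr = label_correlation(label_emb, w_corr)
refined_labels = corr @ label_emb                      # one propagation step over the label graph
label_feats, attn = label_aware_attention(refined_labels, common, wq, wk, wv, heads)
logits = (label_feats * refined_labels).sum(axis=1)    # one score per label
probs = 1.0 / (1.0 + np.exp(-logits))                  # independent sigmoid per label
print(probs.shape)
```

In a trained model the projection matrices would be learned jointly with the aggregation module. The per-label sigmoid reflects the multi-label setting, where several labels can be active for the same micro-video.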
Index Terms
- Multimodal Attentive Representation Learning for Micro-video Multi-label Classification