research-article

Multimodal Attentive Representation Learning for Micro-video Multi-label Classification

Authors:
Peiguang Jing, Tianjin University, Tianjin, China (ORCID: 0000-0003-2648-7358)
Xianyi Liu, Tianjin University, Tianjin, China (ORCID: 0000-0001-6284-9470)
Lijuan Zhang, Tianjin University, Tianjin, China (ORCID: 0009-0000-1892-9617)
Yun Li, Guangxi University of Finance and Economics, Nanning, China (ORCID: 0000-0002-5784-1877)
Yu Liu, Tianjin University, Tianjin, China (ORCID: 0000-0002-5949-6587)
Yuting Su, Tianjin University, Tianjin, China (ORCID: 0000-0001-5165-204X)

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 6, Article No.: 182, pp 1–23, https://doi.org/10.1145/3643888

Published: 08 March 2024

Abstract

As a representative type of user-generated content (UGC) on social platforms, micro-videos have become popular in daily life. Although micro-videos naturally exhibit multimodal features that are rich enough to support representation learning, the complex correlations across modalities make valuable information difficult to integrate. In this paper, we introduce a multimodal attentive representation network (MARNET) that learns complete and robust representations for micro-video multi-label classification. To address the common problem of missing modalities, we present a multimodal information aggregation module, in which latent common representations are obtained by modeling complementarity and consistency over visual-centered modality groupings rather than single modalities. To address label correlation, we design an attentive graph neural network module that adaptively learns the correlation matrix and the label representations for better compatibility with the training data. In addition, we develop a cross-modal multi-head attention module that makes the learned common representations label-aware for multi-label classification. Experiments on two micro-video datasets demonstrate the superior performance of MARNET compared with state-of-the-art methods.
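The abstract does not give the equations of the attention module, so the following is only a minimal sketch of generic scaled dot-product multi-head attention, not MARNET's implementation. It assumes that label embeddings act as queries and the common representations of the modality groupings act as keys and values, which is one plausible way to make representations label-aware. The function name `label_aware_attention`, the array shapes, and the random projection matrices (standing in for learned weights) are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def label_aware_attention(labels, common, num_heads, rng):
    """Generic multi-head attention (assumed setup, not the paper's module).

    labels: (L, d) label embeddings, used as queries.
    common: (G, d) common representations, used as keys and values.
    Returns the (L, d) label-aware output and the (H, L, G) attention weights.
    """
    num_labels, d = labels.shape
    assert d % num_heads == 0, "d must be divisible by num_heads"
    d_h = d // num_heads

    # Random projections stand in for weights that would be learned.
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)
    )

    # Project, then split the feature axis into heads: (H, rows, d_h).
    q = (labels @ w_q).reshape(num_labels, num_heads, d_h).transpose(1, 0, 2)
    k = (common @ w_k).reshape(-1, num_heads, d_h).transpose(1, 0, 2)
    v = (common @ w_v).reshape(-1, num_heads, d_h).transpose(1, 0, 2)

    # Scaled dot-product attention per head: each label attends over groupings.
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_h), axis=-1)
    heads = weights @ v  # (H, L, d_h)

    # Concatenate the heads and apply the output projection.
    out = heads.transpose(1, 0, 2).reshape(num_labels, d) @ w_o
    return out, weights
```

Calling it with, say, five label embeddings and three grouping representations of dimension 8 yields one output vector per label, and each label's attention weights over the groupings sum to one within every head.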

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Activity recognition and understanding

Published in
ACM Transactions on Multimedia Computing, Communications, and Applications Volume 20, Issue 6
June 2024
715 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3613638
Editor:
Abdulmotaleb El Saddik
Mohamed Bin Zayed University of Artificial Intelligence, UAE and University of Ottawa, Canada
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 8 March 2024
- Online AM: 6 February 2024
- Accepted: 27 January 2024
- Revised: 4 January 2024
- Received: 2 June 2023
Author Tags
Micro-video
multimodal representations
multi-label
graph network