DOI: 10.1145/3240508.3240614

Learning Joint Multimodal Representation with Adversarial Attention Networks

Published: 15 October 2018

Abstract

Learning a joint representation for multimodal data (e.g., data containing both visual content and a text description) has recently attracted extensive research interest. The features of different modalities are usually correlated and complementary, so a joint representation that captures this correlation is more effective than any subset of the features. However, most existing multimodal representation learning methods lack additional constraints that would enhance the robustness of the learned representations. In this paper, a novel Adversarial Attention Networks (AAN) model is proposed that combines an attention mechanism with adversarial networks for effective and robust multimodal representation learning. Specifically, a visual-semantic attention model with a siamese learning strategy is proposed to encode the fine-grained correlation between the visual and textual modalities. Meanwhile, an adversarial learning model is employed to regularize the generated representation by matching its posterior distribution to given priors. The two modules are then incorporated into an integrated learning framework to learn the joint multimodal representation. Experimental results on two tasks, multi-label classification and tag recommendation, show that the proposed model outperforms state-of-the-art representation learning methods.
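The abstract describes two coupled components: a visual-semantic attention model that aligns image regions with words, and an adversarial regularizer that matches the posterior distribution of the joint representation to a chosen prior. The following is a minimal PyTorch-style sketch of how such pieces can fit together; all module names, dimensions, and the Gaussian prior are illustrative assumptions, not the authors' implementation (which also uses a siamese learning strategy not shown here).

```python
# Sketch of (1) a visual-semantic attention encoder and (2) an adversarial
# regularizer on the joint representation. Hypothetical names/dimensions.
import torch
import torch.nn as nn

class VisualSemanticAttention(nn.Module):
    def __init__(self, img_dim=512, txt_dim=300, joint_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, regions, words):
        # regions: (B, R, img_dim) image region features; words: (B, W, txt_dim) word embeddings
        v = self.img_proj(regions)                # (B, R, joint_dim)
        t = self.txt_proj(words)                  # (B, W, joint_dim)
        scores = torch.bmm(v, t.transpose(1, 2))  # (B, R, W) region-word affinities
        attn = scores.softmax(dim=1)              # attend over regions for each word
        attended_v = torch.bmm(attn.transpose(1, 2), v)  # (B, W, joint_dim)
        # fuse into a single joint representation by mean pooling over words
        return (attended_v + t).mean(dim=1)       # (B, joint_dim)

class Discriminator(nn.Module):
    """Distinguishes joint representations from samples drawn from the prior."""
    def __init__(self, joint_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(joint_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, z):
        return self.net(z)  # raw logits; paired with BCEWithLogitsLoss below

# One adversarial step in the adversarial-autoencoder style the abstract alludes to:
if __name__ == "__main__":
    enc, disc = VisualSemanticAttention(), Discriminator()
    bce = nn.BCEWithLogitsLoss()
    regions, words = torch.randn(8, 36, 512), torch.randn(8, 12, 300)
    z = enc(regions, words)                       # joint multimodal representation
    z_prior = torch.randn_like(z)                 # e.g. a standard Gaussian prior (assumption)
    d_loss = bce(disc(z_prior), torch.ones(8, 1)) + bce(disc(z.detach()), torch.zeros(8, 1))
    g_loss = bce(disc(z), torch.ones(8, 1))       # encoder tries to fool the discriminator
    print(d_loss.item(), g_loss.item())
```

In practice the encoder and discriminator losses would be optimized alternately, with a task loss (e.g., multi-label classification or tag recommendation) applied to the joint representation z.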




Published In

MM '18: Proceedings of the 26th ACM international conference on Multimedia
October 2018
2167 pages
ISBN:9781450356657
DOI:10.1145/3240508


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 October 2018


Author Tags

  1. adversarial networks
  2. attention model
  3. multimodal
  4. representation learning
  5. siamese learning

Qualifiers

  • Research-article

Funding Sources

  • State Key Laboratory of Software Development Environment
  • National Natural Science Foundation of China
  • Beijing Natural Science Foundation of China

Conference

MM '18: ACM Multimedia Conference
October 22 - 26, 2018
Seoul, Republic of Korea

Acceptance Rates

MM '18 paper acceptance rate: 209 of 757 submissions (28%).
Overall acceptance rate: 2,145 of 8,556 submissions (25%).


Cited By

  • (2023) Self-Contained Entity Discovery from Captioned Videos. ACM Transactions on Multimedia Computing, Communications, and Applications. https://doi.org/10.1145/3583138. Online publication date: 7-Feb-2023.
  • (2023) Cross-Lingual Text Image Recognition via Multi-Hierarchy Cross-Modal Mimic. IEEE Transactions on Multimedia, 25, 4830-4841. https://doi.org/10.1109/TMM.2022.3183386. Online publication date: 1-Jan-2023.
  • (2023) Quantifying Qualia. Emotional Machines, 279-294. https://doi.org/10.1007/978-3-658-37641-3_11. Online publication date: 2-Sep-2023.
  • (2022) Visual Enhancement Capsule Network for Aspect-based Multimodal Sentiment Analysis. Applied Sciences, 12(23), 12146. https://doi.org/10.3390/app122312146. Online publication date: 28-Nov-2022.
  • (2022) Evaluating multimodal strategies for multi-label movie genre classification. 2022 29th International Conference on Systems, Signals and Image Processing (IWSSIP), 1-4. https://doi.org/10.1109/IWSSIP55020.2022.9854451. Online publication date: 1-Jun-2022.
  • (2021) Deep Attentive Multimodal Network Representation Learning for Social Media Images. ACM Transactions on Internet Technology, 21(3), 1-17. https://doi.org/10.1145/3417295. Online publication date: 16-Jun-2021.
  • (2021) Adversarial Learning With Multi-Modal Attention for Visual Question Answering. IEEE Transactions on Neural Networks and Learning Systems, 32(9), 3894-3908. https://doi.org/10.1109/TNNLS.2020.3016083. Online publication date: Sep-2021.
  • (2021) Image-Text Multimodal Emotion Classification via Multi-View Attentional Network. IEEE Transactions on Multimedia, 23, 4014-4026. https://doi.org/10.1109/TMM.2020.3035277. Online publication date: 2021.
  • (2021) Robust Multimodal Representation Learning With Evolutionary Adversarial Attention Networks. IEEE Transactions on Evolutionary Computation, 25(5), 856-868. https://doi.org/10.1109/TEVC.2021.3066285. Online publication date: Oct-2021.
  • (2020) Adversarial Attentive Multi-Modal Embedding Learning for Image-Text Matching. IEEE Access, 8, 96237-96248. https://doi.org/10.1109/ACCESS.2020.2996407. Online publication date: 2020.
