DOI: 10.1145/3240508.3240614

Learning Joint Multimodal Representation with Adversarial Attention Networks

Published: 15 October 2018

Abstract

Learning a joint representation for multimodal data (e.g., data containing both visual content and a text description) has recently attracted extensive research interest. The features of different modalities are usually correlated and complementary, so a joint representation that captures this correlation is more effective than any subset of the features. However, most existing multimodal representation learning methods lack additional constraints that would enhance the robustness of the learned representations. In this paper, a novel Adversarial Attention Networks (AAN) model is proposed that combines an attention mechanism with adversarial networks for effective and robust multimodal representation learning. Specifically, a visual-semantic attention model with a siamese learning strategy is proposed to encode the fine-grained correlation between the visual and textual modalities. Meanwhile, an adversarial learning model is employed to regularize the generated representation by matching its posterior distribution to given priors. The two modules are then incorporated into an integrated learning framework to learn the joint multimodal representation. Experimental results on two tasks, multi-label classification and tag recommendation, show that the proposed model outperforms state-of-the-art representation learning methods.
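The abstract describes two coupled components: a visual-semantic attention model that aligns image regions with words, and an adversarial regularizer that matches the posterior distribution of the joint representation to a chosen prior. The following is a minimal PyTorch-style sketch of how such pieces can fit together; all module names, dimensions, and the Gaussian prior are illustrative assumptions, not the authors' implementation (which also uses a siamese learning strategy not shown here).

```python
# Sketch of (1) a visual-semantic attention encoder and (2) an adversarial
# regularizer on the joint representation. Hypothetical names/dimensions.
import torch
import torch.nn as nn

class VisualSemanticAttention(nn.Module):
    def __init__(self, img_dim=512, txt_dim=300, joint_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, regions, words):
        # regions: (B, R, img_dim) image region features; words: (B, W, txt_dim) word embeddings
        v = self.img_proj(regions)                # (B, R, joint_dim)
        t = self.txt_proj(words)                  # (B, W, joint_dim)
        scores = torch.bmm(v, t.transpose(1, 2))  # (B, R, W) region-word affinities
        attn = scores.softmax(dim=1)              # attend over regions for each word
        attended_v = torch.bmm(attn.transpose(1, 2), v)  # (B, W, joint_dim)
        # fuse into a single joint representation by mean pooling over words
        return (attended_v + t).mean(dim=1)       # (B, joint_dim)

class Discriminator(nn.Module):
    """Distinguishes joint representations from samples drawn from the prior."""
    def __init__(self, joint_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(joint_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, z):
        return self.net(z)  # raw logits; paired with BCEWithLogitsLoss below

# One adversarial step in the adversarial-autoencoder style the abstract alludes to:
if __name__ == "__main__":
    enc, disc = VisualSemanticAttention(), Discriminator()
    bce = nn.BCEWithLogitsLoss()
    regions, words = torch.randn(8, 36, 512), torch.randn(8, 12, 300)
    z = enc(regions, words)                       # joint multimodal representation
    z_prior = torch.randn_like(z)                 # e.g. a standard Gaussian prior (assumption)
    d_loss = bce(disc(z_prior), torch.ones(8, 1)) + bce(disc(z.detach()), torch.zeros(8, 1))
    g_loss = bce(disc(z), torch.ones(8, 1))       # encoder tries to fool the discriminator
    print(d_loss.item(), g_loss.item())
```

In practice the encoder and discriminator losses would be optimized alternately, with a task loss (e.g., multi-label classification or tag recommendation) applied to the joint representation z.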




Published In

MM '18: Proceedings of the 26th ACM international conference on Multimedia
October 2018
2167 pages
ISBN:9781450356657
DOI:10.1145/3240508


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 October 2018


Author Tags

  1. adversarial networks
  2. attention model
  3. multimodal
  4. representation learning
  5. siamese learning

Qualifiers

  • Research-article

Funding Sources

  • State Key Laboratory of Software Development Environment
  • National Natural Science Foundation of China
  • Beijing Natural Science Foundation of China

Conference

MM '18: ACM Multimedia Conference
October 22 - 26, 2018
Seoul, Republic of Korea

Acceptance Rates

MM '18 paper acceptance rate: 209 of 757 submissions (28%).
Overall acceptance rate: 2,145 of 8,556 submissions (25%).


Cited By

  • (2023) Self-Contained Entity Discovery from Captioned Videos. ACM Transactions on Multimedia Computing, Communications, and Applications. https://doi.org/10.1145/3583138. Online publication date: 7-Feb-2023.
  • (2023) Cross-Lingual Text Image Recognition via Multi-Hierarchy Cross-Modal Mimic. IEEE Transactions on Multimedia, 25, 4830-4841. https://doi.org/10.1109/TMM.2022.3183386. Online publication date: 1-Jan-2023.
  • (2023) Quantifying Qualia. Emotional Machines, 279-294. https://doi.org/10.1007/978-3-658-37641-3_11. Online publication date: 2-Sep-2023.
  • (2022) Visual Enhancement Capsule Network for Aspect-based Multimodal Sentiment Analysis. Applied Sciences, 12(23), 12146. https://doi.org/10.3390/app122312146. Online publication date: 28-Nov-2022.
  • (2022) Evaluating multimodal strategies for multi-label movie genre classification. 2022 29th International Conference on Systems, Signals and Image Processing (IWSSIP), 1-4. https://doi.org/10.1109/IWSSIP55020.2022.9854451. Online publication date: 1-Jun-2022.
  • (2021) Deep Attentive Multimodal Network Representation Learning for Social Media Images. ACM Transactions on Internet Technology, 21(3), 1-17. https://doi.org/10.1145/3417295. Online publication date: 16-Jun-2021.
  • (2021) Adversarial Learning With Multi-Modal Attention for Visual Question Answering. IEEE Transactions on Neural Networks and Learning Systems, 32(9), 3894-3908. https://doi.org/10.1109/TNNLS.2020.3016083. Online publication date: Sep-2021.
  • (2021) Image-Text Multimodal Emotion Classification via Multi-View Attentional Network. IEEE Transactions on Multimedia, 23, 4014-4026. https://doi.org/10.1109/TMM.2020.3035277. Online publication date: 2021.
  • (2021) Robust Multimodal Representation Learning With Evolutionary Adversarial Attention Networks. IEEE Transactions on Evolutionary Computation, 25(5), 856-868. https://doi.org/10.1109/TEVC.2021.3066285. Online publication date: Oct-2021.
  • (2020) Adversarial Attentive Multi-Modal Embedding Learning for Image-Text Matching. IEEE Access, 8, 96237-96248. https://doi.org/10.1109/ACCESS.2020.2996407. Online publication date: 2020.
