DOI: 10.1145/3240508.3240521

Learning Semantic Structure-preserved Embeddings for Cross-modal Retrieval

Published: 15 October 2018

Abstract

This paper learns semantic embeddings for multi-label cross-modal retrieval. Our method exploits the structure in semantics represented by label vectors to guide the learning of embeddings. First, we construct a semantic graph from the label vectors that incorporates data from both modalities, and enforce the embeddings to preserve the local structure of this graph. Second, we enforce the embeddings to reconstruct the labels well, i.e., the global semantic structure. In addition, we encourage the embeddings to preserve the local geometric structure of each modality. Accordingly, local and global semantic-structure consistency as well as local geometric-structure consistency are enforced simultaneously. The mappings between inputs and embeddings are nonlinear neural networks, which offer larger capacity and more flexibility. The overall objective function is optimized by stochastic gradient descent for scalability to large datasets. Experiments conducted on three real-world datasets clearly demonstrate the superiority of the proposed approach over state-of-the-art methods.
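The first two consistency terms from the abstract can be sketched in plain NumPy. This is a minimal illustration with hypothetical shapes: random embeddings stand in for the outputs of the paper's nonlinear networks, cosine similarity between multi-hot label vectors defines the semantic-graph affinities, and a least-squares linear decoder stands in for the label-reconstruction mapping; the per-modality geometric term (omitted here) would be analogous, using a k-NN graph built from the raw features of each modality.

```python
# Hedged sketch of the local and global semantic-structure losses
# described in the abstract; variable names and shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d_emb, n_labels = 6, 4, 5

# Multi-hot label vectors, shared across modalities.
Y = (rng.random((n, n_labels)) > 0.5).astype(float)

# Toy embeddings standing in for the two nonlinear networks' outputs.
E = rng.standard_normal((n, d_emb))

# 1) Semantic graph: affinity = cosine similarity between label vectors.
norms = np.linalg.norm(Y, axis=1, keepdims=True) + 1e-8
S = (Y / norms) @ (Y / norms).T

# Local semantic-structure loss: semantically similar samples (large S_ij)
# are pulled close in the embedding space (graph-embedding style).
D2 = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
loss_local = (S * D2).sum() / (n * n)

# 2) Global semantic structure: reconstruct the labels from the embeddings
# via a linear decoder W (fit by least squares here, for illustration only).
W, *_ = np.linalg.lstsq(E, Y, rcond=None)
loss_global = ((E @ W - Y) ** 2).mean()

total = loss_local + loss_global
```

In the paper both terms would be minimized jointly with the geometric term by stochastic gradient descent over the network parameters; here the losses are merely evaluated once to show their form.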




Published In

MM '18: Proceedings of the 26th ACM international conference on Multimedia
October 2018
2167 pages
ISBN:9781450356657
DOI:10.1145/3240508

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cross-modal retrieval
  2. graph embeddings
  3. semantic embeddings

Qualifiers

  • Research-article

Conference

MM '18
MM '18: ACM Multimedia Conference
October 22 - 26, 2018
Seoul, Republic of Korea

Acceptance Rates

MM '18 Paper Acceptance Rate: 209 of 757 submissions, 28%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%


Cited By

  • (2024) Team HUGE: Image-Text Matching via Hierarchical and Unified Graph Enhancing. Proceedings of the 2024 International Conference on Multimedia Retrieval, 704–712. DOI: 10.1145/3652583.3658001. Published 30 May 2024.
  • (2024) Learning Relationship-Enhanced Semantic Graph for Fine-Grained Image–Text Matching. IEEE Transactions on Cybernetics 54(2), 948–961. DOI: 10.1109/TCYB.2022.3179020. Published February 2024.
  • (2024) Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions. Proceedings of the IEEE 112(11), 1716–1754. DOI: 10.1109/JPROC.2024.3525147. Published November 2024.
  • (2023) VL-NMS: Breaking Proposal Bottlenecks in Two-stage Visual-language Matching. ACM Transactions on Multimedia Computing, Communications, and Applications 19(5s), 1–24. DOI: 10.1145/3579095. Published 7 June 2023.
  • (2023) Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching. IEEE Transactions on Multimedia 25, 1320–1332. DOI: 10.1109/TMM.2022.3141603. Published 2023.
  • (2023) Deep Supervised Dual Cycle Adversarial Network for Cross-Modal Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 33(2), 920–934. DOI: 10.1109/TCSVT.2022.3203247. Published February 2023.
  • (2023) Similarity Contrastive Capsule Transformation for Image-Text Matching. 2023 9th International Conference on Mechatronics and Robotics Engineering (ICMRE), 84–85. DOI: 10.1109/ICMRE56789.2023.10106583. Published 10 February 2023.
  • (2023) Cross-modal retrieval of remote sensing images and text based on self-attention unsupervised deep common feature space. International Journal of Remote Sensing 44(12), 3892–3909. DOI: 10.1080/01431161.2023.2225705. Published 12 July 2023.
  • (2023) Unifying knowledge iterative dissemination and relational reconstruction network for image–text matching. Information Processing and Management 60(1). DOI: 10.1016/j.ipm.2022.103154. Published 1 January 2023.
  • (2023) Adversarial pre-optimized graph representation learning with double-order sampling for cross-modal retrieval. Expert Systems with Applications 231. DOI: 10.1016/j.eswa.2023.120731. Published 20 September 2023.
