DOI: 10.1145/3240508.3240521

Learning Semantic Structure-preserved Embeddings for Cross-modal Retrieval

Published: 15 October 2018

Abstract

This paper learns semantic embeddings for multi-label cross-modal retrieval. Our method exploits the structure in semantics represented by label vectors to guide the learning of embeddings. First, we construct a semantic graph from the label vectors that incorporates data from both modalities, and enforce the embeddings to preserve the local structure of this graph. Second, we enforce the embeddings to reconstruct the labels well, i.e., the global semantic structure. In addition, we encourage the embeddings to preserve the local geometric structure of each modality. Accordingly, local and global semantic-structure consistency as well as local geometric-structure consistency are enforced simultaneously. The mappings between inputs and embeddings are nonlinear neural networks, which offer larger capacity and more flexibility. The overall objective function is optimized by stochastic gradient descent for scalability to large datasets. Experiments conducted on three real-world datasets clearly demonstrate the superiority of the proposed approach over state-of-the-art methods.
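The first two consistency terms from the abstract can be sketched in plain NumPy. This is a minimal illustration with hypothetical shapes: random embeddings stand in for the outputs of the paper's nonlinear networks, cosine similarity between multi-hot label vectors defines the semantic-graph affinities, and a least-squares linear decoder stands in for the label-reconstruction mapping; the per-modality geometric term (omitted here) would be analogous, using a k-NN graph built from the raw features of each modality.

```python
# Hedged sketch of the local and global semantic-structure losses
# described in the abstract; variable names and shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d_emb, n_labels = 6, 4, 5

# Multi-hot label vectors, shared across modalities.
Y = (rng.random((n, n_labels)) > 0.5).astype(float)

# Toy embeddings standing in for the two nonlinear networks' outputs.
E = rng.standard_normal((n, d_emb))

# 1) Semantic graph: affinity = cosine similarity between label vectors.
norms = np.linalg.norm(Y, axis=1, keepdims=True) + 1e-8
S = (Y / norms) @ (Y / norms).T

# Local semantic-structure loss: semantically similar samples (large S_ij)
# are pulled close in the embedding space (graph-embedding style).
D2 = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
loss_local = (S * D2).sum() / (n * n)

# 2) Global semantic structure: reconstruct the labels from the embeddings
# via a linear decoder W (fit by least squares here, for illustration only).
W, *_ = np.linalg.lstsq(E, Y, rcond=None)
loss_global = ((E @ W - Y) ** 2).mean()

total = loss_local + loss_global
```

In the paper both terms would be minimized jointly with the geometric term by stochastic gradient descent over the network parameters; here the losses are merely evaluated once to show their form.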




Published In

MM '18: Proceedings of the 26th ACM international conference on Multimedia
October 2018
2167 pages
ISBN:9781450356657
DOI:10.1145/3240508

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cross-modal retrieval
  2. graph embeddings
  3. semantic embeddings

Qualifiers

  • Research-article

Conference

MM '18
MM '18: ACM Multimedia Conference
October 22 - 26, 2018
Seoul, Republic of Korea

Acceptance Rates

MM '18 Paper Acceptance Rate: 209 of 757 submissions, 28%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%


Cited By

  • (2024) Team HUGE: Image-Text Matching via Hierarchical and Unified Graph Enhancing. Proceedings of the 2024 International Conference on Multimedia Retrieval, 704–712. DOI: 10.1145/3652583.3658001. Published 30 May 2024.
  • (2024) Learning Relationship-Enhanced Semantic Graph for Fine-Grained Image–Text Matching. IEEE Transactions on Cybernetics 54(2), 948–961. DOI: 10.1109/TCYB.2022.3179020. Published February 2024.
  • (2024) Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions. Proceedings of the IEEE 112(11), 1716–1754. DOI: 10.1109/JPROC.2024.3525147. Published November 2024.
  • (2023) VL-NMS: Breaking Proposal Bottlenecks in Two-stage Visual-language Matching. ACM Transactions on Multimedia Computing, Communications, and Applications 19(5s), 1–24. DOI: 10.1145/3579095. Published 7 June 2023.
  • (2023) Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching. IEEE Transactions on Multimedia 25, 1320–1332. DOI: 10.1109/TMM.2022.3141603. Published 2023.
  • (2023) Deep Supervised Dual Cycle Adversarial Network for Cross-Modal Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 33(2), 920–934. DOI: 10.1109/TCSVT.2022.3203247. Published February 2023.
  • (2023) Similarity Contrastive Capsule Transformation for Image-Text Matching. 2023 9th International Conference on Mechatronics and Robotics Engineering (ICMRE), 84–85. DOI: 10.1109/ICMRE56789.2023.10106583. Published 10 February 2023.
  • (2023) Cross-modal retrieval of remote sensing images and text based on self-attention unsupervised deep common feature space. International Journal of Remote Sensing 44(12), 3892–3909. DOI: 10.1080/01431161.2023.2225705. Published 12 July 2023.
  • (2023) Unifying knowledge iterative dissemination and relational reconstruction network for image–text matching. Information Processing and Management 60(1). DOI: 10.1016/j.ipm.2022.103154. Published 1 January 2023.
  • (2023) Adversarial pre-optimized graph representation learning with double-order sampling for cross-modal retrieval. Expert Systems with Applications 231. DOI: 10.1016/j.eswa.2023.120731. Published 20 September 2023.
