DOI: 10.1145/3357384.3358104
CIKM Conference Proceedings · Short Paper

Cross-modal Image-Text Retrieval with Multitask Learning

Published: 03 November 2019

Abstract

In this paper, we propose a multi-task learning approach for cross-modal image-text retrieval. First, a correlation network is proposed for a relation recognition task, which helps learn the complex relations and common information across modalities. Then, we propose a correspondence cross-modal autoencoder for a cross-modal input reconstruction task, which correlates the hidden representations of the two uni-modal autoencoders. In addition, to further improve retrieval performance, two regularization terms (variance and consistency constraints) are imposed on the cross-modal embeddings so that the learned common representation has large variance and is modality-invariant. Finally, to enable large-scale cross-modal similarity search, a flexible binary transform network converts the text and image embeddings into binary codes. Extensive experiments on two benchmark datasets demonstrate that our model consistently outperforms strong baseline methods. Source code is available at https://github.com/daerv/DAEVR.
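The loss components sketched in the abstract — a consistency constraint pulling matched image and text embeddings together, a variance constraint keeping the common space from collapsing, and a binarization step for large-scale search — can be illustrated with toy NumPy stand-ins. Everything below (the synthetic embeddings, function names, and the sign-based binarization) is an assumption for illustration, not the paper's actual networks or training objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy paired embeddings standing in for the hidden layers of the
# two uni-modal autoencoders (8 image-text pairs, dimension 4).
img = rng.normal(size=(8, 4))
txt = img + 0.1 * rng.normal(size=(8, 4))  # loosely aligned text side

def consistency_loss(a, b):
    """Penalize modality-specific offsets: matched pairs should coincide."""
    return np.mean(np.sum((a - b) ** 2, axis=1))

def variance_penalty(z, target=1.0):
    """Push each embedding dimension's variance toward `target`,
    discouraging a collapsed (zero-variance) common space."""
    return np.mean((z.var(axis=0) - target) ** 2)

def binarize(z):
    """Sign-based stand-in for the paper's binary transform network."""
    return np.where(z >= 0, 1, -1)

# Combined regularized objective over the toy embeddings.
loss = consistency_loss(img, txt) + variance_penalty(img) + variance_penalty(txt)

codes_img, codes_txt = binarize(img), binarize(txt)

def hamming(a, b):
    """Hamming distance, the fast similarity measure over binary codes."""
    return np.sum(a != b, axis=-1)
```

Binary codes trade a small loss in precision for constant-time bitwise distance computation, which is what makes the retrieval step scale to large collections.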




Published In

CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management
November 2019, 3373 pages
ISBN: 9781450369763
DOI: 10.1145/3357384
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. correlation network
    2. correspondence autoencoder
    3. cross-modal retrieval
    4. variance constraint

    Qualifiers

    • Short-paper

    Funding Sources

    • CCF-Tencent RhinoBird Young Faculty Open Research Fund
    • National Natural Science Foundation of China
    • Youth Innovation Promotion Association CAS

    Conference

    CIKM '19

    Acceptance Rates

    CIKM '19 Paper Acceptance Rate 202 of 1,031 submissions, 20%;
    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%



    Cited By

    • (2024) Learning Prompt-Level Quality Variance for Cost-Effective Text-to-Image Generation. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 3847-3851. DOI: 10.1145/3627673.3679954. Online publication date: 21-Oct-2024.
    • (2024) Based on Spatial and Temporal Implicit Semantic Relational Inference for Cross-Modal Retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 34(11), 11286-11298. DOI: 10.1109/TCSVT.2024.3411298. Online publication date: Nov-2024.
    • (2024) Towards Generated Image Provenance Analysis via Conceptual-Similar-Guided-SLIP Retrieval. IEEE Signal Processing Letters, 31, 1419-1423. DOI: 10.1109/LSP.2024.3388958. Online publication date: 2024.
    • (2024) Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions. Proceedings of the IEEE, 112(11), 1716-1754. DOI: 10.1109/JPROC.2024.3525147. Online publication date: Nov-2024.
    • (2024) A Review on Language-Independent Search on Speech and its Applications. IEEE Access, 12, 194182-194202. DOI: 10.1109/ACCESS.2024.3520394. Online publication date: 2024.
    • (2024) Touch-text answer for human-robot interaction via supervised adversarial learning. Expert Systems with Applications, 242, 122738. DOI: 10.1016/j.eswa.2023.122738. Online publication date: May-2024.
    • (2024) Multi-Task Visual Semantic Embedding Network for Image-Text Retrieval. Journal of Computer Science and Technology, 39(4), 811-826. DOI: 10.1007/s11390-024-4125-1. Online publication date: 1-Jul-2024.
    • (2023) Many Hands Make Light Work: Transferring Knowledge From Auxiliary Tasks for Video-Text Retrieval. IEEE Transactions on Multimedia, 25, 2661-2674. DOI: 10.1109/TMM.2022.3149716. Online publication date: 1-Jan-2023.
    • (2023) Multi-Scale Fine-Grained Alignments for Image and Sentence Matching. IEEE Transactions on Multimedia, 25, 543-556. DOI: 10.1109/TMM.2021.3128744. Online publication date: 2023.
    • (2023) Bi-directional Image–Text Matching Deep Learning-Based Approaches: Concepts, Methodologies, Benchmarks and Challenges. International Journal of Computational Intelligence Systems, 16(1). DOI: 10.1007/s44196-023-00260-3. Online publication date: 12-May-2023.
