DOI: 10.1145/3357384.3358104
CIKM Conference Proceedings · Short Paper

Cross-modal Image-Text Retrieval with Multitask Learning

Published: 03 November 2019

Abstract

In this paper, we propose a multi-task learning approach for cross-modal image-text retrieval. First, a correlation network is proposed for a relation recognition task, which helps learn the complex relations and common information across modalities. Then, we propose a correspondence cross-modal autoencoder for a cross-modal input reconstruction task, which correlates the hidden representations of the two uni-modal autoencoders. In addition, to further improve retrieval performance, two regularization terms (variance and consistency constraints) are imposed on the cross-modal embeddings so that the learned common representation has large variance and is modality-invariant. Finally, to enable large-scale cross-modal similarity search, a flexible binary transform network converts the text and image embeddings into binary codes. Extensive experiments on two benchmark datasets demonstrate that our model consistently outperforms strong baseline methods. Source code is available at https://github.com/daerv/DAEVR.
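The loss components sketched in the abstract — a consistency constraint pulling matched image and text embeddings together, a variance constraint keeping the common space from collapsing, and a binarization step for large-scale search — can be illustrated with toy NumPy stand-ins. Everything below (the synthetic embeddings, function names, and the sign-based binarization) is an assumption for illustration, not the paper's actual networks or training objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy paired embeddings standing in for the hidden layers of the
# two uni-modal autoencoders (8 image-text pairs, dimension 4).
img = rng.normal(size=(8, 4))
txt = img + 0.1 * rng.normal(size=(8, 4))  # loosely aligned text side

def consistency_loss(a, b):
    """Penalize modality-specific offsets: matched pairs should coincide."""
    return np.mean(np.sum((a - b) ** 2, axis=1))

def variance_penalty(z, target=1.0):
    """Push each embedding dimension's variance toward `target`,
    discouraging a collapsed (zero-variance) common space."""
    return np.mean((z.var(axis=0) - target) ** 2)

def binarize(z):
    """Sign-based stand-in for the paper's binary transform network."""
    return np.where(z >= 0, 1, -1)

# Combined regularized objective over the toy embeddings.
loss = consistency_loss(img, txt) + variance_penalty(img) + variance_penalty(txt)

codes_img, codes_txt = binarize(img), binarize(txt)

def hamming(a, b):
    """Hamming distance, the fast similarity measure over binary codes."""
    return np.sum(a != b, axis=-1)
```

Binary codes trade a small loss in precision for constant-time bitwise distance computation, which is what makes the retrieval step scale to large collections.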




Published In

CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management
November 2019, 3373 pages
ISBN: 9781450369763
DOI: 10.1145/3357384
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. correlation network
    2. correspondence autoencoder
    3. cross-modal retrieval
    4. variance constraint

    Qualifiers

    • Short-paper

    Funding Sources

    • CCF-Tencent RhinoBird Young Faculty Open Research Fund
    • National Natural Science Foundation of China
    • Youth Innovation Promotion Association CAS

    Conference

    CIKM '19

    Acceptance Rates

    CIKM '19 Paper Acceptance Rate 202 of 1,031 submissions, 20%;
    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%



    Cited By

    • (2024) Learning Prompt-Level Quality Variance for Cost-Effective Text-to-Image Generation. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 3847-3851. DOI: 10.1145/3627673.3679954. Online publication date: 21-Oct-2024.
    • (2024) Based on Spatial and Temporal Implicit Semantic Relational Inference for Cross-Modal Retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 34(11), 11286-11298. DOI: 10.1109/TCSVT.2024.3411298. Online publication date: Nov-2024.
    • (2024) Towards Generated Image Provenance Analysis via Conceptual-Similar-Guided-SLIP Retrieval. IEEE Signal Processing Letters, 31, 1419-1423. DOI: 10.1109/LSP.2024.3388958. Online publication date: 2024.
    • (2024) Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions. Proceedings of the IEEE, 112(11), 1716-1754. DOI: 10.1109/JPROC.2024.3525147. Online publication date: Nov-2024.
    • (2024) A Review on Language-Independent Search on Speech and its Applications. IEEE Access, 12, 194182-194202. DOI: 10.1109/ACCESS.2024.3520394. Online publication date: 2024.
    • (2024) Touch-text answer for human-robot interaction via supervised adversarial learning. Expert Systems with Applications, 242, 122738. DOI: 10.1016/j.eswa.2023.122738. Online publication date: May-2024.
    • (2024) Multi-Task Visual Semantic Embedding Network for Image-Text Retrieval. Journal of Computer Science and Technology, 39(4), 811-826. DOI: 10.1007/s11390-024-4125-1. Online publication date: 1-Jul-2024.
    • (2023) Many Hands Make Light Work: Transferring Knowledge From Auxiliary Tasks for Video-Text Retrieval. IEEE Transactions on Multimedia, 25, 2661-2674. DOI: 10.1109/TMM.2022.3149716. Online publication date: 1-Jan-2023.
    • (2023) Multi-Scale Fine-Grained Alignments for Image and Sentence Matching. IEEE Transactions on Multimedia, 25, 543-556. DOI: 10.1109/TMM.2021.3128744. Online publication date: 2023.
    • (2023) Bi-directional Image–Text Matching Deep Learning-Based Approaches: Concepts, Methodologies, Benchmarks and Challenges. International Journal of Computational Intelligence Systems, 16(1). DOI: 10.1007/s44196-023-00260-3. Online publication date: 12-May-2023.
