Learning Click-Based Deep Structure-Preserving Embeddings with Visual Attention

Authors:

Tao Mei

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 15, Issue 3

Article No.: 78, Pages 1 - 19

https://doi.org/10.1145/3328994

Published: 08 August 2019

Abstract

One fundamental problem in image search is learning the ranking function (i.e., the similarity between a query and an image). Recent progress on this topic has evolved through two paradigms: the text-based model and image ranker learning. The former relies on the text surrounding an image, making the similarity sensitive to the quality of the textual descriptions. The latter may lack robustness when human-labeled query-image pairs cannot represent user search intent precisely. We demonstrate in this article that both limitations can be mitigated by learning a cross-view embedding that leverages click data. Specifically, a novel click-based Deep Structure-Preserving Embeddings with visual Attention (DSPEA) model is presented, which consists of two components: deep convolutional neural networks followed by image embedding layers for learning the visual embedding, and a deep neural network for generating the query semantic embedding. Visual attention is incorporated at the top of the convolutional neural network to highlight the image regions relevant to the query. Furthermore, considering the high dimension of the query space, a new click-based representation on a query set is proposed to alleviate the resulting sparsity problem. The whole network is trained end-to-end by optimizing a large margin objective that combines cross-view ranking constraints with in-view neighborhood structure preservation constraints. On a large-scale click-based image dataset with 11.7 million queries and 1 million images, our model outperforms several state-of-the-art methods for keyword-based image search and achieves, to date, the best reported NDCG@25 of 52.21%.
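The large margin objective described in the abstract can be illustrated with a small sketch. This is not the authors' formulation: the cosine similarity, the hinge form, the margin value, the weighting factor `in_view_weight`, and all function and parameter names below are assumptions chosen only to show how a cross-view ranking term and an in-view structure-preservation term might be combined for a single training example.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def hinge(margin, s_pos, s_neg):
    """Large-margin penalty: zero once s_pos exceeds s_neg by at least `margin`."""
    return max(0.0, margin - s_pos + s_neg)


def combined_loss(query, img_pos, img_neg, query_nbr, query_far,
                  margin=0.1, in_view_weight=0.5):
    """Hypothetical per-example objective (all names are illustrative).

    cross-view term: the query should be closer to a clicked image
                     (img_pos) than to a non-clicked image (img_neg).
    in-view term:    the query should be closer to a neighboring query
                     (query_nbr) than to a non-neighbor (query_far).
    """
    cross_view = hinge(margin, cosine(query, img_pos), cosine(query, img_neg))
    in_view = hinge(margin, cosine(query, query_nbr), cosine(query, query_far))
    return cross_view + in_view_weight * in_view


if __name__ == "__main__":
    q = np.array([1.0, 0.0])
    clicked = np.array([1.0, 0.0])
    not_clicked = np.array([0.0, 1.0])
    # Correctly ordered embeddings incur no penalty; swapped ones do.
    print(combined_loss(q, clicked, not_clicked, clicked, not_clicked))
    print(combined_loss(q, not_clicked, clicked, clicked, not_clicked))
```

In an end-to-end setting, a loss of this shape would be summed over sampled triplets and backpropagated through both the image and query networks; the sketch above only evaluates it on fixed vectors.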

Index Terms

Learning Click-Based Deep Structure-Preserving Embeddings with Visual Attention
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
      1. Query log analysis
    2. Retrieval models and ranking
      1. Similarity measures

Information & Contributors

Information

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 15, Issue 3

August 2019

331 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3352586

Editor:
Alberto Del Bimbo
University of Firenze, Italy

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 August 2019

Accepted: 01 April 2019

Revised: 01 March 2019

Received: 01 September 2018

Published in TOMM Volume 15, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

Guangzhou Science and Technology Program, China
National Natural Science Foundation of China

Cited By

Han J, Lee Y (2024). Image sentiment considering color palette recommendations based on influence scores for image advertisement. Electronic Commerce Research. Online publication date: 9-May-2024.
https://doi.org/10.1007/s10660-024-09851-4
Kang T, Peng H, Xu W, Sun Y, Peng X (2023). Deep Learning-Based State-Dependent ARX Modeling and Predictive Control of Nonlinear Systems. IEEE Access 11, 32579-32594. Online publication date: 2023.
https://doi.org/10.1109/ACCESS.2023.3263180
Dubey S (2022). A Decade Survey of Content Based Image Retrieval Using Deep Learning. IEEE Transactions on Circuits and Systems for Video Technology 32:5, 2687-2704. Online publication date: May-2022.
https://doi.org/10.1109/TCSVT.2021.3080920
Pan Y, Chen Y, Bao Q, Zhang N, Yao T, Liu J, Mei T (2021). Smart Director: An Event-Driven Directing System for Live Broadcasting. ACM Transactions on Multimedia Computing, Communications, and Applications 17:4, 1-18. Online publication date: 12-Nov-2021.
https://dl.acm.org/doi/10.1145/3448981
Yang S, Li L, Wang S, Zhang W, Huang Q, Tian Q (2021). Graph Regularized Encoder-Decoder Networks for Image Representation Learning. IEEE Transactions on Multimedia 23, 3124-3136. Online publication date: 2021.
https://doi.org/10.1109/TMM.2020.3020697
Tembhurne J, Diwan T (2021). Sentiment analysis in textual, visual and multimodal inputs using recurrent neural networks. Multimedia Tools and Applications 80:5, 6871-6910. Online publication date: 1-Feb-2021.
https://dl.acm.org/doi/10.1007/s11042-020-10037-x
Liu Y, Gu K, Li X, Zhang Y (2020). Blind Image Quality Assessment by Natural Scene Statistics and Perceptual Characteristics. ACM Transactions on Multimedia Computing, Communications, and Applications 16:3, 1-91. Online publication date: 25-Aug-2020.
https://dl.acm.org/doi/10.1145/3414837
Li Y, Yao T, Pan Y, Chao H, Mei T (2020). Deep Metric Learning With Density Adaptivity. IEEE Transactions on Multimedia 22:5, 1285-1297. Online publication date: May-2020.
https://doi.org/10.1109/TMM.2019.2939711
