
Learning Click-Based Deep Structure-Preserving Embeddings with Visual Attention

Published: 08 August 2019

Abstract

One fundamental problem in image search is learning the ranking function (i.e., the similarity between a query and an image). Recent progress on this topic has evolved through two paradigms: the text-based model and image ranker learning. The former relies on the text surrounding an image, making the similarity sensitive to the quality of textual descriptions. The latter may suffer from robustness problems when human-labeled query-image pairs cannot represent user search intent precisely. We demonstrate in this article that these two limitations can be mitigated by learning a cross-view embedding that leverages click data. Specifically, a novel click-based Deep Structure-Preserving Embeddings with visual Attention (DSPEA) model is presented, which consists of two components: a deep convolutional neural network followed by image embedding layers for learning the visual embedding, and a deep neural network for generating the query semantic embedding. Meanwhile, visual attention is incorporated at the top of the convolutional neural network to reflect the regions of the image relevant to the query. Furthermore, considering the high dimensionality of the query space, a new click-based representation on a query set is proposed to alleviate the resulting sparsity problem. The whole network is trained end to end by optimizing a large-margin objective that combines cross-view ranking constraints with in-view neighborhood structure preservation constraints. On a large-scale click-based image dataset with 11.7 million queries and 1 million images, our model is shown to be effective for keyword-based image search, outperforming several state-of-the-art methods and achieving, to date, the best reported NDCG@25 of 52.21%.
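
The abstract does not spell out the objective, so the following is only a minimal sketch of how a large-margin loss might combine cross-view ranking terms with in-view structure-preservation terms. It assumes cosine similarity, triplet hinge terms, and a trade-off weight; the names `cosine`, `hinge`, `combined_loss`, `margin`, and `weight` are hypothetical and are not taken from the article.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hinge(anchor, positive, negative, margin):
    """Triplet hinge term: zero once the positive is at least `margin`
    more similar to the anchor than the negative is."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


def combined_loss(cross_view_triplets, in_view_triplets, margin=0.1, weight=0.5):
    """Sum of cross-view ranking terms plus weighted in-view terms.

    cross_view_triplets: (query, clicked image, non-clicked image) embeddings.
    in_view_triplets: (item, same-view neighbor, same-view non-neighbor) embeddings.
    `margin` and `weight` are assumed hyperparameters, not the article's values.
    """
    cross = sum(hinge(a, p, n, margin) for a, p, n in cross_view_triplets)
    within = sum(hinge(a, p, n, margin) for a, p, n in in_view_triplets)
    return cross + weight * within


# Toy 2-D embeddings: the clicked image aligns with the query, so no penalty.
query, clicked, unclicked = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
satisfied = combined_loss([(query, clicked, unclicked)], [])
# Swapping the images violates the ranking constraint and yields a positive loss.
violated = combined_loss([(query, unclicked, clicked)], [])
```

In this sketch a triplet contributes nothing once its constraint holds with the required margin, so only violated constraints would drive the gradient in an end-to-end trained network.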

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 3
August 2019
331 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3352586
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 August 2019
Accepted: 01 April 2019
Revised: 01 March 2019
Received: 01 September 2018
Published in TOMM Volume 15, Issue 3

Author Tags

  1. CNN
  2. Cross-view embedding
  3. click data
  4. image search

Qualifiers

  • Research-article
  • Research
  • Refereed

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 3
  • Downloads (last 6 weeks): 0
Reflects downloads up to 23 Feb 2025

Citations

Cited By

  • (2024) Image sentiment considering color palette recommendations based on influence scores for image advertisement. Electronic Commerce Research. DOI: 10.1007/s10660-024-09851-4. Online publication date: 9-May-2024
  • (2023) Deep Learning-Based State-Dependent ARX Modeling and Predictive Control of Nonlinear Systems. IEEE Access 11 (32579-32594). DOI: 10.1109/ACCESS.2023.3263180. Online publication date: 2023
  • (2022) A Decade Survey of Content Based Image Retrieval Using Deep Learning. IEEE Transactions on Circuits and Systems for Video Technology 32:5 (2687-2704). DOI: 10.1109/TCSVT.2021.3080920. Online publication date: May-2022
  • (2021) Smart Director: An Event-Driven Directing System for Live Broadcasting. ACM Transactions on Multimedia Computing, Communications, and Applications 17:4 (1-18). DOI: 10.1145/3448981. Online publication date: 12-Nov-2021
  • (2021) Graph Regularized Encoder-Decoder Networks for Image Representation Learning. IEEE Transactions on Multimedia 23 (3124-3136). DOI: 10.1109/TMM.2020.3020697. Online publication date: 2021
  • (2021) Sentiment analysis in textual, visual and multimodal inputs using recurrent neural networks. Multimedia Tools and Applications 80:5 (6871-6910). DOI: 10.1007/s11042-020-10037-x. Online publication date: 1-Feb-2021
  • (2020) Blind Image Quality Assessment by Natural Scene Statistics and Perceptual Characteristics. ACM Transactions on Multimedia Computing, Communications, and Applications 16:3 (1-91). DOI: 10.1145/3414837. Online publication date: 25-Aug-2020
  • (2020) Deep Metric Learning With Density Adaptivity. IEEE Transactions on Multimedia 22:5 (1285-1297). DOI: 10.1109/TMM.2019.2939711. Online publication date: May-2020
