DOI: 10.1145/2939672.2939728
research-article

Images Don't Lie: Transferring Deep Visual Semantic Features to Large-Scale Multimodal Learning to Rank

Published: 13 August 2016

Abstract

Search is at the heart of modern e-commerce. As a result, the task of ranking search results automatically (learning to rank) is a multibillion-dollar machine learning problem. Traditional models optimize over a few hand-constructed features based on the item's text. In this paper, we introduce a multimodal learning-to-rank model that combines these traditional features with visual semantic features transferred from a deep convolutional neural network. In a large-scale experiment using data from the online marketplace Etsy, we verify that moving to a multimodal representation significantly improves ranking quality. We show how image features can capture fine-grained style information not available in a text-only representation. In addition, we show concrete examples of how image information can successfully disentangle pairs of highly different items that are ranked similarly by a text-only model.
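To make the idea of a multimodal representation concrete, here is a minimal sketch, not the paper's implementation. It assumes that text and image features are simply concatenated and that the ranker is a linear model trained with a pairwise hinge update; the abstract does not specify either choice. The function names (`concat_features`, `train_pairwise_ranker`) and the toy feature values are hypothetical. The two toy items share identical text features and differ only in their image features, mirroring the case where a text-only model ranks two different items the same.

```python
import random

def concat_features(text_vec, image_vec):
    """Join text and image feature vectors into one multimodal vector (assumed fusion)."""
    return list(text_vec) + list(image_vec)

def dot(w, x):
    """Linear score w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pairwise_ranker(pairs, dim, epochs=50, lr=0.1, seed=0):
    """Fit a linear scorer from (preferred, other) feature-vector pairs.

    Uses a pairwise hinge update: whenever the preferred item does not
    outscore the other by a margin of 1, move w toward their difference.
    This loss is an illustrative assumption, not taken from the paper.
    """
    rng = random.Random(seed)
    w = [0.0] * dim
    pairs = list(pairs)
    for _ in range(epochs):
        rng.shuffle(pairs)
        for preferred, other in pairs:
            diff = [p - o for p, o in zip(preferred, other)]
            if dot(w, diff) < 1.0:  # margin violated
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

# Hypothetical toy items: same text features, different image features.
text_a, image_a = [1.0, 0.5], [0.9, 0.1]
text_b, image_b = [1.0, 0.5], [0.1, 0.8]

# Text-only model: the feature difference is zero, so nothing can be learned.
w_text = train_pairwise_ranker([(text_a, text_b)], dim=2)
text_gap = dot(w_text, text_a) - dot(w_text, text_b)

# Multimodal model: the image features separate the two items.
x_a = concat_features(text_a, image_a)
x_b = concat_features(text_b, image_b)
w_multi = train_pairwise_ranker([(x_a, x_b)], dim=4)
multi_gap = dot(w_multi, x_a) - dot(w_multi, x_b)

print("text-only score gap:", text_gap)
print("multimodal score gap:", multi_gap)
```

In this toy setting the text-only scorer assigns both items the same score, while the multimodal scorer ranks the preferred item higher. Real systems would use many training pairs and image features extracted from a pretrained network rather than hand-written numbers.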

Published In

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2016
2176 pages
ISBN: 9781450342322
DOI: 10.1145/2939672
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. computer vision
  2. deep learning
  3. learning to rank

Qualifiers

  • Research-article

Conference

KDD '16
Acceptance Rates

KDD '16 Paper Acceptance Rate 66 of 1,115 submissions, 6%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Cited By

  • (2024) Enhancing Taobao Display Advertising with Multimodal Representations: Challenges, Approaches and Insights. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pp. 4858-4865. DOI: 10.1145/3627673.3680068. Online publication date: 21-Oct-2024
  • (2024) CMLsearch: Semantic visual search and simulation through segmented colour, material, and lighting in interior image. Journal of Computational Design and Engineering 12:1, pp. 179-299. DOI: 10.1093/jcde/qwae114. Online publication date: 30-Dec-2024
  • (2024) Graph neural networks-based preference learning method for object ranking. International Journal of Approximate Reasoning, article 109131. DOI: 10.1016/j.ijar.2024.109131. Online publication date: Jan-2024
  • (2023) Nearest Neighbor-Based Strategy to Optimize Multi-View Triplet Network for Classification of Small-Sample Medical Imaging Data. IEEE Transactions on Neural Networks and Learning Systems 34:2, pp. 586-600. DOI: 10.1109/TNNLS.2021.3059635. Online publication date: Feb-2023
  • (2022) From computer vision to short text understanding: Applying similar approaches into different disciplines. Intelligent and Converged Networks 3:2, pp. 161-172. DOI: 10.23919/ICN.2022.0010. Online publication date: Jun-2022
  • (2022) Explorative Application of Fusion Techniques for Multimodal Hate Speech Detection. SN Computer Science 3:2. DOI: 10.1007/s42979-021-01007-7. Online publication date: 10-Jan-2022
  • (2021) Advertising Popularity Feature Collaborative Recommendation Algorithm Based on Attention-LSTM Model. Security and Communication Networks 2021. DOI: 10.1155/2021/9940232. Online publication date: 18-Dec-2021
  • (2020) Spatial-Content Image Search in Complex Scenes. 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 2492-2500. DOI: 10.1109/WACV45572.2020.9093427. Online publication date: Mar-2020
  • (2019) VISE. Proceedings of the VLDB Endowment 12:12, pp. 1842-1845. DOI: 10.14778/3352063.3352080. Online publication date: 1-Aug-2019
  • (2019) Learning Compositional, Visual and Relational Representations for CTR Prediction in Sponsored Search. Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2851-2859. DOI: 10.1145/3357384.3357833. Online publication date: 3-Nov-2019
