VETE: improving visual embeddings through text descriptions for eCommerce search engines

Martínez, Guillermo; Saavedra, Jose M.; Murrugara-Llerena, Nils

doi:10.1007/s11042-023-14595-8

Published: 29 March 2023

Volume 82, pages 41343–41379, (2023)
Multimedia Tools and Applications

Guillermo Martínez¹,
Jose M. Saavedra² (ORCID: orcid.org/0000-0002-9644-5164) &
Nils Murrugara-Llerena³

Abstract

A search engine is a critical component in the success of eCommerce. Searching for a particular product can be frustrating when users want specific product features that cannot be easily represented by a simple text search or catalog filter. Thanks to advances in artificial intelligence and deep learning, content-based visual search engines are now included in eCommerce search bars. A visual search is instantaneous (just take a picture and search) and fully expresses image details. However, visual search in eCommerce still suffers from a large semantic gap. Traditionally, visual search models are trained in a supervised manner with large collections of images that do not represent well the semantics of a target eCommerce catalog. Therefore, we propose VETE (Visual Embedding modulated by TExt) to boost visual embeddings in eCommerce by leveraging textual information of products in the target catalog. Our proposal improves the baseline visual space for global and fine-grained categories in real-world eCommerce data. We achieved an average improvement of 3.48% for catalog-like queries, and 3.70% for noisy ones.

Data Availability

The datasets generated during and/or analysed during the current study are available in https://github.com/jmsaavedrar/vete.

Author information

Authors and Affiliations

Department of Computer Science, University of Chile, Beauchef 850, Santiago, 8370448, RM, Chile
Guillermo Martínez
Facultad de Ingenieria y Ciencias Aplicadas, Universidad de los Andes, Mons. Alvaro del Portillo 124555, Santiago, 7620001, RM, Chile
Jose M. Saavedra
Weber State University, 3848 Harrison Blvd, Ogden, 84408, UT, USA
Nils Murrugara-Llerena

Corresponding author

Correspondence to Jose M. Saavedra.

Ethics declarations

Competing interests

The authors have no relevant financial or non-financial interests to disclose. The authors have no competing interests to declare that are relevant to the content of this article. All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript. The authors have no financial or proprietary interests in any material discussed in this article.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Jose M. Saavedra and Nils Murrugara-Llerena contributed equally to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Martínez, G., Saavedra, J.M. & Murrugara-Llerena, N. VETE: improving visual embeddings through text descriptions for eCommerce search engines. Multimed Tools Appl 82, 41343–41379 (2023). https://doi.org/10.1007/s11042-023-14595-8

Received: 03 November 2022
Revised: 10 January 2023
Accepted: 31 January 2023
Published: 29 March 2023
Issue Date: November 2023
DOI: https://doi.org/10.1007/s11042-023-14595-8
