DOI: 10.1145/3394171.3413558 (Research Article, MM Conference Proceedings)

Visually Precise Query

Published: 12 October 2020

ABSTRACT

We present the problem of Visually Precise Query (VPQ) generation, which enables a more intuitive match between a user's information need and an e-commerce site's product descriptions. Given an image of a fashion item, what is the optimal search query that will retrieve the same or closely related product(s) with high probability? In this paper we introduce the task of VPQ generation, which takes a product image and its title as input and produces a word-level extractive summary of the title: a list of salient attributes that can then be used as a query to search for similar products. We collect a large dataset of fashion images and their titles and merge it with an existing research dataset that was created for a different task. Given an image and title pair, the VPQ problem is posed as identifying a non-contiguous collection of spans within the title. We provide a dataset of around 400K image, title and corresponding VPQ entries and release it to the research community. We describe the data collection process in detail and discuss future directions of research for the problem introduced in this work. We provide standard text-only and visual-only baselines, as well as multi-modal baseline models, to analyze the task. Finally, we propose a hybrid fusion model which we believe points to a promising direction of research for the multi-modal community.
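To make the extractive formulation above concrete, the following is a minimal sketch in Python of how a single (image, title, VPQ) entry could be represented and reduced to word-level keep/drop labels over the title. The field names, the example title and the example query are hypothetical illustrations; the dataset's actual schema and the paper's models are not reproduced here.

# Minimal sketch: VPQ generation viewed as word-level extractive tagging.
# All names and example values below are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class VPQEntry:
    image_path: str       # path to the product image (not used in this sketch)
    title: str            # full product title from the catalogue
    query: List[str]      # salient title words forming the visually precise query

def title_to_tags(entry: VPQEntry) -> List[int]:
    """Label each title token 1 if it belongs to the VPQ, else 0,
    i.e. cast VPQ generation as identifying non-contiguous spans in the title."""
    query_words = {w.lower() for w in entry.query}
    return [1 if tok.lower() in query_words else 0 for tok in entry.title.split()]

entry = VPQEntry(
    image_path="images/12345.jpg",
    title="Acme Women Blue Printed A-Line Cotton Kurta with Round Neck",
    query=["Blue", "Printed", "A-Line", "Kurta"],
)
print(title_to_tags(entry))   # -> [0, 0, 1, 1, 1, 0, 1, 0, 0, 0]

Under such a framing, text-only, visual-only and multi-modal baselines can all be evaluated against the same word-level extraction labels.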

Supplemental Material

3394171.3413558.mp4 (mp4, 13.3 MB)

Published in

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020, 4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171
Copyright © 2020 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions, 24%
