
Scene designer: compositional sketch-based image retrieval with contrastive learning and an auxiliary synthesis task

  • 1227: Content-based Image Retrieval
Multimedia Tools and Applications

Abstract

Scene Designer is a novel method for Compositional Sketch-based Image Retrieval (CSBIR) that pairs the main retrieval task with semantic layout synthesis, both to boost performance and to enable new creative workflows. While most sketch-based retrieval studies focus on single objects, we target multi-object scenes for increased query specificity and flexibility. Our training protocol improves contrastive learning by synthesising harder negative samples and introduces a layout synthesis task that further improves the semantic scene representations. We show that our object-oriented graph neural network (GNN) more than doubles the current state-of-the-art (SoTA) recall@1 on the SketchyCOCO CSBIR benchmark under our novel contrastive learning setting and combined search and synthesis tasks. Furthermore, we introduce QuickDrawCOCO, the first large-scale sketched scene dataset and benchmark.
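For readers unfamiliar with the recall@1 figure quoted above, the following is a minimal sketch of how a recall@K score is commonly computed for retrieval. It is not the authors' evaluation code. It assumes each query sketch has exactly one matching image, stored at the same index in the gallery, and that embeddings are compared by squared Euclidean distance. The names `squared_distance`, `recall_at_k`, `gallery` and `queries`, and the toy 2-D embeddings, are hypothetical.

```python
from typing import Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recall_at_k(
    queries: Sequence[Sequence[float]],
    gallery: Sequence[Sequence[float]],
    k: int = 1,
) -> float:
    """Fraction of queries whose true match ranks within the top k.

    Assumption: the true match for queries[i] is gallery[i].
    """
    hits = 0
    for i, query in enumerate(queries):
        # Rank gallery indices from nearest to farthest for this query.
        ranked = sorted(
            range(len(gallery)),
            key=lambda j: squared_distance(query, gallery[j]),
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy 2-D embeddings (hypothetical values, for illustration only).
gallery = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
queries = [[0.1, 0.0], [0.9, 0.1], [0.7, 0.6]]

recall_1 = recall_at_k(queries, gallery, k=1)
recall_2 = recall_at_k(queries, gallery, k=2)
```

In this toy setup the third query lands closer to the second gallery item than to its own match, so it counts as a miss at K=1 but as a hit at K=2.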



Data Availability

Instructions and code for recreating the QuickDrawCOCO dataset described in this paper can be found at the code repository for the model: https://github.com/leosampaio/scene-designer.


Author information


Corresponding author

Correspondence to Leo Sampaio Ferraz Ribeiro.

Ethics declarations

This work was supported by Fundação de Amparo à Pesquisa do Estado de São Paulo (grants 2017/22366-8, 2019/02808-1, 2019/07316-0), Conselho Nacional de Desenvolvimento Científico e Tecnológico (304266/2020-5), and a charitable donation from Adobe Inc.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sampaio Ferraz Ribeiro, L., Bui, T., Collomosse, J. et al. Scene designer: compositional sketch-based image retrieval with contrastive learning and an auxiliary synthesis task. Multimed Tools Appl 82, 38117–38139 (2023). https://doi.org/10.1007/s11042-022-14282-0


  • DOI: https://doi.org/10.1007/s11042-022-14282-0
