Scene designer: compositional sketch-based image retrieval with contrastive learning and an auxiliary synthesis task

Sampaio Ferraz Ribeiro, Leo; Bui, Tu; Collomosse, John; Ponti, Moacir

doi:10.1007/s11042-022-14282-0

Part of a collection: 1227: Content-based Image Retrieval
Published: 20 December 2022

Volume 82, pages 38117–38139, (2023)

Multimedia Tools and Applications

Leo Sampaio Ferraz Ribeiro^1 (ORCID: orcid.org/0000-0003-1781-2630),
Tu Bui²,
John Collomosse^2,3 &
…
Moacir Ponti^1,4

Abstract

Scene Designer is a novel method for Compositional Sketch-based Image Retrieval (CSBIR) that pairs its main retrieval task with semantic layout synthesis, both to boost retrieval performance and to enable new creative workflows. While most sketch-based retrieval studies focus on single-object queries, we target multi-object scenes, which make queries more specific and more flexible. Our training protocol improves contrastive learning by synthesising harder negative samples, and it introduces a layout synthesis task that further improves the semantic scene representations. We show that our object-oriented graph neural network (GNN), trained under this novel contrastive learning setting with the combined search and synthesis tasks, more than doubles the current state-of-the-art recall@1 on the SketchyCOCO CSBIR benchmark. Furthermore, we introduce QuickDrawCOCO, the first large-scale sketched scene dataset and benchmark.
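
For readers unfamiliar with the recall@1 metric quoted above, the following is a minimal sketch of how recall@k is commonly computed for retrieval. It is not taken from the paper or its code repository: the function name `recall_at_k`, the similarity-matrix input and the toy scores are all illustrative assumptions, and it assumes each query sketch has exactly one correct gallery image.

```python
from typing import Sequence


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    target_index: Sequence[int],
    k: int = 1,
) -> float:
    """Fraction of queries whose target image is among the k best-scoring gallery items.

    similarity[q][g] is a score between query sketch q and gallery image g
    (higher means more similar). target_index[q] is the gallery position of
    the single correct image for query q. Both are hypothetical inputs.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not similarity:
        raise ValueError("at least one query is required")
    if len(similarity) != len(target_index):
        raise ValueError("one target index is required per query")

    hits = 0
    for scores, target in zip(similarity, target_index):
        # Rank gallery items by descending score; the sort is stable,
        # so tied items keep their gallery order.
        ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy data (made up): 3 query sketches against a gallery of 4 images.
toy_similarity = [
    [0.9, 0.1, 0.3, 0.2],  # target 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # target 1 is ranked second
    [0.5, 0.6, 0.1, 0.7],  # target 2 is ranked last
]
toy_targets = [0, 1, 2]

r1 = recall_at_k(toy_similarity, toy_targets, k=1)
r2 = recall_at_k(toy_similarity, toy_targets, k=2)
print(f"recall@1 = {r1:.3f}, recall@2 = {r2:.3f}")
```

Under this definition, recall@1 counts only queries whose correct image is the single top-ranked result, which is why it is the strictest of the recall@k figures.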

Data Availability

Instructions and code for recreating the QuickDrawCOCO dataset described in this paper can be found at the code repository for the model: https://github.com/leosampaio/scene-designer.

Author information

Authors and Affiliations

1. ICMC, Universidade de São Paulo, São Carlos, SP, Brazil
Leo Sampaio Ferraz Ribeiro & Moacir Ponti
2. CVSSP, University of Surrey, Guildford, Surrey, UK
Tu Bui & John Collomosse
3. Creative Intelligence Lab, Adobe Research, San Jose, CA, USA
John Collomosse
4. Mercado Livre, Osasco, SP, Brazil
Moacir Ponti

Corresponding author

Correspondence to Leo Sampaio Ferraz Ribeiro.

Ethics declarations

This work was supported by Fundação de Amparo à Pesquisa do Estado de São Paulo (grants 2017/22366-8, 2019/02808-1, 2019/07316-0), Conselho Nacional de Desenvolvimento Científico e Tecnológico (304266/2020-5), and a charitable donation from Adobe Inc.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sampaio Ferraz Ribeiro, L., Bui, T., Collomosse, J. et al. Scene designer: compositional sketch-based image retrieval with contrastive learning and an auxiliary synthesis task. Multimed Tools Appl 82, 38117–38139 (2023). https://doi.org/10.1007/s11042-022-14282-0

Received: 11 May 2022
Revised: 30 September 2022
Accepted: 03 December 2022
Published: 20 December 2022
Issue Date: October 2023
DOI: https://doi.org/10.1007/s11042-022-14282-0
