Abstract
Scene Designer is a novel method for Compositional Sketch-based Image Retrieval (CSBIR) that pairs its main retrieval task with semantic layout synthesis, both to boost performance and to enable new creative workflows. While most studies on sketch focus on single-object retrieval, we target multi-object scenes for greater query specificity and flexibility. Our training protocol improves contrastive learning by synthesising harder negative samples and introduces a layout synthesis task that further improves the semantic scene representations. We show that our object-oriented graph neural network (GNN), trained under this novel contrastive learning setting with combined search and synthesis tasks, more than doubles the current state-of-the-art (SoTA) recall@1 on the SketchyCOCO CSBIR benchmark. Furthermore, we introduce QuickDrawCOCO, the first large-scale sketched scene dataset and benchmark.







Data Availability
Instructions and code for recreating the QuickDrawCOCO dataset described in this paper can be found at the code repository for the model: https://github.com/leosampaio/scene-designer.
Ethics declarations
This work was supported by Fundação de Amparo à Pesquisa do Estado de São Paulo (grants 2017/22366-8, 2019/02808-1, 2019/07316-0), Conselho Nacional de Desenvolvimento Científico e Tecnológico (304266/2020-5), and a charitable donation from Adobe Inc.
About this article
Cite this article
Sampaio Ferraz Ribeiro, L., Bui, T., Collomosse, J. et al. Scene designer: compositional sketch-based image retrieval with contrastive learning and an auxiliary synthesis task. Multimed Tools Appl 82, 38117–38139 (2023). https://doi.org/10.1007/s11042-022-14282-0
DOI: https://doi.org/10.1007/s11042-022-14282-0