DOI: 10.1145/3595916.3626394

Adapting Hierarchical Transformer for Scene-Level Sketch-Based Image Retrieval

Published: 01 January 2024

Abstract

Sketch-based image retrieval (SBIR) is an essential application of sketches. Research on object-level SBIR is relatively mature, but the study of more complex scene-level SBIR is still in its early stages. To advance this research, we investigate previous works and identify two main shortcomings: (1) insufficient use of multi-scale features from sketches and images, and (2) a lack of effective modules to eliminate the substantial domain gap between them. To address these issues, we propose SketchRetriever, a hierarchical Transformer-based scene-level SBIR model. In our model, the hierarchical Transformer and compressors efficiently capture feature maps at various granularities and compress them into corresponding feature vectors. The modality-specific Adapters project the feature embeddings of sketches and images into the same feature space, thereby closing the domain gap between them. We adopt the adapter-tuning strategy, which not only considerably reduces the number of tunable parameters but also effectively avoids overfitting. Extensive experiments demonstrate that SketchRetriever significantly outperforms state-of-the-art methods on two benchmark datasets while requiring lower fine-tuning overhead.
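To make the adapter idea concrete, the following is a minimal, dependency-free sketch of a generic bottleneck adapter followed by cosine-similarity retrieval. It is not the SketchRetriever implementation: the class name `BottleneckAdapter`, the helper `rank_gallery`, the GELU activation, the zero-initialised up-projection, the omission of bias terms, and all dimensions are assumptions chosen for illustration. The abstract only states that modality-specific Adapters project sketch and image embeddings into one feature space and that only the adapters are tuned.

```python
import math
import random


def gelu(x):
    # Tanh approximation of the GELU activation (an assumed choice of nonlinearity).
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


class BottleneckAdapter:
    """Hypothetical adapter: down-project, GELU, up-project, residual add."""

    def __init__(self, dim, bottleneck, rng):
        # Small random down-projection with shape (bottleneck, dim).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(bottleneck)]
        # Zero-initialised up-projection with shape (dim, bottleneck),
        # so the adapter is an identity map before any training.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [gelu(sum(w * v for w, v in zip(row, x))) for row in self.down]
        delta = [sum(w * h for w, h in zip(row, hidden)) for row in self.up]
        return [a + b for a, b in zip(x, delta)]

    def num_params(self):
        # Only these weights would be tuned; the backbone would stay frozen.
        return len(self.down) * len(self.down[0]) + len(self.up) * len(self.up[0])


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    # Indices of gallery embeddings, most similar to the query first.
    return sorted(range(len(gallery)), key=lambda i: cosine(query, gallery[i]), reverse=True)


if __name__ == "__main__":
    rng = random.Random(0)
    sketch_adapter = BottleneckAdapter(dim=8, bottleneck=2, rng=rng)  # one adapter per modality
    image_adapter = BottleneckAdapter(dim=8, bottleneck=2, rng=rng)
    sketch = sketch_adapter([rng.random() for _ in range(8)])
    images = [image_adapter([rng.random() for _ in range(8)]) for _ in range(4)]
    print(rank_gallery(sketch, images), sketch_adapter.num_params())
```

Under these assumptions each adapter holds 2 × dim × bottleneck weights, which is small compared with a full Transformer layer when the bottleneck is narrow. That is the usual motivation for adapter tuning.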


    Published In

    MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
    December 2023
    745 pages
    ISBN:9798400702051
    DOI:10.1145/3595916
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. hierarchical Transformer
    2. parameter-efficient fine-tuning
    3. scene sketch
    4. sketch-based image retrieval

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    MMAsia '23
    MMAsia '23: ACM Multimedia Asia
    December 6 - 8, 2023
    Tainan, Taiwan

    Acceptance Rates

    Overall Acceptance Rate 59 of 204 submissions, 29%
