1 Introduction

Multimedia data in various modalities, such as text, images, audio, and video, are growing explosively, making fast and efficient retrieval across modalities essential [1, 2]. Image retrieval queries take various forms, such as text keywords, images, and sketches. As image retrieval technology has developed, researchers have proposed a variety of methods to meet users’ needs for different types of queries. Early methods, such as text-based image retrieval (TBIR) and sketch-based image retrieval (SBIR), mostly used a single input modality as the query. These methods show clear advantages in specific scenarios but also face significant challenges, especially when user query expressions are vague or diverse.

Early TBIR methods relied heavily on manual annotations [3]: users described the desired target image by entering keywords or phrases, and the system matched them against image tags or predefined categories, which is suitable for describing the qualitative attributes of the target [4]. However, traditional text retrieval is cumbersome and inefficient when describing multiple objects and complex positional relationships. With the development of deep learning and natural language processing, end-to-end retrieval methods based on models such as convolutional neural networks (CNNs) [5] and recurrent neural networks (RNNs) [6] have gradually become mainstream. These methods achieve more efficient image retrieval by learning the correspondence between text and images and mapping both into a shared feature space [7, 8]. In recent years, pre-trained models such as BERT [9] and CLIP [10] have further advanced TBIR, enabling models to understand richer text semantics and improving cross-modal alignment. However, TBIR still struggles when the text description is imprecise or too abstract, making it difficult for the system to capture the user’s complete intention.

SBIR is a unique form of content-based image retrieval (CBIR) that allows users to search for images using hand-drawn sketches as queries [11]. Its intuitive and natural interaction paradigm has also attracted widespread attention. Early SBIR methods mostly relied on manual features, such as shape context and SIFT features [12, 13], to perform retrieval by measuring the similarity between sketches and images. However, these manual features make it difficult to capture the complex correspondence between sketches and images, resulting in less-than-ideal retrieval results. With the rise of deep learning, end-to-end SBIR methods based on CNNs have gradually become mainstream [14,15,16]. These methods effectively narrow the domain gap between sketches and natural images by building deep neural networks to learn the feature representation of sketches and images. Despite significant advancements in SBIR, the task remains inherently challenging due to the abstract, ambiguous, and often incomplete nature of sketches, which lack critical visual attributes such as color and texture [17]. These limitations create a substantial domain gap between sketches and real-world images, complicating the process of feature extraction and matching. Additionally, variations in abstraction levels, artistic styles, and user drawing skills further exacerbate these challenges, leading to inconsistent retrieval performance. Furthermore, relying solely on a single sketch query often fails to fully capture the user’s retrieval intent, limiting the system’s ability to deliver precise results.

As research has deepened, more and more researchers have recognized the limitations of single-modality input, and multimodal retrieval has gradually become the direction for addressing this problem [18,19,20]. Multimodal retrieval methods aim to combine inputs from several different sources, such as sketches, text, and images, to capture user intent more comprehensively and thus improve retrieval accuracy. The CLIP model has demonstrated exceptional capability in learning joint representations of images and text without requiring task-specific training data [10]. CLIP achieves this by training on a large and diverse set of image-text pairs from the internet, enabling it to learn rich visual and semantic features that allow it to excel in various downstream tasks [21]. Because CLIP encodes images and text into a unified embedding space, it is exceptionally well-suited to tasks that require interpreting the interplay between visual and textual information. This makes it particularly effective for research such as SBIR augmented with textual descriptions to enhance retrieval accuracy.
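For illustration, the following minimal sketch shows how CLIP encodes an image and a text prompt into the same embedding space using the open-source `clip` package; the backbone name, file path, and prompt are placeholders rather than part of the proposed method.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)  # one of the released backbones

image = preprocess(Image.open("sketch.png")).unsqueeze(0).to(device)  # placeholder file
text = clip.tokenize(["the moon under a deep blue sky"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)  # 1024-d for RN50
    text_features = model.encode_text(text)

# Both modalities live in the same embedding space, so their cosine
# similarity is directly meaningful.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (image_features @ text_features.T).item()
```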

However, in CLIP-based multimodal image retrieval, how to effectively fuse sketch and text features into a more accurate combined representation remains an open problem. Simple linear combinations cannot fully exploit the semantic complementarity of sketches and text. To address this, we propose an image retrieval framework (CAMIR) that combines the CLIP model with a multi-head cross-attention mechanism. The goal of this study is to explore the effect of the CLIP model and this fusion method on image retrieval based on sketches and text and to verify its advantages in retrieval recall through comparison with existing methods. Previous studies have also explored the joint use of sketches and text in image retrieval, but they were limited to models pre-trained on large-scale image datasets and focused on coarse-grained retrieval tasks [22]. Sketches, however, often convey fine-grained information, so fine-grained retrieval better captures their value [15, 23,24,25]. Utilizing CLIP’s prior knowledge, we treat sketches and text as complementary while preserving the semantic information of each modality. For example, when a user enters text describing the target image (“the moon under a deep blue sky”), a typical image retrieval platform returns many semantically consistent results. If the user wants a more specific position and richer details of the moon, a sketch is a better supplement, while the color attributes in the text description compensate for the shortcomings of the sketch. Our work focuses on the association between sketches and text and utilizes a multi-head cross-attention mechanism for sketch-text fusion, obtaining combined weighted features that are semantically rich and complementary.

Last but not least, beyond the challenges of fine-grained image retrieval, the search efficiency of SBIR deserves more attention. Performing rapid SBIR over large image datasets with limited computational and memory resources is crucial for practical applications. To address this issue, we introduce an indexing stage into our sketch-and-text image retrieval pipeline. We utilize Faiss, an open-source library by Meta AI Research, to index feature vectors, which has demonstrated impressive improvements in the efficiency and accuracy of image retrieval systems [26, 27].

The main contributions of this work are as follows:

  1.

    We propose an image retrieval framework based on a fine-tuned CLIP model, employing multimodal feature fusion to combine information from hand-drawn sketches and textual descriptions for enhanced retrieval performance. The framework includes a fine-tuned CLIP feature module, a sketch-text feature fusion module, a contrastive learning module, and a Faiss indexing module.

  2.

    We design a multi-head cross-attention module to realize the feature fusion of different modalities, which further enhances the combined feature representation capability.

  3.

    Comprehensive experiments on the benchmark dataset Sketchy demonstrate the superiority of the proposed framework, and the results are rationally analyzed and interpreted. The experiments consistently show the superior performance of CAMIR compared to state-of-the-art methods.

The structure of this paper is summarized as follows: Sect. 2 reviews the research work in related areas. Section 3 describes the proposed CAMIR framework in detail. Section 4 discusses the experimental design, results, and their analysis, aiming to validate the effectiveness of the proposed framework. In Sect. 5, we summarize the conclusions of this study.

2 Related work

Several approaches for SBIR have been proposed, ranging from traditional hand-crafted feature-based methods to deep learning-based techniques [28]. Early SBIR methods concentrated on extracting low-level features from sketches and images, such as shape descriptors and texture features, often struggling with the abstract nature of sketches and high intra-class variability [13, 29,30,31]. In recent years, deep learning has revolutionized SBIR, enabling the extraction of more discriminative features from sketches and images [11]. Many methods utilize siamese or triplet network architectures to learn similarity metrics directly from data [32,33,34,35,36]. These deep learning-based methods have shown promising results in improving retrieval accuracy, particularly when trained on large-scale datasets with diverse sketch-image pairs [17]. Some research has also investigated the integration of CLIP with SBIR. However, much of this work has primarily concentrated on zero-shot retrieval, aiming to retrieve images from categories that were not seen during training [37, 38].

TBIR employs natural language queries to search for relevant images and has become predominant in coarse-grained, inter-class retrieval tasks [39]. Over the years, research in this area has focused on learning a joint embedding space through ranking losses [40,41,42], enabling effective image retrieval from textual queries. Recent optimization efforts for TBIR have primarily aimed at enhancing pairwise loss functions. With the advent of models like CLIP [10] and ALIGN [43], the latest advances have shifted towards integrating more complex models and methodologies to further improve retrieval accuracy. The work in [44] explored the effectiveness of CLIP features in TBIR, addressing the limitations of traditional image query methods; text-driven image retrieval provides a more accurate representation of user intent. CEITM [45] fine-tuned the CLIP model on Flickr8k, a well-known dataset for image-text matching, which is crucial for improving the retrieval of images from text queries. Its cosine-enhanced image-text matching framework evaluates retrieval by measuring the semantic relationship between queries and captions, offering a more efficient retrieval mechanism.

The CLIP model is trained on various image-text pairs from the Internet, enabling it to capture rich semantic information from both modalities. It has become a robust framework for learning joint representations of images and text [10]. CLIP has achieved state-of-the-art performance in various computer vision tasks, including image classification [46], object detection [47, 48], and content-based image retrieval [49, 50]. One of CLIP’s main advantages is its ability to understand images and text in a unified embedding space, allowing it to perform tasks that require reasoning about the relationships between visual and textual content. The effectiveness of CLIP in capturing semantic information from images and text makes it a promising candidate for improving SBIR systems, especially when used with text descriptions. By leveraging the representations learned from CLIP, we can address some challenges in SBIR, such as handling the ambiguity of sketches and incorporating contextual information from the text to improve retrieval accuracy.

The attention mechanism helps models focus on the most representative parts of the input data, establishes stronger connections between modalities, and plays a pivotal role in feature fusion, enhancing the quality and relevance of the extracted features. The work in [51] introduces a weakly supervised fusion network that leverages attention to improve the fusion process, utilizing channel and spatial attention interaction modules for effective feature fusion and detail preservation in infrared and visible light images. In [52], attention is used to selectively focus on relevant features in multimodal data, enhancing feature fusion and improving the performance and prediction accuracy of short video recommendations. The “Distract Your Attention” model [53] employs a multi-head attention mechanism to simultaneously focus on multiple facial regions, introducing a facial expression recognition method that addresses the challenges of distinguishing subtle differences between similar expressions and of extracting comprehensive features across facial areas. The MCAM model [54] uses cross-attention for feature fusion between text and image modalities, combining text features extracted by ALBERT with image features derived from DenseNet121, thereby improving sentiment analysis accuracy. Based on these findings, we adopt a multi-head cross-attention mechanism to integrate the extracted sketch and text features and evaluate whether it yields a more representative joint feature representation.

Previous studies have explored various ways to combine multimodal features for retrieval. A common approach is to extract visual features from reference images and textual features from captions, combine them, and use contrastive learning to improve retrieval [49, 55]. This works well in tasks like compositional condition retrieval, where understanding the relationship between images and text is crucial; however, suitable reference images illustrating user intent are not always available. Quadruplet networks explicitly consider sketch and text inputs, which is relevant to our research, but unfortunately we could not access the data or trained models for a quantitative comparison [22]. TASK-former [4] addresses image retrieval with combined text and sketch queries, following a late-fusion dual-encoder approach similar to CLIP and demonstrating improved recall for input sketches. A Sketch and Text Duet [39] introduces a compositional framework that uses a pre-trained CLIP model to combine sketches and text, highlighting the utility of the compositionality constraint. However, these studies build on the pre-trained CLIP model and fuse the modalities mainly by direct concatenation, which is too simple to represent the combined multimodal features accurately. In addition, they pay little attention to indexing efficiency and portability to resource-constrained devices.

3 Methodology

Sketch-text image retrieval aims to retrieve the best matching image given a multimodal query consisting of a sketch-text pair. Specifically, given a hand-drawn sketch and a descriptive text that complements the abstract sketch, retrieval aims to find the image that best matches both the visual appearance of the sketch and the semantic content expressed by the text. For effective retrieval, the system must understand both the semantics of the sketch and the meaning of the text. In this task, we map the combined representation of sketches and text into a shared embedding space in which each point corresponds to an image. By parsing the sketch and text information, retrieval is performed in this learned embedding space to find the image most similar to the combined sketch and text query.

Fig. 1 Schematic Diagram of the CAMIR Framework for Sketch and Text-based Image Retrieval

In this paper, we propose a multimodal image retrieval framework (CAMIR), as shown in Fig. 1. The framework consists of four modules: a multimodal feature extraction module, a multimodal feature fusion module, a contrastive learning module, and an indexing module. In the feature extraction phase, we use a fine-tuned CLIP model to extract sketch and text features. The feature fusion phase incorporates an attention mechanism to learn the similarities and relationships of sketch and text features in different subspaces, which are dynamically weighted to enhance the multimodal feature representation. The contrastive learning phase encourages consistency between combined features and visual features in the joint feature space, narrowing the distance between positive examples. The Faiss indexing phase effectively shortens retrieval time.

3.1 Multimodal feature extraction

Inspired by [55], we use the common embedding space of images and text obtained by CLIP as a starting point. First, we fine-tune CLIP’s image and text encoders so that their embedding spaces match the new downstream task. The authors of [49] proposed using addition operations to break the symmetry of the embedding space, which is effective in multimodal retrieval tasks. In their study, the pre-trained CLIP image encoder can be used without modification since the reference and target images come from the same domain, and only the weights of the text encoder need to be updated. We use the CLIP encoders to extract features of sketches, texts, and natural images. Then, we add the sketch and text features element-wise and apply L2 normalization to remove the effect of feature magnitude. Finally, the predicted and target features are fed into a contrastive loss, which updates the model parameters. Using the pre-trained CLIP model fine-tuned for this task helps learn robust representations that capture the semantic essence of sketches, even when their quality or level of detail varies.
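As a minimal sketch of this step (assuming a CLIP-style model exposing `encode_image` and `encode_text`; all names are illustrative), the element-wise addition and L2 normalization can be written as:

```python
import torch.nn.functional as F

def combined_query_features(clip_model, sketch_batch, text_tokens):
    """Element-wise sum of sketch and text CLIP features followed by
    L2 normalization, as used in the fine-tuning stage (Sect. 3.1)."""
    sketch_feat = clip_model.encode_image(sketch_batch).float()
    text_feat = clip_model.encode_text(text_tokens).float()
    predicted = F.normalize(sketch_feat + text_feat, dim=-1)
    return predicted
```

The resulting `predicted` feature is compared against the target image feature with the contrastive loss described in Sect. 3.3.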

3.2 Multimodal feature fusion

This module combines features from multiple modalities to produce a unified multimodal query representation. The query, consisting of both sketch and text inputs, is enriched through a multi-head cross-attention mechanism and dynamically weighted unimodal features, resulting in a fused feature representation.

After the feature extraction module, we obtain the sketch features \(\:{\varvec{s}}_{i}\) and text features \(\:{\varvec{t}}_{i}\) extracted by the fine-tuned CLIP. First, the sketch and text features are projected through two separate linear layers, changing the feature dimension from the CLIP feature dimension to the projection dimension. To improve nonlinear expressiveness, each projection layer is followed by a ReLU activation, and a Dropout layer is used to enhance robustness and prevent overfitting. The projected feature representations are:

$${s_i^\prime = Dropout\left( {ReLU\left( {{W_s}{s_i}} \right)} \right)}$$
(1)
$${t_i^\prime = Dropout\left( {ReLU\left( {{W_t}{t_i}} \right)} \right)}$$
(2)

where \(\:{\varvec{W}}_{s}\) and \(\:{\varvec{W}}_{t}\) are the projection matrices of the sketch and text, and \(\:{\varvec{s}}_{i}^{{\prime\:}}\) and \(\:{\varvec{t}}_{i}^{{\prime\:}}\) are the sketch and text features after projection.
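A minimal PyTorch sketch of the projection in Eqs. (1)–(2) is shown below; the CLIP and projection dimensions are illustrative, and a bias term is included by `nn.Linear` although the equations omit it.

```python
import torch.nn as nn

class ModalityProjection(nn.Module):
    """Linear projection + ReLU + Dropout for one modality (Eqs. 1-2)."""
    def __init__(self, clip_dim=1024, proj_dim=512, dropout=0.5):
        super().__init__()
        self.proj = nn.Linear(clip_dim, proj_dim)
        self.act = nn.ReLU()
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        return self.drop(self.act(self.proj(x)))

sketch_proj = ModalityProjection()  # produces s'_i
text_proj = ModalityProjection()    # produces t'_i
```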

Even after global features have been extracted, the relationship between sketches and text may not be simple. Multi-head cross-attention helps the model capture complex relationships between modalities [53, 54, 56] rather than relying on simple feature concatenation. To fuse sketch and text features effectively, a multi-head cross-attention mechanism is used to capture the correlation between the two modalities. Specifically, the text features serve as queries, and the sketch features serve as keys and values in the multi-head attention layer. For each attention head \(\:j\), the output is expressed as:

$${{A_j} = Attention\left( {{Q_j},{K_j},{V_j}} \right)}$$
(3)

where \(\:{\varvec{Q}}_{j}={\varvec{t}}_{i}^{{\prime\:}}{\varvec{W}}_{j}^{Q}\) is the query vector obtained by the linear transformation matrix \(\:{\varvec{W}}_{j}^{Q}\in\:{\mathbb{R}}^{d\times\:{d}_{k}}\); \(\:{\varvec{K}}_{j}={\varvec{s}}_{i}^{{\prime\:}}{\varvec{W}}_{j}^{K}\) is the key vector obtained by the linear transformation matrix\(\:\:{\varvec{W}}_{j}^{K}\in\:{\mathbb{R}}^{d\times\:{d}_{k}}\); \(\:{\varvec{V}}_{j}={\varvec{s}}_{i}^{{\prime\:}}{\varvec{W}}_{j}^{V}\) is the value vector obtained by the linear transformation matrix \(\:{\varvec{W}}_{j}^{V}\in\:{\mathbb{R}}^{d\times\:{d}_{v}}\). The calculation of each attention head is implemented by scaled dot-product attention:

$${Attention\left( {{Q_j},{K_j},{V_j}} \right) = softmax\left( {{Q_j}K_j^T/\sqrt {{d_k}} } \right){V_j}}$$
(4)

where \(\:{\varvec{Q}}_{j}{\varvec{K}}_{j}^{T}\) calculates the dot product similarity between the query and the key. \(\:\sqrt{{d}_{k}}\) is a scaling factor to avoid the dot product value being too large. Softmax normalizes the dot product result. Finally, the attention weight is multiplied by the value vector \(\:{\varvec{V}}_{j}\) to generate the attention output \(\:{\varvec{A}}_{j}\). The multi-head attention mechanism enhances the expressiveness of the model by computing multiple attention heads in parallel. The outputs of all attention heads are concatenated to form the final attention output:

$${MultiHead\left( {Q,K,V} \right) = Concat\left( {{A_1},{A_2}, \ldots \>,{A_n}} \right){W^O}}$$
(5)

where \(\:{\varvec{A}}_{j}\) is the output of the \(\:j\)th attention head, with dimension \(\:{\mathbb{R}}^{T\times\:{d}_{v}}\). \(\:{\varvec{W}}^{O}\) is the linear transformation matrix of the multi-head attention output, with dimension \(\:{\mathbb{R}}^{{nd}_{v}\times\:{d}_{model}}\), where \(\:n\) is the number of heads and \(\:{d}_{model}\) is the dimension of the model output. In this method, the text feature \(\:{\varvec{t}}_{i}^{{\prime\:}}\) serves as \(\:\varvec{Q}\), and the sketch feature \(\:{\varvec{s}}_{i}^{{\prime\:}}\) serves as \(\:\varvec{K}\) and \(\:\varvec{V}\). The multi-head cross-attention mechanism captures the fine-grained relationship between sketch and text to generate the fused attention output:

$${A = MultiHead\left( {t_i^\prime ,s_i^\prime ,s_i^\prime } \right)}$$
(6)
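A minimal sketch of this cross-attention step using PyTorch’s built-in multi-head attention is shown below; the batch size, embedding dimension, and number of heads are illustrative rather than the settings of the original implementation.

```python
import torch
import torch.nn as nn

proj_dim, num_heads = 512, 8
cross_attn = nn.MultiheadAttention(embed_dim=proj_dim, num_heads=num_heads,
                                   batch_first=True)

t_prime = torch.randn(32, 1, proj_dim)  # projected text features (B, 1, d)
s_prime = torch.randn(32, 1, proj_dim)  # projected sketch features (B, 1, d)

# A = MultiHead(Q = t', K = s', V = s') as in Eq. (6)
attn_out, attn_weights = cross_attn(query=t_prime, key=s_prime, value=s_prime)
```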

The generated attention output \(\:\varvec{A}\) is subsequently fused and concatenated with the projected text feature \(\:{\varvec{t}}_{i}^{{\prime\:}}\) into a new joint feature representation:

$${combine{d_{features}} = Concat\left( {t_i^{\prime \>},A} \right)}$$
(7)

This joint feature representation combines the multimodal information of the sketch and the text.

To combine sketch and text information flexibly, we also design a dynamic scalar module, which generates a scalar \(\:{\alpha\:}_{i}\) (between 0 and 1) through a series of fully connected layers and activation functions for the weighted fusion of sketch and text features. The scalar is calculated as:

$${{\alpha _i} = Sigmoid\left( {{W_{\alpha \>}} \cdot combine{d_{features}}} \right)}$$
(8)

The final output feature is the weighted sum of sketch features, text features, and combined features:

$${{y_i} = {c_i} + {\alpha _i} \cdot {t_i} + \left( {1 - {\alpha _i}} \right) \cdot {s_i}}$$
(9)

where \(\:{y}_{i}\) is the normalized joint feature representation, \(\:{c}_{i}\) denotes the combined feature obtained in Eq. (7), and \(\:{\alpha\:}_{i}\) determines the contribution weights of the sketch and text features in the fused feature.
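A minimal sketch of Eqs. (7)–(9) is given below. The projection of the concatenated feature back to the CLIP feature dimension (to obtain \(\:{c}_{i}\)) and the layer sizes of the scalar head are our assumptions; the paper specifies only a series of fully connected layers followed by a sigmoid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicWeightedFusion(nn.Module):
    """Concatenation (Eq. 7), dynamic scalar (Eq. 8), weighted sum (Eq. 9)."""
    def __init__(self, proj_dim=512, clip_dim=1024):
        super().__init__()
        self.combine_proj = nn.Linear(2 * proj_dim, clip_dim)  # assumed projection to obtain c_i
        self.alpha_head = nn.Sequential(
            nn.Linear(clip_dim, clip_dim // 2), nn.ReLU(),
            nn.Linear(clip_dim // 2, 1), nn.Sigmoid(),
        )

    def forward(self, attn_out, t_prime, t_feat, s_feat):
        combined = torch.cat([t_prime, attn_out], dim=-1)    # Eq. (7)
        c = self.combine_proj(combined)                      # c_i
        alpha = self.alpha_head(c)                           # Eq. (8)
        y = c + alpha * t_feat + (1.0 - alpha) * s_feat      # Eq. (9)
        return F.normalize(y, dim=-1)

# usage with illustrative shapes
fusion = DynamicWeightedFusion()
y = fusion(torch.randn(32, 512), torch.randn(32, 512),
           torch.randn(32, 1024), torch.randn(32, 1024))
```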

Through the above steps, the network successfully integrates the features of sketches and texts and generates a high-quality joint feature representation that considers both information. The multi-head attention mechanism in our feature fusion module strengthens the representation by focusing on meaningful patterns in the combined sketch-text input, which can help to reduce sensitivity to minor variations in sketch quality. Introducing the multi-head cross-attention mechanism improves the interactivity between different modalities, and the dynamic scalar weighting mechanism flexibly adjusts the weights of image and text features, enhancing the feature representation ability.

3.3 Contrastive learning

The training process uses a contrastive learning strategy, with the objective of minimizing the distance between the fused predicted features and the target image features. Each batch consists of (sketch, text, target image) triplets. To compute the similarity score of each sample in the batch, the multimodal combined feature \(\:{y}_{i}\) is dotted with the target image feature \(\:{I}_{i}\). The dot product measures the similarity between the combined feature and the target image feature:

$${logits = {y_i} \cdot I_i^T}$$
(10)

where \(\:{\varvec{y}}_{i}\in\:{\mathbb{R}}^{B\times\:D}\) and \(\:{I}_{i}\in\:{\mathbb{R}}^{B\times\:D}\), \(\:B\) is the batch size, and \(\:D\) is the feature dimension. The output \(\:logits\in\:{\mathbb{R}}^{B\times\:B}\) matrix contains the similarity between the multimodal combination features of each sample and the target image features. To implement contrastive learning, we use batch-based classification (BBC) loss as the loss function, similar to [55]:

$${Loss = \frac{1}{B}\sum\limits_{i = 1}^{B} - \log \frac{\exp \left\{ \lambda \cdot \kappa \left( {y_i},{I_i} \right) \right\}}{\sum\nolimits_{j = 1}^{B} \exp \left\{ \lambda \cdot \kappa \left( {y_i},{I_j} \right) \right\}}}$$
(11)

where \(\:{y}_{i}\) is the multimodal combined feature, \(\:{I}_{i}\) is the target image feature, and \(\:\kappa\:\) denotes the dot-product similarity between them. Following the strategy of CLIP [10], we multiply the dot product between the combined feature and the target feature by the temperature parameter \(\:\lambda\:=100\) before computing the loss, which helps training by increasing the dynamic range of the logits.
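Since each target image is the positive example of its own query within the batch, the loss reduces to a cross-entropy over the \(\:B\times\:B\) similarity matrix; a minimal PyTorch sketch (tensor names illustrative) is:

```python
import torch
import torch.nn.functional as F

def batch_based_classification_loss(y, target_img_feat, temperature=100.0):
    """BBC loss of Eqs. (10)-(11): the i-th combined feature should match
    the i-th target image within the batch."""
    y = F.normalize(y, dim=-1)
    target_img_feat = F.normalize(target_img_feat, dim=-1)
    logits = temperature * y @ target_img_feat.T        # (B, B) similarity matrix
    labels = torch.arange(y.size(0), device=y.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```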

3.4 Faiss index

In image retrieval and related deep learning tasks, the Faiss library provides efficient indexing structures for large-scale similarity search [27]. Faiss supports various index types; this study focuses on two common ones, IndexFlatL2 and IndexFlatIP, which perform nearest-neighbor search based on Euclidean distance and inner product, respectively. For comparison, we first conducted our experiments without Faiss, recording the recall of the target image and the query time. These experiments use cosine similarity, which measures the angle between two vectors: a smaller distance indicates that the query features are closer to the features in the database, and a larger distance indicates that they are less similar. All experiments were then repeated with Faiss using the two index types: IndexFlatL2 performs an exact search with the L2 distance, while IndexFlatIP performs an exact search with a cosine-like inner-product metric.
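A minimal sketch of the two index types with the Faiss Python API is shown below; the feature matrices are random placeholders, and normalizing the vectors before IndexFlatIP makes the inner product equivalent to cosine similarity.

```python
import numpy as np
import faiss  # pip install faiss-cpu (or faiss-gpu)

d = 1024                                               # feature dimension (RN50)
gallery = np.random.rand(10000, d).astype("float32")   # database image features
queries = np.random.rand(5, d).astype("float32")       # fused query features

# Exact search with L2 distance.
index_l2 = faiss.IndexFlatL2(d)
index_l2.add(gallery)
dist_l2, idx_l2 = index_l2.search(queries, 10)

# Exact search with inner product on L2-normalized vectors (cosine-like).
faiss.normalize_L2(gallery)
faiss.normalize_L2(queries)
index_ip = faiss.IndexFlatIP(d)
index_ip.add(gallery)
sim_ip, idx_ip = index_ip.search(queries, 10)
```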

4 Experiments

We conducted extensive experiments on a public dataset to verify the effectiveness of our framework in realistic image retrieval scenarios. Section 4.1 introduces the dataset and preprocessing used in the experiments. Section 4.2 describes the experimental settings and environment in detail. Section 4.3 introduces the evaluation metrics. Section 4.4 presents the experimental results and analysis. Section 4.5 reports the ablation study.

4.1 Dataset

In our experiments, we used Sketchy [34], the largest FG-SBIR dataset. It contains 12,500 natural images of objects and 75,471 sketches drawn by crowd workers based on the images. These images are evenly distributed across 125 categories, with each image corresponding to at least five sketches. All images and sketches were resized to 256 × 256 pixels. In addition to the natural images and hand-drawn sketches, we also constructed 75,471 triplets (sketch, natural image, text) based on the additional information provided by the dataset. The text associated with the sketches comes from the WorkerTag in the Sketchy information table. We followed the guidelines from [22], with 90% of the data used for training and the rest for testing.

The experiments use CLIP’s standard image preprocessing, which involves two main steps: resizing and center cropping. The shorter side of the image is resized to CLIP’s input size; since the resized image is generally not square, a center crop is then applied to obtain a square patch whose size matches CLIP’s input dimensions (input dim × input dim).
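The following sketch approximates this preprocessing with torchvision transforms; the normalization statistics follow the released CLIP code, and the file path is a placeholder.

```python
from PIL import Image
from torchvision import transforms

input_dim = 224  # CLIP RN50 input size
clip_preprocess = transforms.Compose([
    transforms.Resize(input_dim, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(input_dim),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])

sketch_tensor = clip_preprocess(Image.open("sketch.png").convert("RGB"))
```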

4.2 Experimental settings

The experiments were conducted on a system running Ubuntu 20.04 and equipped with an NVIDIA RTX 3090 GPU with 24 GB of memory. The software environment included CUDA 11.3, Python 3.8, and PyTorch 1.11.0. CLIP was released with five ResNet backbones (ResNet50, ResNet101, RN50 × 4, RN50 × 16, RN50 × 64) [10]; we use ResNet50 (RN50) as the base model for our experiments. RN50 takes a 224 × 224 input image and outputs 1024-dimensional features, and its text encoder is a Transformer with a width of 640. In the fine-tuning stage, we used the AdamW optimizer [57] with a learning rate of 2e-6 for 100 epochs; due to the limitations of the experimental equipment, the batch size was set to 64. In the subsequent contrastive learning stage, the visual and text encoders were kept frozen, the dropout rate was set to 0.5, the learning rate was 2e-5, the model was trained for 300 epochs, and the batch size of the RN50 experiment was set to 1024.
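A minimal sketch of this two-stage optimizer setup is shown below; the module objects are stand-ins (simple linear layers) for the actual CLIP encoders and fusion network.

```python
import torch
import torch.nn as nn

clip_model = nn.Linear(1024, 1024)     # stand-in for the CLIP encoders
fusion_module = nn.Linear(1024, 1024)  # stand-in for the fusion module

# Stage 1: fine-tune the CLIP encoders (AdamW, lr 2e-6, 100 epochs, batch 64).
ft_optimizer = torch.optim.AdamW(clip_model.parameters(), lr=2e-6)

# Stage 2: freeze the encoders and train the fusion module
# (AdamW, lr 2e-5, 300 epochs, batch 1024).
for p in clip_model.parameters():
    p.requires_grad = False
fusion_optimizer = torch.optim.AdamW(fusion_module.parameters(), lr=2e-5)
```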

4.3 Evaluation metrics

We use Recall@K as the evaluation metric in our experiments. Given a query, Recall@K equals 1 if the relevant image is among the top K retrieved images and 0 otherwise; averaged over all queries, it measures how often the target image appears within the top K results. In coarse-grained retrieval, a returned result is considered correct if it belongs to the same category as the target, and the task is relatively simple. In fine-grained retrieval, only one correct image in the database corresponds to the query sketch and text, and the metric reflects how frequently that image appears in the top K results over the test set. Because this task is more demanding, we report a broader range of K values.
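A minimal sketch of how Recall@K can be computed from a query-gallery similarity matrix is shown below; variable names are illustrative.

```python
import numpy as np

def recall_at_k(similarity, target_indices, k):
    """Fraction of queries whose ground-truth image appears in the top-k results.

    similarity: (num_queries, num_gallery) score matrix.
    target_indices: target_indices[i] is the gallery index of query i's true image.
    """
    topk = np.argsort(-similarity, axis=1)[:, :k]
    hits = [t in row for t, row in zip(target_indices, topk)]
    return float(np.mean(hits))
```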

4.4 Experimental results and analysis

In this section, we first introduce a set of experiments that demonstrate the performance improvement brought by fine-tuning CLIP in the feature extraction stage and the effectiveness of multi-modal features fused with multi-head attention and dynamic weighting. With RN50 as the backbone, we performed two sets of experiments. The first group includes fine-tuning only the image encoder, only the text encoder, and fine-tuning all encoders. The second group includes feature element-wise summation, features without attention fusion, features without dynamic weighting, and our proposed method.

Table 1 Recall@K of different fine-tuning methods. FG_recall@K represents the recall value of fine-grained retrieval, and CG_recall@K represents the recall value of coarse-grained retrieval
Fig. 2 Recall curves with epochs for different fine-tuning methods

As shown in Fig. 2 and Table 1, the improvement in recall confirms the necessity of fine-tuning the entire encoder in the feature extraction module. CLIP is pre-trained on large-scale image-text alignment to generate general visual-text features, but sketch features differ substantially in distribution from those of natural images: sketches carry strong contour information and lack detail and color. Fine-tuning the encoder on sketch data adapts it to this distribution and improves its ability to represent sketches. Fine-tuning the dual encoder achieves better alignment between sketches, texts, and images, so that sketches, texts, and target images with the same semantics lie closer in the common embedding space. The proposed CAMIR directly benefits from the good starting point provided by fine-tuning CLIP. As shown in Fig. 2(c) and Fig. 2(f), fine-tuning only the text encoder yields FG_recall@1 and CG_recall@1 of 27.37 and 28.56, respectively, indicating that the limited information in a short text description alone is insufficient for accurate retrieval. However, performance improves significantly at higher thresholds, reaching 69.25 and 71.91 for FG_recall@5 and CG_recall@5, respectively. This suggests that fine-tuning the text encoder alone, although unable to reliably retrieve exact matches, still ranks related images relatively high. In comparison, fine-tuning only the image encoder significantly improves all recall metrics, including 63.16 for FG_recall@1 and 65.02 for CG_recall@1, highlighting the importance of fine-grained visual information in image retrieval. Fine-tuning both the text and image encoders gives the best overall performance, with improvements in both coarse- and fine-grained retrieval compared to using only the image encoder. This suggests that the image encoder is more effective at capturing the fine-grained visual information required for sketch-based retrieval, that the text encoder can still provide supplementary information even with limited text descriptions, and that fine-tuning both encoders yields a more balanced model.

Fig. 3 Recall values of different methods at K on the test set

In the multi-modal feature fusion module, the network combines a multi-head attention mechanism and a dynamic weighting mechanism to integrate text and sketch features. As shown in Fig. 3, we evaluate four configurations: the element-wise sum of sketch and text features (FT-Both), the feature representation without attention fusion (w/o Attn), the feature representation without dynamic weighting of sketch and text (w/o Weigh), and the full proposed method (CAMIR). The recall values of w/o Attn and FT-Both are close to those of CAMIR, with only slight differences; notably, their coarse-grained retrieval metrics improve slightly, indicating that the dynamic weighting mechanism is more influential than the attention mechanism in some cases and highlighting the advantage of dynamically adjusting each modality’s contribution according to its relevance. When the output relies solely on the attention-based multi-modal features, performance degrades significantly: a purely attention-based feature may lack the flexibility to adjust to the relative importance of the text and sketch inputs. The experimental results show that combining the attention and dynamic weighting mechanisms is crucial for generating robust feature representations in multi-modal retrieval. While the attention mechanism is valuable for capturing contextual relationships between sketch and text features, it is insufficient on its own; the dynamic weights mediate the influence of each modality, ensuring that the most relevant information is emphasized during fusion.

To further demonstrate the effectiveness of our method, we compare with previous studies on the Sketchy benchmark; the results are reported in Table 2. Our model achieves the best performance and significantly outperforms all existing models on Sketchy. Compared with the classic Triplet Network, Quadruplet Network, and AE-Net, recall@1 improves by 26.53%, 21.47%, and 9.04%, respectively, illustrating the powerful multi-modal feature extraction capability of the fine-tuned CLIP model. As can be seen from Table 2, when comparing against results with the same architecture (e.g., RN50, RN101), using only the fine-tuned CLIP already brings significant improvements. The method in [58] demonstrates through knowledge distillation that small models can effectively learn from large models and even approach their accuracy; its optimization strategy, which incorporates a relative triplet loss and batch normalization, makes the architecture more efficient and achieves a recall of 62.38%, only slightly lower than our method. However, distillation involves a trade-off between efficiency and performance, and its effectiveness depends on the quality of the teacher model.

Table 2 Comparison between our method and current state-of-the-art models on the sketchy test set

In another experiment, we compared the impact of Faiss on queries during the indexing phase. In Table 3, recall and query time are used as evaluation metrics to show the overall performance of Faiss, obtained from experiments on the full Sketchy test set. The models compared include our proposed CAMIR and its three variants. For regular (non-Faiss) indexing we use cosine similarity to measure the distance between feature vectors [62]; with Faiss we use the IndexFlatIP and IndexFlatL2 index types. Table 3 shows that indexing with Faiss significantly reduces query time while keeping recall unchanged. Both IndexFlatIP and IndexFlatL2 perform exact queries, with IndexFlatIP performing better. The time to build the index is negligible because it only needs to be built once, after which the model can retrieve query results quickly. Faiss is designed as an efficient and extensible library that integrates smoothly with existing deep learning frameworks such as PyTorch, and our experiments demonstrate that the retrieval task can easily combine Faiss with the feature extraction process to form a complete pipeline for sketch and text-based image retrieval.

Table 3 Comparison of retrieval recall and time using Faiss (IndexFlatIP and IndexFlatL2) and not using Faiss

In addition, Fig. 4 shows example visualizations of the top-10 retrieval results for several given sketches and texts. The results show that our model performs well for both inter-class retrieval (the returned images mainly belong to the same category) and intra-class retrieval (the returned images contain the target instance and rank it highly). However, there are also cases where the target image is not ranked first. For example, in the second row of Fig. 4, while the network successfully retrieves the correct image (an apple held in a child’s hands), the top-ranked result is another image of an apple held by a hand, but not a child’s. This could be attributed to the quality of the sketch (partial occlusion of the apple in the target image) or to the weighting strategy used during multimodal feature fusion. The multi-head cross-attention mechanism and dynamic weighting approach might assign disproportionate emphasis to the geometric features of the sketch relative to the semantic information in the text. Although the network effectively prioritizes images containing hands and apples close to the query embedding, the prominent visual features of the sketch (the apple) may dominate in the feature space, leading the model to favor obvious visual elements while overlooking textual details such as “child’s hands.” Beyond these examples, we generally observe that retrieval results suffer when the sketch fails to accurately represent the target image, although such results typically still resemble the input sketch visually. High-quality sketches and detailed natural language descriptions tend to produce more accurate retrieval results.

Fig. 4 Examples of images retrieved by CAMIR, randomly selected from our benchmark. Each query consists of a text description (shown above the image) and an input sketch (shown first in the row of the image). In each case, the image that forms a matching pair with the query is highlighted with a red border

4.5 Ablation study

In this section, we verify the design of the architecture shown in Fig. 1 at the module level and report the results of different variants, all using RN50 as the backbone. CAMIR simultaneously introduces a fine-tuned CLIP model, attention-fused multimodal features, and dynamically weighted combined features; in addition, Faiss is used to construct the index. To evaluate the contribution of each component, we conducted several comparisons, and Table 4 reports the highest recall obtained by each variant. For retrieval tasks involving sketch data, fine-tuning the CLIP model is necessary. During pre-training, CLIP is mainly exposed to high-resolution natural images, which contain rich texture, color, and lighting information, whereas sketches are simplified line drawings lacking color, shading, and intricate detail. This modality gap prevents the pre-trained CLIP model from fully capturing the concise, abstract geometric structure of sketches. Fine-tuning exposes the model to sketch data so that it learns sketch-specific representations and better adapts to the sketch retrieval task. In our experiments, the visual and text encoders are fine-tuned simultaneously, enhancing the model’s cross-modal alignment and improving its robustness and accuracy in multimodal image retrieval.

It is worth noting that using only the weighted sum of sketch and text features as the multimodal representation already yields impressive results, close to those of our full model in fine-grained retrieval and even slightly higher in coarse-grained retrieval. The dynamic weighting mechanism is crucial for performance, as evidenced by the significant drop when it is removed (CAMIR-w/o Weighting), suggesting that the ability to adaptively balance text and sketch features is fundamental to this task. The multi-head cross-attention mechanism, in turn, adaptively focuses on different parts of the input features and models the correlation between sketches and text, retaining more contextual information during fusion. The full model has the highest computational complexity because it includes both the attention and dynamic weighting mechanisms. CAMIR-w/o ATTN significantly reduces computation while maintaining performance, with only a slight drop in fine-grained recall, whereas CAMIR-w/o Weighting, despite its lower complexity, suffers substantial performance degradation. The attention mechanism increases computational cost but provides enhanced interpretability and room for future extension, while the dynamic weighting mechanism adds minimal overhead yet is critical to performance. For category-level retrieval and applications with severely constrained computational resources, CAMIR-w/o ATTN offers an excellent compromise, maintaining performance while reducing complexity. The effectiveness of Faiss is evident from the experimental results in Table 3, in which IndexFlatIP brings the larger gains.

Table 4 Ablative study of CAMIR and its variants at the module level

4.6 Limitations and future work

Table 2; Fig. 4 demonstrate the potential of our CAMIR framework; however, several aspects still require further investigation and refinement. First, the computational complexity of the framework poses challenges for practical use, particularly in real-time applications. To address this issue, future work will focus on simplifying the model structure to enhance efficiency without compromising retrieval accuracy. Second, the current evaluation is limited to the Sketchy dataset. Expanding the analysis to other datasets, such as QuickDraw and additional multimodal benchmarks, will provide a better assessment of the framework’s generalizability and robustness, which remains a key direction for future research. Finally, the structured sketch-photo pairs and concise descriptions in the Sketchy dataset may not fully capture the complexities of real-world data. Further exploration of the framework’s performance on datasets with less structured relationships or more ambiguous text is necessary to ensure broader applicability.

5 Conclusion

This study proposes an image retrieval framework (CAMIR) based on a fine-tuned CLIP model, multimodal feature fusion, and contrastive learning. This framework illustrates how to leverage multimodal information from hand-drawn sketches and textual descriptions for image retrieval, an aspect that remains underexplored in sketch-based image retrieval. Different from previous methods, our CAMIR framework integrates the pre-trained CLIP model as a feature encoder for sketches, text, and natural images, combined with an attention mechanism to improve retrieval performance.

In CAMIR, the fusion of a multi-head cross-attention mechanism and sketch-text dynamic weighted features enhances the consistency and complementarity between different modalities, improving cross-modal understanding. The multi-head attention mechanism effectively captures associations of fine-grained information, enhancing multimodal feature representation. Furthermore, Faiss is introduced as an auxiliary tool to accelerate similarity searches during indexing, reducing computational overhead in feature retrieval and ensuring robustness and scalability for large-scale datasets.