1 Introduction

With the rapid development of Web 2.0, first-person engagement and crowdsourced content creation have boomed as new paradigms of interaction. Digital media resources, especially images, have acquired rich subjective and expressive dimensions of cognitive content, which motivates research on the retrieval of emotional information. Human perception and understanding of image emotional information operate mainly at the semantic level. However, the "semantic gap" between low-level image features and high-level emotional semantics can hardly be bridged completely.

This research aims to propose a tag-based approach for retrieving the emotional content of images. Tags are image descriptions added directly by users, so image emotional semantic retrieval can be implemented with text retrieval technology, without extracting information from the images themselves. To increase the number of tags, the channels of tag generation are expanded to include relevant user interaction behaviors.

In this research, a cognitive quantification model of images' emotional qualities, as perceived by users, is constructed to organize and manipulate social image resources. The model is also applied to emotional semantic tag recommendation, which improves the efficiency of image annotation and the validity of image recommendation.

This research proposes a novel mechanism for the emotional semantic retrieval of images, with the following advantages:

  • Retaining users' subjective view of images as fully as possible by using user-generated tags, which ensures the credibility of retrieval and keeps the approach lightweight.

  • Binding users' behaviors and views together in image retrieval, grounded in behavioral psychology, which expands tag sources and provides more data for modeling.

  • Mining the latent emotional semantics of images with the cognitive quantification model, which improves the effectiveness of image emotional semantic retrieval. Moreover, diverse associated tags can be recommended according to their weights in the model, which further improves emotional image annotation.

The rest of the paper is organized as follows. An overview of the tag-based image emotional semantic retrieval mechanism is presented in Sect. 2. Section 3 discusses the details of the methods used in the mechanism. To verify the rationality of the proposed mechanism, experiments with a small sample size were conducted; the process and analyses are presented in Sect. 4. Section 5 gives the summary and outlook.

2 Concepts and Methods

2.1 Concepts

Image Tag.

Tags are keywords added by users to describe image contents. In particular, tags are not only explicit labels but can also be keywords in titles, comments, and so on. Since tags are textual image descriptors, images can be recommended directly with text retrieval technology, with no need to extract and analyze visual information.

An image can carry multiple tags, and a tag can describe multiple images. The user, as the creator of annotation behaviors, establishes the association between images and tags.

At present, annotation behaviors include the following:

  • Adding the title or labels when uploading images;

  • Adding the labels or grouping when collecting images;

  • Making comments on images.

The tags generated by the above behaviors are "explicit tags". In practice, these annotation behaviors are optional and costly for users, who often lack motivation; a large portion of users only view images without leaving a tag.

However, studies have shown that users' image retrieval behaviors reflect how much they agree with the retrieval results, revealing the relevance between retrieval keywords and images. From the data of users' browsing behaviors, "implicit tags" can be derived. The details are discussed in Sect. 2.2.

Emotional Quantification Model.

Image semantics has several levels. Emotional semantics lies at the highest level of abstraction; it can be defined as the semantics describing the intensity and type of feelings, moods, affections, or sensibilities evoked in humans by viewing images. It is usually expressed in adjective form: romantic, brilliant, etc.

Constructing an image emotional computational model usually involves three parts:

  • extracting image perceptual features that can stimulate users’ emotions;

  • establishing the emotional recognition mechanism to bridge the semantic gap between low-level visual features and high-level emotional semantics;

  • constructing the model to represent image emotional semantics that meet the needs of users’ query.

Visual recognition and machine learning are the main methods for the first two parts. They aim to build an easily retrievable association between images and their emotional semantics, which can be implemented simply through "tags".

Based on tags, this paper focuses on constructing a model to quantify the image emotions that users search for.

In general emotional semantic models, a specific emotion is decomposed onto the six basic dimensions of emotion: anger, disgust, fear, joy, sadness, and surprise. Models relying only on these six basic dimensions are not fine-grained enough to represent complex emotional semantics or to distinguish the various emotions clearly.

Learning from this idea, in this paper emotional semantics are represented by more flexible and more targeted "emotional dimensions", which contain a variety of "emotional elements" associated by certain relationships. The emotional semantic quantification model is expressed through emotional dimensions, and each emotional dimension is extracted from emotional elements.

2.2 Methods

Users’ Retrieval Behavior.

As mentioned in the previous section, adding image tags is optional and costly for users, and most users only browse images without leaving a tag. Studies have shown that users' image retrieval behaviors reflect the degree of users' recognition of the search results, that is, in an image retrieval system, the relevance between the search keywords and the images.

From the data of users' retrieval behaviors, it can be predicted whether an image is associated with the search keyword; if so, the search keyword can be added to the image as an "implicit tag".

When users retrieve images, the operations that generate implicit tags are as follows:

  • clicking to view an image after retrieval;

  • downloading/saving an image after retrieval;

  • taking a screenshot of an image after retrieval.

Among them, we remain neutral on the click-to-view operation, because we cannot exclude clicks driven by curiosity and the like rather than by recognition.

Combining these with the behaviors generating explicit tags mentioned in the previous section, the relationship between the tags generated by users' behaviors (explicit and implicit tags) and the images is divided into three levels: relevant (1), neutral (0.5), and non-relevant (0), as shown in Fig. 1.

Fig. 1. Relevance of behavior and tag

A user may produce more than one annotation behavior for an image, but the relevance degrees of these behaviors do not accumulate beyond 1; as long as there is at least one strong annotation behavior (relevance 1), the tag is added to the image for that user.
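To make the rule concrete, the following sketch computes a user's tag-image relevance as the maximum, not the sum, over annotation behaviors. The behavior names and their level assignments are hypothetical, chosen only to be consistent with the three levels described above.

```python
# A minimal sketch of the non-cumulative relevance rule; the mapping of
# behaviors onto the three levels (1, 0.5, 0) is an assumption from the text.
BEHAVIOR_RELEVANCE = {
    "upload_title": 1.0,  # explicit tags: strong annotation behaviors
    "label":        1.0,
    "comment":      1.0,
    "download":     1.0,  # implicit tags from retrieval behaviors
    "snapshot":     1.0,
    "click_view":   0.5,  # neutral: curiosity cannot be excluded
}

def tag_relevance(behaviors):
    """Relevance of one (user, image, tag) triple: max over behaviors."""
    return max((BEHAVIOR_RELEVANCE.get(b, 0.0) for b in behaviors), default=0.0)

# One strong behavior suffices; repetition does not push relevance past 1.
assert tag_relevance(["click_view", "click_view"]) == 0.5
assert tag_relevance(["click_view", "download"]) == 1.0
```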

Tag Clustering Analysis.

Clustering is a common data analysis tool and a basic algorithm for data mining. The essence of clustering analysis is to divide data into several clusters according to relevance, so that similarity is high within clusters and differences are large between clusters.

Tag clustering can be used to find semantically related tags in social annotation systems (Begelman et al.). The principal tag of a cluster can be identified as the subject of that cluster. If the clusters constitute a specific emotion, their principal tags are the "emotional elements".

The semantic relevance of two tags can be obtained from semantic knowledge databases, such as WordNet (for English) and CSC (for Chinese), and used to build a semantic correlation matrix (Fig. 2).

Fig. 2. Semantic correlation matrix of tags
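As an illustration, the following sketch builds such a matrix from WordNet path similarity via NLTK. The tag list is hypothetical, and path similarity is assumed here as a stand-in for whatever relevance measure the knowledge base provides.

```python
# A minimal sketch: tag-tag semantic correlation matrix from WordNet.
# Requires nltk with the 'wordnet' corpus downloaded.
import numpy as np
from nltk.corpus import wordnet as wn

def tag_similarity(t1, t2):
    """Best path similarity (0-1) over all synset pairs of the two tags."""
    scores = [a.path_similarity(b) or 0.0
              for a in wn.synsets(t1) for b in wn.synsets(t2)]
    return max(scores, default=0.0)

tags = ["elegant", "graceful", "huge", "plain"]  # hypothetical tag set
corr = np.array([[tag_similarity(a, b) for b in tags] for a in tags])
print(np.round(corr, 2))  # rows/columns follow the order of `tags`
```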

Taking each tag's semantic relevance coefficients in the matrix as its coordinates in an N-dimensional space, the Euclidean distance formula (1) can be used to calculate the spatial distance between two tags; the closer the distance, the more similar the tags can be considered.

$$ Euclid(T_{1} ,T_{2} ) = \sqrt {\sum\limits_{k = 1}^{N} {(x_{1k} - x_{2k} )^{2} } } $$
(1)

The two closest clusters are repeatedly merged into a larger one until all small clusters have been merged into a single large cluster. The whole process can be displayed as a tree structure, and any number of semantic groups can be obtained through hierarchical clustering analysis (Fig. 3).

Fig. 3. Hierarchical clustering
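A sketch of this step, reusing the correlation matrix `corr` and tag list `tags` from the previous sketch, with SciPy's agglomerative (Ward) clustering:

```python
# A minimal sketch of hierarchical tag clustering. Each row of the semantic
# correlation matrix serves as that tag's coordinates in N-dimensional space,
# and Ward linkage repeatedly merges the two closest clusters into one tree.
from scipy.cluster.hierarchy import fcluster, linkage

Z = linkage(corr, method="ward", metric="euclidean")

# Cut the tree into any desired number of semantic groups (here 2, arbitrary).
labels = fcluster(Z, t=2, criterion="maxclust")
for tag, c in zip(tags, labels):
    print(tag, "-> cluster", c)
```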

Factor Analysis of Emotional Cognition.

The main purpose of factor analysis is to reduce dimensionality by transforming many indicators into a few comprehensive indicators with little information loss.

As the number of tags increases, redundancy and uniqueness must be considered when matching images in the emotional space. Using factor analysis, an orthogonal emotional space can be constructed that not only retains the majority of the original indicators' meaning but also keeps the model simple.

At the same time, the weight of each emotional dimension is allocated according to the contribution rate of each factor, rather than by artificial judgment, which makes the model more objective and reasonable.

Create a tag-image matrix S = {sim} based on the website's image-tag database (Fig. 4).

Fig. 4. A tag-image matrix

sim is the score of image Ii on tag Tm, determined by the number of times Tm appears on Ii. Since the numbers of tags on different images are of different orders of magnitude, the scores need to be standardized. For an image Ia with tag counts N = {n1, n2, …, nm}, its score sai is

$$ s_{ai} = n_{i} /\max (n_{1} , \ldots ,n_{m} ),\quad i = 1,2, \ldots ,m $$
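In code, this standardization is a per-image (row-wise) division, sketched here with a hypothetical count matrix:

```python
# A minimal sketch of the standardization above: each image's tag counts are
# divided by that image's maximum count, so scores fall in [0, 1] regardless
# of how heavily an image has been tagged.
import numpy as np

# counts[i, m] = number of times tag m was attached to image i (hypothetical)
counts = np.array([[8.0, 2.0, 0.0],
                   [1.0, 5.0, 3.0]])

S = counts / counts.max(axis=1, keepdims=True)  # s_ai = n_i / max(n_i)
print(S)
```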

After factor analysis, the factors F = {F1, F2, …, Fn}, that is, the "emotional dimensions", and their variance contribution rates A = {a1, a2, …, an} are obtained. The emotion Y can be represented by the emotional cognitive factors F, as in (2).

$$ Y = a_{1} F_{1} + a_{2} F_{2} + \ldots + a_{n} F_{n} $$
(2)

In addition, we obtain a factor loading matrix of the tags T on the factors F. The rotated factor loading matrix B = {bmn} is obtained by applying Varimax rotation to the initial loading matrix. This rotation keeps the factors orthogonal to each other while maximizing the variance of the loadings within each factor, which makes the factors easier to interpret.

Quantification models of each emotional dimension can then be obtained:

$$ F_{i} = b_{i1} T_{p1} + b_{i2} T_{p2} + \ldots + b_{im} T_{pm} \quad (i = 1,2, \ldots ,n) $$
(3)

Substituting (3) into (2), we obtain the emotional cognitive quantification model of Y:

$$ Y = c_{1} T_{p1} + c_{2} T_{p2} + \ldots + c_{m} T_{pm} $$
(4)
$$ C = AB $$
(5)
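The whole derivation (2)-(5) can be sketched with scikit-learn's FactorAnalysis, assumed here as a stand-in for the paper's statistical tooling, on a hypothetical standardized score matrix S:

```python
# A minimal sketch of the factor-analysis step: extract varimax-rotated
# factors, take each factor's contribution rate as its weight, and collapse
# the two layers into the single tag-coefficient vector C = A B of (5).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
S = rng.random((150, 16))  # hypothetical: 150 images x 16 emotional elements

fa = FactorAnalysis(n_components=5, rotation="varimax")
fa.fit(S)
B = fa.components_  # rotated loading matrix {b_nm}, shape (factors, tags)

# Contribution rate of each factor, approximated from squared loadings and
# normalized to sum to 1 (a simplification of the variance-explained table).
var = (B ** 2).sum(axis=1)
A = var / var.sum()

C = A @ B  # (5): tag coefficients c_m of the final model (4)
print(np.round(C, 3))
```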

3 Image Emotional Semantic Retrieval System

In an image emotional semantic retrieval system, an important task is to find the most appropriate images for a given tag and the most appropriate tags for a given image; that is, to find the tag-image pairs that best match each other.

The system has three main functional modules:

First, recommend relevant images according to users' search terms.

Second, recommend relevant tags for the images that users agree with.

Third, based on users' feedback on the recommended results, expand the tag-image database.

Given an input tag Ta, the recommended image set IR = {Ir1, Ir2, …, Iri}, and the recommended tag set for an image Tr = {Tr1, Tr2, …, Tri}, the flow of the system is divided into the following steps (Fig. 5).

Fig. 5. Image emotional semantic retrieval system

3.1 Emotional Semantic Modeling

Candidate Tag Selection.

Given an initial tag Ta, the tag set T = {T1, T2, …, Ti} containing all tags associated with it is collected based on the co-occurrence principle: the more images any two tags annotate at the same time, the stronger the cognitive link between them. To avoid tag noise, a threshold needs to be set, usually an empirical value; experimental evidence suggests that setting it to 10 yields the best performance.

Merging these co-occurring tags by synonym, we obtain the candidate tag set Tc = {Tc1, Tc2, …, Tci}.
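A sketch of this selection step, with a hypothetical image-tag store and the empirical threshold of 10:

```python
# A minimal sketch of candidate tag selection by co-occurrence counting.
# `image_tags` (image id -> set of tags) is hypothetical example data.
from collections import Counter

image_tags = {
    1: {"daqi", "elegant", "smooth"},
    2: {"daqi", "elegant"},
    3: {"daqi", "huge"},
}

def cooccurring_tags(seed, threshold=10):
    """Tags co-occurring with `seed` on at least `threshold` images."""
    counts = Counter()
    for tags in image_tags.values():
        if seed in tags:
            counts.update(tags - {seed})
    return [t for t, c in counts.items() if c >= threshold]

candidates = cooccurring_tags("daqi")  # then merge synonyms to obtain Tc
```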

Aggregation of Candidate Tags.

Using an external semantic knowledge database, WordNet (for English) or CSC (for Chinese), establish a semantic association matrix according to the semantic relevance of Tc, as described in Sect. 2.2. The correlation values range from 0 to 1; the higher the value, the higher the relevance.

Then cluster the candidate tags with the clustering algorithm to generate several initial semantic clusters. A representative tag is selected to represent each cluster, forming the emotional element set Tp = {Tp1, Tp2, …, Tpm}.

Emotional Semantic Modeling.

Based on the website's image-tag database, factor analysis is performed on Tp to obtain the emotional dimension set F = {F1, F2, …, Fn} and its variance contribution rates A = {a1, a2, …, an}; the emotional cognitive model of Ta can then be expressed as follows.

$$ Y = a_{1} F_{1} + a_{2} F_{2} + \ldots + a_{n} F_{n} $$
(6)

According to the rotated factor loading matrix B = {bmn}, the quantification model of each factor F can be expressed as follows.

$$ F_{i} = b_{i1} T_{p1} + b_{i2} T_{p2} + \ldots + b_{im} T_{pm} \quad (i = 1,2, \ldots ,n) $$
(7)

Combining the above two expressions, the emotional semantic model of Ta over the tag elements Tp is obtained:

$$ Y = c_{1} T_{p1} + c_{2} T_{p2} + \ldots + c_{m} T_{pm} $$
(8)

3.2 Images Ranking and Tags Selection

Ranking of Images to Recommend.

The value of each image on Tp is smi. The emotional value of each image on tag Ta can be calculated with formula (8), and the recommended images are sorted by emotional value from high to low.
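A sketch of this ranking step, with hypothetical stand-ins for the score matrix S and coefficient vector C from the modeling sketch in Sect. 2.2:

```python
# A minimal sketch of image ranking: apply model (8) to every image's scores
# on the emotional elements Tp and sort by the resulting emotional value.
import numpy as np

rng = np.random.default_rng(0)
S = rng.random((150, 16))  # hypothetical image x element scores (see Sect. 2.2)
C = rng.random(16)         # hypothetical tag coefficients from C = AB

emotional_value = S @ C                  # formula (8) for all images at once
ranking = np.argsort(-emotional_value)   # image indices, highest value first
recommended = ranking[:10]               # top images to recommend
```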

Selection of Tags to Recommend.

For each image, the highest-scoring tag Tp1 belongs to an emotional dimension F that also contains other tags {Tp2, Tp3, …, Tpi}; these tags are recommended according to their weights in formula (7).

Feedback Collection.

According to the user's browsing behavior, the system collects the user's various types of annotation activities, which generate explicit or implicit tags to enrich and expand the image-tag database.

4 Simulation Experiment

4.1 Experiment Setting

"Daqi" (a Chinese aesthetic term) was chosen as the target emotional term, and 25 testers (12 males and 13 females) were invited to participate in the experiment. 150 images from appliances, furniture, transportation, construction, utensils, jewelry, and other fields were selected as stimuli. Testers were asked to grade the correlation between the images and some terms related to "Daqi": non-relevance 0, neutral 0.5, relevance 1. Using the experimental data, the "Daqi" emotional semantic quantification model was obtained, and several validation tests were performed.

4.2 Experiment Process

Construct Emotional Quantification Model.

Through the literature, online comments, websites, etc., more than 140 adjectives that co-occur with "Daqi" were collected and then merged into 45 terms by synonym. A semantic correlation matrix (Fig. 6) of these 45 emotional terms was built based on the researchers' professional experience and cognition.

Fig. 6. A semantic correlation matrix

Through clustering analysis (Ward's method), 16 related emotional terms (the emotional elements), each representing one cluster, were obtained: Quality, Generous, Uniform, Smooth, Solemnly, Full, Rounded, Elegant, Simple, Artless, Pretty, Delicate, Angular, Hard, Huge, Uninhibited.

The 25 testers were asked to score the degree of correlation between the 150 stimulus images and the 16 emotional terms. Factor analysis was then performed on the experimental data (Fig. 7); the KMO measure is 0.789, so the data is suitable for factor analysis.

Fig. 7. Communalities and total variance explained

To ensure a reasonable explanation, we chose the factor combination whose cumulative contribution reaches 83.223%; it contains five factors, and the extraction degree (communality) of each emotional term exceeds 75%. The five-factor expression (9) is obtained as follows (Fig. 8).

$$ Y = 0.42 F_{1} + 0.31 F_{2} + 0.14 F_{3} + 0.07 F_{4} + 0.06 F_{5} $$
(9)
Fig. 8. Component matrix

According to the component score coefficient matrix, the "Daqi" emotional cognition model is obtained as follows.

$$ \begin{aligned} Y = & \, 0.1 T_{1} + 0.12 T_{2} + 0.12 T_{3} + 0.17 T_{4} - 0.04 T_{5} + 0.11 T_{6} + 0.17 T_{7} + 0.15 T_{8} + 0.13 T_{9} + 0.03 T_{10} \\ & \quad + 0.01 T_{11} + 0.03 T_{12} - 0.13 T_{13} - 0.04 T_{14} + 0.01 T_{15} + 0.08 T_{16} \\ \end{aligned} $$
(10)

Extract Emotional Dimensions.

According to the rotated component matrix, five emotional dimensions were determined; the emotional composition of each dimension is as follows (Fig. 9).

  • F1 (Quality, Generous, Delicate, Elegant, Smooth, Uniform)

  • F2 (Angular, Rounded, Hard, Full)

  • F3 (Artless, Simple)

  • F4 (Uninhibited, Huge, Pretty)

  • F5 (Solemnly).

Fig. 9. Rotated component matrix

The correlation matrix between the various emotional components is as follows (Fig. 10).

Fig. 10. Correlation matrix

4.3 Experimental Verification

Verification of Model Calculated Value.

To keep the cognitive context comparable, appliances served as the test category. 10 appliance images (Fig. 11) were scored on their relevance to "Daqi" by 100 volunteers (48 male and 52 female). The average scores were used to rank the images, and the ranking was compared with the theoretical values from model (10) (Fig. 12).

Fig. 11. Appliance images

Fig. 12. Comparison of computed and tested values

There are two deviations in the predicted trend (No. 7 and No. 2), giving a coincidence degree of 80%. The established model can thus basically predict the emotional tendency of images.

Verification of Tag Recommended.

Based on the emotional dimensions and the relevance between terms, relevant tags were recommended, with a maximum of 7 per image. The 25 testers (12 male and 13 female) were asked to choose the related ones among the recommended tags, and the usage rate, the proportion of selected tags among those provided, was calculated.

Taking image No. 10 as an example, the highest-scoring tags are Generous, Rounded, and Simple, in the F1, F2, and F3 dimensions, respectively. Combining the relevance matrix yields the following.

  • Generous → F1 → Quality (0.815), Smooth (0.701), Uniform (0.652), Elegant (0.612), Delicate (0.524)

  • Rounded → F2 → Full (0.512)

  • Simple → F3 → Artless (0.668)

The recommended tags are as follows:

Generous, Rounded, Simple, Quality, Smooth, Artless, Uniform.
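This rule can be sketched as follows; the loading values are copied from the example above, and the helper function itself is hypothetical:

```python
# A minimal sketch of the tag recommendation rule for image No. 10: the top
# tag of each dimension comes first, then related tags ordered by loading.
related = {
    "Generous": [("Quality", 0.815), ("Smooth", 0.701), ("Uniform", 0.652),
                 ("Elegant", 0.612), ("Delicate", 0.524)],
    "Rounded":  [("Full", 0.512)],
    "Simple":   [("Artless", 0.668)],
}

def recommend(top_tags, limit=7):
    """Top tags first, then their related tags sorted by loading weight."""
    rest = sorted(((w, t) for seed in top_tags for t, w in related[seed]),
                  reverse=True)
    return (top_tags + [t for _, t in rest])[:limit]

print(recommend(["Generous", "Rounded", "Simple"]))
# -> ['Generous', 'Rounded', 'Simple', 'Quality', 'Smooth', 'Artless', 'Uniform']
```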

In this way, tags were recommended for the above 10 images; the average usage rates are as follows (Fig. 13).

Fig. 13. Average usage rate of 10 images

The average adoption rate of the recommended tags is 89.6%, so it is reasonable to recommend tags based on the emotional dimensions and the emotional terms' relevance.

5 Conclusion

This paper presents an initial design for an image emotional semantic retrieval mechanism based on a cognitive quantification model. Its core idea is to use the semantic cognitive relevance of tags to decompose a specific emotion into relevant emotional dimensions and to construct the emotional semantic cognition model.

At the same time, based on behavioral psychology, tag generation channels are expanded by including the users' retrieval behaviors that signal recognition, which provides more data for modeling and makes the model more representative.

Since an image needs substantial exposure to accumulate enough data for an accurate model, the proposed emotional semantic modeling is of limited use for cold-start images.

It is foreseeable that the theory of this research can be applied to other social digital resources, such as music or video.