1 Introduction

Over the past decade, the area of personalized search has gained much attention in the literature (Cao et al. 2009; Joachims et al. 2005; Chirita et al. 2007; Liu et al. 2004). Providing a personalized service to web search users significantly helps them in satisfying their everyday information needs. Personalized search systems do not retrieve documents that are just relevant to the query but ones that are also relevant to the user’s interests; thus different users may actually receive different results for the same query. A key feature of a personalized search system is keeping track of the information needs of its users in order to personalize their searches. Therefore, such systems should have a mechanism to learn about their users’ search interests. The recorded search interests can then be used to tailor the users’ future searches according to their inferred needs.

An important concern in personalized search systems is how to store and represent the gathered usage information. Some systems store this information in an individualized user profile (or user model) (Zhang and Koren 2007; Speretta and Gauch 2005), while other systems maintain an aggregate view of usage information (Agichtein et al. 2006; Smyth and Balfe 2006). Several techniques and data structures can be used to represent user and usage information. They can be broadly classified into two groups: vector-based models and semantic network-based models. A vector-based model (Shen et al. 2005) is made up of one or more feature vectors, each of which is a vector of terms and associated weights. User profiles can also be represented using a semantic network structure (Chirita et al. 2007). In this case the profile is made up of nodes and associated nodes that capture terms and their semantically-related/co-occurring terms respectively. Weights can be assigned to the nodes, their associated nodes, and the links between them. The advantage of this model over a vector-based model is that it can model the relationship between a term and its associated terms.

Personalization in search systems can be achieved by query adaptation, result adaptation, or both. In the result adaptation scenario, search result lists are often re-ranked and/or filtered by incorporating users’ interests accordingly (Wang and Jin 2010; Xu et al. 2008). On the other hand, query adaptation attempts to expand (augment) the terms of the user’s query with other terms, with the aim of retrieving more relevant results (Chirita et al. 2007; Bertier et al. 2009). Several techniques are used for obtaining terms for query expansion, including relevance feedback-based techniques (Biancalana and Micarelli 2009), co-occurrence-based techniques (Bertier et al. 2009; Chirita et al. 2007), thesaurus-based techniques (Voorhees 1994) among others. In terms of personalized query expansion, additional terms often come from individual user profiles to assist the user in formulating a better query.

Recent years have also witnessed the explosive growth of information on the World Wide Web (WWW). Social media systems have proved a powerful tool to encourage end-user participation in the WWW, for the purpose of categorizing and distributing content, sharing opinions and maintaining relationships. In social tagging systems such as del.icio.us, LibraryThing, Flickr, etc., users are able to annotate each web resource with any number of free-form tags of their own choice. Such systems have become extremely popular over the past few years. This type of system also provides an ideal test bed for personalized search. A user profile can be easily derived from their feedback, providing a good indication of the user’s interests. Moreover, this kind of information is solely based on the user’s explicit, public social activities, which means the profile can be safely utilized without disrespecting or compromising the user’s privacy.

Social tags produced by users are usually regarded as high quality descriptors of the web pages’ topics and a good indicator of web users’ interests. Despite this fact, this uncontrolled manner of tagging results in the use of an unrestricted vocabulary. This places a structural barrier between the users and the swathes of globally available information, making searching through the collection difficult and generally less accurate. In current social media systems, search algorithms also tend to be rather simplistic in nature, relying on term matching methods, which often fail to deal with the vocabulary mismatch problem and result in poor ranking results.

To overcome this problem, researchers have attempted to use result re-ranking approaches (Xu et al. 2008; Carmel et al. 2009; Wang and Jin 2010). However, if relevant items could not be fetched in the first place, regardless of the complex re-ranking process, the results still tend to be unsatisfactory.

As previously mentioned, query expansion can partially solve the above mentioned problem. A classic technique is pseudo relevance feedback (PRF) or local analysis (Baeza-Yates and Ribeiro-Neto 1999), in which expansion terms are automatically extracted from the top-ranked documents and added to the source query, which is then optionally re-weighted (Rocchio 1971). This approach has been previously proven to work well. However, in the context of personalized search, the selected terms may be different from the users’ true interests, so that the retrieved documents may not be relevant to a particular user. There have been few attempts at selecting the appropriate expansion terms from a user profile (Biancalana and Micarelli 2009; Bender et al. 2008; Bertier et al. 2009; Chirita et al. 2007). This profile is normally mined from the annotations and content that the user has produced. Past research appears to favor tag–tag relationships, by selecting the most related tags from a user’s profile to enhance the source query. Given the fact that the tags might not be the precise descriptions of resources, the resulting retrieval performance has been markedly low (Bender et al. 2008; Bertier et al. 2009). Borrowing from the traditional information retrieval (IR) field, local analysis and co-occurrence based user profile representation have been adopted to expand the query according to a user’s interaction with the system (Biancalana and Micarelli 2009; Chirita et al. 2007). However, in this case the selection of expansion terms is solely based on lexical matching between the query and the terms which exist in the user profile. If the terms are not found in the user’s profile, the query cannot be expanded at all. Furthermore, in local analysis with co-occurrence representation, terms are considered of equal importance when added to the user profile. This may lead to improvements with a global effect rather than on a personalized level.

This paper is concerned with two areas of information retrieval: search personalization via individual user profiles and automatic query expansion via a new expansion framework which leverages content from social media applications. Unlike previous approaches to personalized query expansion, which are solely based on lexical matching between the query terms and the terms which exist in the user profile, the method proposed in this paper can expand the query even if the query terms are not matched with terms in the user’s profile. This is achieved by incorporating pseudo-relevance feedback information obtained from top-ranked documents. This is of particular importance to web search tailored using social media data, because term use tends to be very inconsistent between different users, especially when compared with terms appearing in web documents (Golder and Huberman 2005).

The query expansion framework proposed in this paper is based on individual user profiles. In the user profile, terms are modeled according to their relationships, which can be defined by co-occurrence statistics (used as a baseline) or defined by a Tag-Topic model introduced below. Each term in the user profile will have an associated weighting score calculated based on its relationship with other terms in the profile and terms extracted from top-ranked documents. After calculation, terms with the highest scores will be chosen to expand the original query. The underlying theory is to regularize the smoothness of word associations over a connected graph using a regularizer function on terms extracted from top-ranked documents. The intuition behind the model is the prior assumption of term consistency: the most appropriate expansion terms for a query are likely to be associated with, and influenced by, terms extracted from the documents ranked highly for the initial query. In other words, the selection of expansion terms for a given query is not solely based on lexical matching, but also on context enhancement and weight propagation. If the neighbors of a term in a connected graph are good expansion candidates for a query, then this term is also highly likely to be a good candidate for the query. In addition, the refined weighting scores should be at least somewhat relevant to the enhanced context (i.e. the top-ranked documents retrieved by the initial query, which are assumed to be relevant), which, in our framework, are constrained by a regularizer on the top-ranked documents.

In summary, the motivation to develop this expansion framework is twofold. Firstly, to incorporate pseudo-relevance feedback information obtained from top-ranked documents in the word association graph to expand the initial query. Secondly, traditional personalized query expansion can only work when query terms are found within the user profile, but the framework proposed here will work even when direct lexical matching fails. This fact is particularly important when enhancing search using social media data.

Due to the nature of social tagging systems, tags, web documents and terms extracted from documents are associated in a complex fashion. The relationship between tags and documents can be represented by a tag-document bipartite graph, and the content of documents can be represented by a document-term bipartite graph. Figure 1 shows a sample of these multiple bipartite graphs. In this paper, we propose to simultaneously incorporate the user’s annotations and the content information through a statistical model in a latent space graph. This is achieved by adopting the Author-Topic model introduced by Steyvers et al. (2004), and proposing a Tag-Topic model to learn topic-term and Tag-Topic distributions from the annotation data in an unsupervised manner. A latent graph is then built based upon the features derived between important terms and tags. The advantage of this approach is that it simultaneously incorporates web page content and annotations, which provides rich information for better performance.

Fig. 1 Multiple bipartite graphs between tags, documents and terms

To illustrate the effectiveness of the proposed methodology, we follow previous work on using social data for evaluating personalized search (Xu et al. 2008; Carmel et al. 2009; Wang and Jin 2010) by using data crawled from a large social tagging system. Over two hundred users, distributed worldwide, who are active on the system have been tested. The experimental results suggest that the proposed personalized query expansion method can produce better results than the classical non-personalized search approach and other query expansion methods. The results also demonstrate the effectiveness of modeling content and annotations in the latent graph.

The main contribution of this paper is to propose a regularization framework for query expansion, which aims to produce more accurate personalized retrieval results. The key to expanding the query is the global term consistency over the word graph, which leverages the top-ranked documents retrieved by the query. Another contribution is that the framework also simultaneously incorporates the annotations and web documents through a Tag-Topic model in a latent graph.

The rest of this paper is organized as follows. Related work on query expansion, personalized search, topic models and semi-supervised learning is briefly summarized in Sect. 2. Section 3 describes the regularization framework for query expansion. Section 4 presents details of how to build the user profiles through a Tag-Topic model. In Sects. 5 and 6 a report is provided on a series of experiments performed over data crawled from a large social tagging system. This report includes details of the results obtained. Finally, Sect. 7 concludes the paper and proposes some future work.

2 Related work

2.1 Query expansion

Manual query expansion has been studied in early IR systems (Harter 1986). This approach demands user intervention and requires the user to be familiar with the search system, which is generally not true for the modern web. For these reasons, the overwhelming majority of search systems in existence today function via automatic query expansion. A number of different automatic query expansion techniques have been described in the IR literature. One common technique employs a machine readable thesaurus to locate expansion terms in lists of synonyms (Voorhees 1994). Other approaches extract expansion terms from large collections of documents (Qiu and Frei 1993). Local analysis involving relevance feedback is another popular category of approach. Explicit feedback is often difficult to obtain because users are usually reluctant to provide such information. An alternative method is implicit relevance feedback through PRF (Baeza-Yates and Ribeiro-Neto 1999), in which expansion terms are automatically extracted from the top-ranked documents and added to the source query, which is then optionally re-weighted (Rocchio 1971). This approach has been previously shown to work well. However, in the context of personalized search, the selected terms may be different from the users’ true interests, so that the retrieved documents may not be relevant to a particular user. In contrast, our system ensures that only feedback terms that are relevant to the user’s needs are used for query expansion. Web query logs are also used by researchers to bridge the gap between the user-centric query space and author-centric web page space (Cui et al. 2002, 2003). However, in practice, acquiring web query logs is difficult for most researchers due to the various concerns of search companies. Co-occurrence based techniques are also highly attractive, and function by analyzing entire documents (Qiu and Frei 1993), lexical affinity relationships (Carmel et al. 2002), etc. In social tagging systems, these relationships tend to be more complex than in web search and hence require more sophisticated modeling methods.

2.2 Personalized search

Personalized web search has been extensively studied. There are approaches that utilize query log and click-through analysis (Cao et al. 2009; Joachims et al. 2005). There are systems that explore desktop data and external resources (Chirita et al. 2007; Liu et al. 2004). There are also techniques that focus on the user task and activity context (Dou et al. 2007; Luxenburger et al. 2008). Personalization has also been applied to commercial search engines (Haveliwala et al. 2003).

None of the above work exploits information from social media systems to perform personalized search. In personalized search in social media, the search process is performed over “social” data gathered from Web 2.0 applications such as social bookmarking systems, wikis, blogs, forums and social network sites. Personalization usually involves two general approaches. The first approach runs the unmodified original query for all users but re-ranks the returned results based on an individual user profile. In (Xu et al. 2008) the authors developed a personalization approach to learn about users’ interests from their bookmarks and tags, then re-rank the results according to the topic relevance of documents and users’ interests. Carmel et al. (2009) investigated personalized social search based on the user’s social relations. Search results are re-ranked according to their relation to individuals in the user’s social network. Wang and Jin (2010) explored gathering data from multiple online social systems for adaptive search personalization. They created an interest profile for each user by integrating different streams of social information and then re-ranked the results through a linear combination of different computed scores. Though this group of work is attractive, if relevant items cannot be fetched in the first place, regardless of the complex re-ranking process, the results still tend to be unsatisfactory.

Another group of work modifies or augments a user’s original query; this approach is termed query expansion. Researchers have frequently used co-occurring tags to enhance the source query (Bender et al. 2008; Bertier et al. 2009). However, given the fact that the tags might not be precise descriptions of the content, retrieval performance is notably low. Borrowing from the traditional IR field, local analysis has been adopted to expand the query according to a user’s interaction with the system (Biancalana and Micarelli 2009; Chirita et al. 2007). In Biancalana and Micarelli’s system, the authors used a three-dimensional correlation matrix to build the user profile. Each term of the matrix is linked to an intermediate level extracted from an external resource. The related terms are extracted according to their relevance to the query. However, the selection of expansion terms is solely based on lexical matching between the query and the terms existing in the user profile. If a term is not found, the query cannot be expanded at all. Furthermore, in local analysis, terms are considered of equal importance when added to the user profile. This may lead to improvements with a global effect rather than on a personalized level. In contrast, the user profile used in the framework proposed by this paper only selects important terms extracted from the annotated content to avoid this problem. The system developed by Biancalana and Micarelli also relies on external categorization, which creates an extra burden and adds uncertainty to the process.

2.3 Topic models

To learn a latent space graph for use in the framework, this work is also related to the family of topic models. Latent Dirichlet allocation (LDA), after it was first introduced in (Blei et al. 2003), has quickly become one of the most popular probabilistic text modeling techniques and has inspired much research. There are many extensions to this model, most notably the Author-Topic model (Steyvers et al. 2004), which extracts information about authors and topics from text collections. This model has been adopted in the approach described in this paper, and a Tag-Topic model is proposed for building user profiles, from which a latent graph can be formed. Some researchers have also employed a modified version of LDA in social networks. Zhou et al. (2008) proposed a computationally tractable hierarchical Bayesian network method for modeling social annotations, together with language modeling for personalized ranking. In (Harvey et al. 2011) the authors proposed several hidden topic models to provide more accurate resource ranking. Unlike the modeling processes in their papers, which were deployed on the whole corpus, the model described in this paper is performed on an individual level to ensure a more concrete user profile for query expansion.

2.4 Semi-supervised learning

The framework described in this paper is influenced by existing work on machine learning, especially graph-based semi-supervised learning (Zhou et al. 2004; Zhu et al. 2003; Wang et al. 2008). The regularization framework proposed is closely related to label smoothness over the graph. However, the work here is different as the tasks are performed in a different setting. Their tasks are mainly used at the document level such as classification and clustering, while the methods proposed in this paper are focused on the word level and in a query-document dependent setting.

3 Personalized query expansion framework

In this section, the problem addressed by this paper is defined. The personalized query expansion framework is also described in more detail. This method builds upon individual user profiles in which terms are mined from both the annotations a user has made and the resources the user has marked. Each term in the user profile will have an associated weighting score calculated based on its relationship with other terms in the profile and terms extracted from the top-ranked documents. After calculation, the terms with the highest scores will be chosen to expand the original query. The underlying theory is to regularize the smoothness of word associations over a connected graph using a regularizer function on terms extracted from top-ranked documents. In Table 1, we list all the symbols we will use in the algorithms.

Table 1 Notation used in the personalized query expansion framework and the Tag-Topic model

3.1 Problem definition

In social tagging systems such as del.icio.us, users can label interesting web pages with primarily short and unstructured annotations in natural language called tags. These web pages are denoted as a link to a URL in the del.icio.us website. Textual content is crawled by following a URL link that refers to a document or web document. Multimedia content is excluded in this research. In response to a query, an initial set of the most relevant documents is fetched. We assume that the top c ranked documents are relevant, and refer to them as the top-ranked documents. A term refers to a word in the vocabulary; the two are used interchangeably. Terms extracted from documents are specifically called docTerms, to distinguish them from the general “terms” used in user profiles and from tags.

Formally, social tagging data can be represented by a tuple \( \mathcal{P}: = (\mathcal{U}, \mathcal{D}, \mathcal{T}, \mathcal{A}) \), where \( \mathcal{U}, \mathcal{D}, \mathcal{T} \) are finite sets of users, web documents and tags, and \( \mathcal{A} \subseteq \mathcal{U} \times \mathcal{D} \times \mathcal{T} \) is a ternary relation, whose elements are called tag assignments or annotations. The set of annotations of a user u is defined as: \( \mathcal{A}_{u} := \{ (t, d)|(u, d, t) \in \mathcal{A}\} \). The tag vocabulary of a user is given as \( \mathcal{T}_{u} := \{ t|(t, d) \in \mathcal{A}_{u} \} \). The user’s set of documents is \( \mathcal{D}_{u} := \{ d|(t, d) \in \mathcal{A}_{u} \} \). We further define the docTerm vocabulary of a user to be \( docTerm_{u} := \{ w|w \in d, d \in \mathcal{D}_{u} \} \), where w denotes the words in the document corpus. \( \mathcal{T}_{u} \) is the full list of tags that the user has used, and \( docTerm_{u} \) is the vocabulary extracted from the documents that the user has tagged. Thus, terms in a user profile can be chosen from \( \mathcal{T}_{u} \), \( docTerm_{u} \), or \( \mathcal{T}_{u} \mathop \cup \nolimits docTerm_{u} \).
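As a rough illustration of these definitions, the per-user sets can be derived from a list of tag assignments. The following is only a sketch: the toy data, the `doc_words` mapping and the function name `build_user_sets` are hypothetical and are not part of the original system.

```python
# Hypothetical toy data: (user, document, tag) triples, i.e. elements of A.
annotations = [
    ("u1", "d1", "python"),
    ("u1", "d1", "tutorial"),
    ("u1", "d2", "python"),
    ("u2", "d2", "search"),
]
# Hypothetical tokenized content of each crawled document.
doc_words = {
    "d1": ["learn", "python", "basics"],
    "d2": ["python", "search", "engine"],
}


def build_user_sets(annotations, doc_words, user):
    """Return (A_u, T_u, D_u, docTerm_u) for one user."""
    a_u = {(t, d) for (u, d, t) in annotations if u == user}
    t_u = {t for (t, _) in a_u}          # tag vocabulary of the user
    d_u = {d for (_, d) in a_u}          # documents the user has tagged
    doc_term_u = {w for d in d_u for w in doc_words.get(d, [])}
    return a_u, t_u, d_u, doc_term_u


a_u, t_u, d_u, doc_term_u = build_user_sets(annotations, doc_words, "u1")
```

A profile vocabulary can then be taken as `t_u`, `doc_term_u`, or their union, mirroring the three options described above.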

Given a source query q, a set of terms/words in the user profile \( \{ w_{1} , w_{2} \ldots w_{n} \} \), and a set of initial top-ranked documents \( \mathcal{D}^{top} = \{ d_{1} , d_{2} \ldots d_{c} \} \), the goal is to return a ranked list of profile terms to be added to the query, regularized by terms extracted from the top-ranked documents.

Let G = (V, E) be a connected graph, where the nodes V correspond to the n words in the user profile and the edges E correspond to the association strengths between words. We further assume that an n × n symmetric weight matrix A on the edges of the graph is given, where \( a_{ij} \) denotes the weight between words \( w_{i} \) and \( w_{j} \), and that M is a diagonal matrix with entries \( M_{ii} = \sum\nolimits_{j} {a_{ij} } \). We also define an n × c matrix F with \( F_{ij} = f(w_{i}, d_{j}) \) if word \( w_{i} \) is present in document \( d_{j} \) and \( F_{ij} = 0 \) otherwise, where f(w, d) denotes the weighting of w in d.
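The two matrices defined above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names are hypothetical, documents are assumed to be token lists, and the tf-idf variant (raw term count times log(1 + c/df)) is our own choice, since the text only says that a tf-idf weighting is used.

```python
import math


def degree_diagonal(A):
    """Diagonal entries M_ii = sum_j a_ij of the symmetric weight matrix A."""
    return [sum(row) for row in A]


def initial_weights(profile_terms, top_docs):
    """Build the n x c matrix F0 of term weights over the top-ranked documents.

    Assumption: tf is the raw count and idf = log(1 + c / df).
    Entries are 0 for words that do not occur in a document.
    """
    c = len(top_docs)
    F0 = []
    for w in profile_terms:
        df = sum(1 for doc in top_docs if w in doc)
        row = []
        for doc in top_docs:
            tf = doc.count(w)
            # tf > 0 implies df > 0, so the division is safe.
            row.append(tf * math.log(1 + c / df) if tf > 0 else 0.0)
        F0.append(row)
    return F0


# Hypothetical toy inputs.
A = [[0, 1, 0.5], [1, 0, 0.2], [0.5, 0.2, 0]]
profile_terms = ["python", "search", "cooking"]
top_docs = [["python", "search", "python"], ["search", "engine"]]

M_diag = degree_diagonal(A)
F0 = initial_weights(profile_terms, top_docs)
```

Note that a profile word absent from every top-ranked document (here "cooking") starts with an all-zero row in `F0`.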

3.2 Expansion framework

Inspired by the semi-supervised learning methods, here we develop a personalized query expansion framework. Formally, the cost function \( \Re (F,w,G) \) in a joint regularization framework similar to (Zhou et al. 2004) is defined as:

$$ \Re (F,w,G) = \frac{1}{2}\left( {\mathop \sum \limits_{i,j = 1}^{n} a_{ij} \left\| {\frac{{f(w,d_{i} )}}{{\sqrt {M_{ii} } }} - \frac{{f(w,d_{j} )}}{{\sqrt {M_{jj} } }}} \right\|^{2} + \mu \mathop \sum \limits_{i = 1}^{n} \left\| {f(w,d_{i} ) - f^{0} (w,d_{i} )} \right\|^{2} } \right) $$

where μ > 0 is the regularization parameter and \( f^{0} (w,d_{i} ) \) is the initial weighting of word w in the document \( d_{i} \) (tf-idf weighting is used in the current paper; Jones 1988). Let F and \( F^{0} \) be the refined weighting matrix and the initial weighting matrix, respectively.

The first term of the right-hand side in the cost function is the global consistency constraint, which means that a good weighting function should not change too much between nearby points. In this paper, nearby points are refined weighting scores with respect to initial relationships between words and context information (top-ranked documents) obtained by the initial query. They are likely to have the same effect over the graph. The second term is the fitting constraint, which means the weighting of words should fit the weighting scores of words extracted from the top-ranked documents retrieved by the given query. The trade-off between the two constraints is controlled by the parameter μ.

Then the final weighting function is

$$ F^{*} = \arg \mathop {\min }\limits_{{F \in \mathbb{R}^{ + n} }} \Re (F,w,G). $$

After simplifying, a closed form solution can be derived as (see also Zhou et al. 2004; Zhu et al. 2003):

$$ F^{*} = \mu_{2} (I - \mu_{1} S)^{ - 1} F^{0} $$

where \( \mu_{1} = \frac{1}{1 + \mu },\quad \mu_{2} = \frac{\mu }{1 + \mu },\quad S = M^{{ - \frac{1}{2}}} AM^{{ - \frac{1}{2}}} \), and I is an identity matrix. Note that S is the symmetrically normalized weight matrix of the graph, so that I − S is the normalized graph Laplacian. We will introduce the calculation of A and S in Sect. 4 when we discuss the user profile construction process. Given the refined weighting matrix F, the final weighting scores for each word w could be computed as \( w = \mathop \sum \limits_{i = 1}^{c} f(w,d_{i} ) \), from which we can acquire the top γ words from the final ranked list of profile words to be added to the query. It is worth noting that \( \mu_{2} \) could be eliminated as it does not change the ranking.
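The closed-form solution can be illustrated with a small pure-Python sketch. It assumes the symmetric normalization \( S = M^{-1/2} A M^{-1/2} \) used by Zhou et al. (2004), and instead of inverting the matrix it runs the iteration \( F \leftarrow \mu_{1} S F + \mu_{2} F^{0} \), whose fixed point is \( \mu_{2} (I - \mu_{1} S)^{-1} F^{0} \). All function names and the toy data are hypothetical.

```python
import math


def normalize_affinity(A):
    """S = M^{-1/2} A M^{-1/2}, with M_ii = sum_j a_ij (graph assumed connected)."""
    deg = [sum(row) for row in A]
    n = len(A)
    return [[A[i][j] / math.sqrt(deg[i] * deg[j]) if A[i][j] else 0.0
             for j in range(n)] for i in range(n)]


def refine_weights(S, F0, mu, iterations=200):
    """Iterate F <- mu1 * S F + mu2 * F0 until (approximately) converged."""
    mu1, mu2 = 1.0 / (1.0 + mu), mu / (1.0 + mu)
    n, c = len(F0), len(F0[0])
    F = [row[:] for row in F0]
    for _ in range(iterations):
        F = [[mu1 * sum(S[i][k] * F[k][j] for k in range(n)) + mu2 * F0[i][j]
              for j in range(c)] for i in range(n)]
    return F


def top_expansion_terms(terms, F, gamma):
    """Score each word by summing its row of F and keep the gamma best."""
    scores = {w: sum(row) for w, row in zip(terms, F)}
    return sorted(scores, key=scores.get, reverse=True)[:gamma]


# Toy profile: only "python" occurs in the two top-ranked documents.
terms = ["python", "programming", "cooking"]
A = [[0, 1, 0], [1, 0, 0.1], [0, 0.1, 0]]
F0 = [[1.0, 0.5], [0.0, 0.0], [0.0, 0.0]]

F_star = refine_weights(normalize_affinity(A), F0, mu=0.5)
expansion = top_expansion_terms(terms, F_star, gamma=2)
```

In this toy run "programming" never appears in the top-ranked documents, yet it receives a positive refined weight through its strong edge to "python", which is the propagation effect the framework relies on when direct lexical matching fails.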

An important feature of such computation is that the weightings calculated here share similarities with the entries obtained in the F * in Zhou et al.’s paper (Zhou et al. 2004), where they try to find the largest entry to get the corresponding label while we are trying to find large added weights to select terms. The actual values of the weights are not critical as long as they have the discriminative power to separate high potential words from low potential words. They are not used in the later computation. Section 6.2 compares this framework with the method of directly matching terms inside the user profile to illustrate the difference. This concludes our discussion of the proposed framework.

4 User profile construction

In order to capture accurate information for the construction of the user profile, in this paper the tags and web documents are modeled simultaneously. In addition to Fig. 1, an example page with sample tags and a document linked to the marked URL is given in Fig. 2. The lexical processing of words inside the document will be detailed in the next section.

Fig. 2 An illustrative example of a social tagging page with one correspondent web document

In the current paper, an Author-Topic model, introduced by Steyvers et al. (2004), was adopted and a Tag-Topic model was proposed in order to learn topic-word and Tag-Topic distributions from the annotation data in an unsupervised manner. Then a latent graph is built based upon the features derived. This can be achieved through important docTerms, tags or a mixture of both. The advantage of this approach is that it simultaneously incorporates web page content and annotations, which provides rich information for improved performance.

4.1 Tag-Topic modeling for social tagging data

The original Author-Topic model, introduced by Steyvers et al. (2004), reduces the process of generating a document to a simple series of probabilistic steps. The model not only discovers the topics that are expressed in a document, but also which authors are associated with each topic. To run this model on the social tagging data at an individual user level, we can view the tags as authors in the new proposed model. When generating a document, a tag is chosen at random for each individual word in the document. This tag picks a topic from its multinomial distribution over topics, and then samples a word from the multinomial distribution over words associated with that topic. This process is repeated for all words in the document. It is summarized in Table 2, and the corresponding graphical model is shown in Fig. 3, where φ and θ are the topic-word distributions and Tag-Topic distributions respectively, and α and β are Dirichlet priors. For each word, the topic and tag assignment are sampled from:

$$ p(z_{i} = j,x_{i} = k|w_{i} = m,\mathbf{z}_{\neg i} ,\mathbf{x}_{\neg i} ) \propto \frac{{C_{mj}^{{\mathcal{W}\mathcal{O}}} + \beta }}{{\mathop \sum \nolimits_{m'} C_{m'j}^{{\mathcal{W}\mathcal{O}}} + L\beta }}\frac{{C_{kj}^{{\mathcal{T}\mathcal{O}}} + \alpha }}{{\mathop \sum \nolimits_{j'} C_{kj'}^{{\mathcal{T}\mathcal{O}}} + \mathcal{O}\alpha }} $$

where \( z_{i} = j \) and \( x_{i} = k \) represent the assignments of the ith word in a document to topic j and tag k, respectively, and \( w_{i} = m \) represents the observation that the ith word is the mth word in the lexicon. \( \mathbf{z}_{\neg i} ,\mathbf{x}_{\neg i} \) represents all topic and tag assignments not including the ith word. Furthermore, \( C_{mj}^{{\mathcal{W}\mathcal{O}}} \) is the number of times word m is assigned to topic j, not including the current instance, \( C_{kj}^{{\mathcal{T}\mathcal{O}}} \) is the number of times tag k is assigned to topic j, not including the current instance, L is the size of the lexicon, and \( \mathcal{O} \) is the number of topics.
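One sampling step of this conditional distribution might be sketched as follows. The function name, argument layout and toy counts are assumptions made for illustration; the caller is expected to have already removed the current token from both count matrices, and to add it back under the newly sampled assignment afterwards.

```python
import random


def sample_topic_and_tag(m, doc_tags, C_WO, C_TO, alpha, beta, rng=random):
    """Sample a (topic j, tag k) pair for word index m.

    C_WO[m][j]: times word m is assigned to topic j (current token excluded).
    C_TO[k][j]: times tag k is assigned to topic j (current token excluded).
    doc_tags:   indices of the tags attached to the current document.
    """
    L = len(C_WO)              # lexicon size
    n_topics = len(C_WO[0])    # number of topics
    topic_totals = [sum(C_WO[w][j] for w in range(L)) for j in range(n_topics)]
    pairs, weights = [], []
    for k in doc_tags:
        tag_total = sum(C_TO[k])
        for j in range(n_topics):
            word_part = (C_WO[m][j] + beta) / (topic_totals[j] + L * beta)
            tag_part = (C_TO[k][j] + alpha) / (tag_total + n_topics * alpha)
            pairs.append((j, k))
            weights.append(word_part * tag_part)
    # random.choices normalizes the weights, so the proportionality suffices.
    return rng.choices(pairs, weights=weights, k=1)[0]


# Hypothetical toy counts: 2 words, 2 topics, 3 tags.
C_WO = [[5, 0], [0, 5]]
C_TO = [[5, 0], [0, 5], [1, 1]]
topic, tag = sample_topic_and_tag(0, [0, 2], C_WO, C_TO, alpha=0.1, beta=0.01,
                                  rng=random.Random(0))
```

Restricting the candidate tags to `doc_tags` reflects the fact that a word can only be attributed to a tag the user actually assigned to that document.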

Fig. 3 Graphical representation for the Tag-Topic model in social tagging data

Table 2 Generative process of the Tag-Topic model in social tagging data

From the count matrices obtained during the process, θ and φ can be easily estimated as:

$$ \varphi_{mj} = \frac{{C_{mj}^{{\mathcal{W}\mathcal{O}}} + \beta }}{{\mathop \sum \nolimits_{{m^{\prime } }} C_{{m^{\prime } j}}^{{\mathcal{W}\mathcal{O}}} + L\beta }} $$
$$ \theta_{kj} = \frac{{C_{kj}^{{\mathcal{T}\mathcal{O}}} + \alpha }}{{\mathop \sum \nolimits_{{j^{\prime } }} C_{{kj^{\prime } }}^{{\mathcal{T}\mathcal{O}}} + \mathcal{O}\alpha }} $$

where \( \varphi_{mj} \) is the probability of using word m in topic j, and \( \theta_{kj} \) is the probability of using topic j for tag k. The algorithm first assigns words to random topics and tags (from the set of tags annotated to the document), and then repeats the Gibbs sampling process for several iterations to update the topic and tag assignments.
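The two estimates can be computed directly from the count matrices. The sketch below uses hypothetical names and toy counts; it follows the formulas above, so each topic's column of φ and each tag's row of θ sums to one.

```python
def estimate_distributions(C_WO, C_TO, alpha, beta):
    """Return (phi, theta) from the word-topic and tag-topic count matrices.

    phi[m][j]:   smoothed probability of word m under topic j.
    theta[k][j]: smoothed probability of topic j for tag k.
    """
    L = len(C_WO)
    n_topics = len(C_WO[0])
    topic_totals = [sum(C_WO[m][j] for m in range(L)) for j in range(n_topics)]
    phi = [[(C_WO[m][j] + beta) / (topic_totals[j] + L * beta)
            for j in range(n_topics)] for m in range(L)]
    theta = [[(C_TO[k][j] + alpha) / (sum(C_TO[k]) + n_topics * alpha)
              for j in range(n_topics)] for k in range(len(C_TO))]
    return phi, theta


# Hypothetical toy counts: 3 words, 2 topics, 2 tags.
C_WO = [[3, 0], [1, 2], [0, 4]]
C_TO = [[4, 1], [0, 5]]
phi, theta = estimate_distributions(C_WO, C_TO, alpha=0.5, beta=0.01)
```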

The model proposed here is almost identical to the original Author-Topic model except that authors are now replaced by tags. This can be achieved because the mixture weights for different topics are now determined by the tags that a user has assigned to the document, in conjunction with the document corpus. It reflects the user’s view of the document, and different documents that are assigned the same tag will be correctly reflected in the user profile, to ensure more accurate relationships between the terms stored. By learning the parameters of the model, we obtain the set of topics that appear in a corpus and their relevance to different documents, and identify which topics are used by which tag. Hence, the behavior and interests of a user can be identified.

4.2 Build the latent graph

After the topic-word and tag-topic distributions have been obtained, the adjacency graph of word associations and tag associations can be constructed. This requires computing the similarity between tags and/or docTerms. Taking tags as an example, the distance between tags \( t_{i} \) and \( t_{j} \) is defined as the symmetric KL divergence between the topic distributions conditioned on each of the tags:

$$ symKL(t_{i} ,t_{j} ) = \mathop \sum \limits_{o = 1}^{\mathcal{O}} \left[ {\theta_{io} \log \frac{{\theta_{io} }}{{\theta_{jo} }} + \theta_{jo} \log \frac{{\theta_{jo} }}{{\theta_{io} }}} \right]. $$

Associations between words can be computed similarly. The distance between tags and words can also be calculated by using the corresponding topic entries in θ and φ. However, in experiments, this method did not achieve good results compared to using either words or tags alone. This is understandable because the multinomial distributions over words and over tags may carry different meanings. So for the mixture of both approaches, we simply use two latent graphs, set their parameters separately, fit each into the query expansion framework, and then combine the words obtained from both.
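A minimal Python sketch of the symKL distance between two tags is given below. It assumes that each tag is represented by its row of θ stored as a plain list, and that all entries are strictly positive (which holds for the smoothed estimates when α > 0); the function name is hypothetical.

```python
import math


def sym_kl(theta_i, theta_j):
    """Symmetric KL divergence between two topic distributions.

    theta_i and theta_j are the topic distributions conditioned on tags
    t_i and t_j; every entry must be > 0 so that the logarithms are defined.
    """
    if len(theta_i) != len(theta_j):
        raise ValueError("distributions must cover the same topics")
    return sum(p * math.log(p / q) + q * math.log(q / p)
               for p, q in zip(theta_i, theta_j))
```

The value is zero for identical distributions and grows as the two tags are associated with increasingly different topics.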

A latent graph G is therefore defined using the latent features obtained from the Tag-Topic model, where the nodes denote the terms and the edges E are weighted by symKL. After normalization, matrix \( S = M^{{ - \frac{1}{2}}} AM^{\frac{1}{2}} \) can be calculated. This process is executed offline, and matrix S is then saved for the query expansion model. Since terms extracted from the documents are not equally important, we only keep the top δ terms from each topic to form the graph. Tags are usually regarded as high quality descriptors of the web pages’ topics and a good indicator of web users’ interests, so we keep all of them in the user profile. To compare the use of docTerms extracted from documents and tags assigned to the documents for query expansion, three sets of user profiles have been defined: selected docTerms, tags, and a mixture of both. In the next section, we also compare the performance of overall query expansion using this graph and using a co-occurrence based graph. It is also worth noting that, according to Wang and Zhang (2006), the construction of the affinity matrix is very important and can greatly affect the effectiveness of the framework. There is therefore no guarantee that the latent graph used in this paper is an optimal affinity matrix; it is one possible choice. The evaluation described in the following sections demonstrates that this approach works better than various baselines. Further improvements may be obtained if the matrix is optimized, which is noted as important future work.
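The selection of the top δ terms per topic mentioned above can be sketched as follows. This is only an illustration: the function name, the nested-list layout of φ (one row per word, one column per topic) and the separate vocabulary list are assumptions of the sketch.

```python
def top_terms_per_topic(phi, vocabulary, delta):
    """Return the terms that rank among the top `delta` words of any topic.

    phi[m][j] is the probability of word m in topic j and vocabulary[m] is
    the corresponding term (both layouts are hypothetical).
    """
    n_topics = len(phi[0])
    kept = set()
    for j in range(n_topics):
        # Rank word indices by their probability under topic j, highest first.
        ranked = sorted(range(len(vocabulary)),
                        key=lambda m: phi[m][j], reverse=True)
        kept.update(vocabulary[m] for m in ranked[:delta])
    return kept
```

Only the returned terms would become docTerm nodes of the graph; tags are kept in full, as described above.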

5 Evaluation

In the following section a series of experiments are described which have been designed to evaluate the query expansion framework described above. This evaluation focuses on the following thematically related questions:

  1. Is the proposed personalized query expansion model an improvement over classical non-personalized and personalized query expansion techniques that utilize social media data?
  2. Will the user profiles built upon the Tag-Topic model prove an advance over the co-occurrence based model?
  3. How does the performance differ for users with different amounts of data available on social systems (i.e. active and less active users)?
  4. Will the filtering of less important words extracted from annotated documents improve the quality of personalized query expansion?
  5. Are the user profiles containing tags, docTerms and a combination of both equally effective when used in the context of personalized query expansion?

5.1 Experimental data

In order to evaluate these methods on real-world data, a crawl was conducted on the popular social tagging site delicious during December 2010. The crawling procedures used by other researchers to download the data from the web (Carman et al. 2008; Harvey et al. 2011) were followed to avoid biased records. To ensure a random sample of recent data, the most recent URLs submitted to delicious were downloaded and the usernames of the users who bookmarked them were recorded. After several iterations a sample of 12,043 unique usernames had been collected. Then, for each of these usernames, the user’s bookmarking records were downloaded by analyzing the corresponding web pages. The delicious API was not used because it restricts the volume of downloads, in this case to approx. 100 bookmarks. It is the authors’ belief that this process leads to more complete user profiles and a more comprehensive test corpus. Non-English users were filtered out based upon their tags because it was desirable to evaluate in a monolingual setting. Users with fewer than 10 personal tags were also excluded because of the difficulty of creating a user profile from so few tags. The actual web pages were then crawled, and a total of 5,943 users, 1,190,936 web pages and 283,339 tags were obtained. Table 3 provides statistics which describe the dataset used in the experiments, where, for example, “Max.tags” denotes the maximum number of distinct tags associated with each user.

Table 3 Statistics of delicious dataset sample

Four groups of users were created according to the number of bookmarks associated with the users: users with fewer than 50 bookmarks, denoted as DEL50; users with 50–100 bookmarks, denoted as DEL100; users with 100–500 bookmarks, denoted as DEL500; and finally users with more than 500 bookmarks, denoted as DELgt500. This choice reflects users who are active in the online social system as well as those who are less active, and is consistent with previous research (Xu et al. 2008; Wang and Jin 2010). Fifty randomly selected users from each group, together with their tagging records, were extracted to form a total collection of 200 test users. The English terms were processed in the usual way, i.e. down-casing the alphabetic characters, removing the stop words and stemming words using the Porter stemmer. All the pre-processed web pages are used in the experiments as the document corpus. No other filtering is conducted. All the information retrieval experiments were performed using the Terrier (Footnote 4) open source platform.

5.2 Evaluation methodology

For each user, 75% of his/her tags with annotated web pages were used to create the user profile, while the other 25% of his/her tags with annotated web pages were used as a test collection. Tags are subsequently used as queries (see Fig. 2 for example tags). Because the query expansion framework described in this paper uses individual user profiles to adapt the search, it is considered personalized search when compared to methods which rely on textual relevance only (such as BM25 (Robertson and Zaragoza 2009), used as a non-personalized baseline below).
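The 75%/25% division of a user’s tags can be sketched as below. The text does not state how the tags were assigned to the two parts, so the seeded random shuffle is purely an assumption of this sketch, as are the function name and its parameters.

```python
import random


def split_user_tags(tags, train_fraction=0.75, seed=0):
    """Split one user's distinct tags into profile (train) and test parts.

    A seeded shuffle is assumed here only for reproducibility of the sketch;
    the actual assignment procedure is not specified in the text.
    """
    unique_tags = sorted(set(tags))        # fixed order before shuffling
    random.Random(seed).shuffle(unique_tags)
    cut = int(len(unique_tags) * train_fraction)
    return unique_tags[:cut], unique_tags[cut:]
```

The web pages annotated with the first part would feed the user profile, and the tags in the second part would serve as test queries.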

A subset of users was also randomly selected to train the necessary parameters. Every effort was made to ensure there was no overlap between the training-set of users and the test-set of users.

The major challenge in evaluating a personalized search system is to determine which results are considered relevant and useful to a search query by a specific user. We employ the evaluation method used by previous researchers in personalized social search (Xu et al. 2008; Wang and Jin 2010). The main assumption is as follows: any document tagged by u with t is considered relevant for the personalized query (u, t), i.e. u submits the query t (Footnote 5). The dependency between personalization and evaluation was also eliminated following Carmel et al. (2009).

The following evaluation metrics were chosen to measure the effectiveness of the various approaches: the precision of the top 5 documents (P@5), mean reciprocal rank (MRR), mean average precision (MAP) and the recall of the top 5 documents (R@5). The first three measurements are commonly used to evaluate search algorithms while the last one is useful for evaluating query expansion systems as this method has been shown to improve both recall and precision in the past. The four metrics were calculated for each user and the mean of all the values was calculated, so that the average performance over test users could be computed. Statistically-significant differences in performance were determined using a paired t-test at a confidence level of 95%.
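For concreteness, the per-query versions of the four metrics can be sketched as follows, using their standard definitions. The function names are hypothetical, `ranked` is assumed to be a list of distinct document identifiers in rank order, and `relevant` is the set of documents the user tagged with the query tag.

```python
def precision_at_k(ranked, relevant, k=5):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k=5):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document (0 if none is retrieved)."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of the relevant documents."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)
```

Averaging `reciprocal_rank` and `average_precision` over queries and then over users gives MRR and MAP as reported in the tables.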

5.3 Experimental baselines and runs

In order to usefully evaluate the performance of the personalized query expansion framework, two different baselines were selected: BM25, a popular and quite robust probabilistic retrieval method, and BM25PRF, a pseudo-relevance feedback oriented query expansion method based on the Divergence from Randomness theory (Amati and Rijsbergen 2002). The latter approach has previously shown good results, and it is also a natural choice for evaluating the difference between expanding queries with terms selected from the user profiles and with terms selected from relevant documents (Footnote 6).

In addition to the non-personalized baselines, we have several personalized baselines. Firstly, a co-occurrence matrix was built following Chirita et al. (2007) and Biancalana and Micarelli (2009). For all the tags and documents in the training set, we first select the important terms (or keywords) with high tf-idf scores (20% were used in the experiment). This measurement appears to work better than using document frequency alone (Chirita et al. 2007). We then calculate the cosine similarity between two words \( w_{i} \) and \( w_{j} \) by using:

$$ \cos (w_{i} ,w_{j} ) = \frac{{DF_{{w_{i} ,w_{j} }} }}{{\sqrt {DF_{{w_{i} }} \cdot DF_{{w_{j} }} } }} $$

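where \( DF_{{w_{i} }} \) is the document frequency of word \( w_{i} \) and \( DF_{{w_{i} ,w_{j} }} \) is the number of documents in which both words occur. Note that here tags and docTerms are modeled in the same matrix. After obtaining this matrix, we use it in the two ways described in Sections 5.3.1 and 5.3.2.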
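The co-occurrence similarity can be sketched as follows, under the assumption that each training document is available as a set of its selected terms and tags; the function name and this data layout are hypothetical.

```python
import math


def cooccurrence_cosine(docs, w_i, w_j):
    """Cosine-style co-occurrence similarity based on document frequencies.

    docs is a list of sets of terms (one set per document).
    """
    df_i = sum(1 for terms in docs if w_i in terms)
    df_j = sum(1 for terms in docs if w_j in terms)
    df_ij = sum(1 for terms in docs if w_i in terms and w_j in terms)
    if df_i == 0 or df_j == 0:
        return 0.0                      # a word that never occurs has no associations
    return df_ij / math.sqrt(df_i * df_j)
```

The score is 1 when two words always occur together and 0 when they never share a document.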

5.3.1 Lexical matching

The user profile is processed by lexical matching between query terms and terms that exist in the user profile. This is achieved by the procedure described in Table 4, with an additional operation to calculate the correlation between each candidate word and all words/terms in the submitted query. This method leaves some queries un-expanded if no match is found in the user profile. It is denoted as lexical matching and co-occurrence statistics (LMCO) hereafter.

Table 4 Query expansion procedure based on lexical matching and co-occurrence based user profiles

5.3.2 Query expansion framework

In a similar fashion to how the latent graph is used in the framework, a graph G is built in which the nodes denote the terms and the edges E are weighted by their co-occurrence similarity. After normalization, matrix S is used in the usual way in the query expansion model. We denote this method as PQECO.

We are also interested in the personalization approach proposed by Bender et al. (2008), which is based on pure tag–tag relationships. An additional baseline is therefore included, based on the co-tagging activities a user performed. In this case, the user profiles contain training tags with their co-tagging statistics, computed using the Jaccard coefficient:

$$ J(t_{i} ,t_{j} ) = \frac{{|N_{{t_{i} }} \mathop \cap \nolimits N_{{t_{j} }} |}}{{|N_{{t_{i} }} \mathop \cup \nolimits N_{{t_{j} }} |}} = \frac{{|N_{{t_{i} }} \mathop \cap \nolimits N_{{t_{j} }} |}}{{|N_{{t_{i} }} | + |N_{{t_{j} }} | - |N_{{t_{i} }} \mathop \cap \nolimits N_{{t_{j} }} |}} $$

where \( |N_{{t_{i} }} | \) denotes the number of documents tagged by \( t_{i} \) and \( |N_{{t_{i} }} \mathop \cap \nolimits N_{{t_{j} }} | \) denotes the number of documents co-tagged by \( t_{i} \) and \( t_{j} \). This method is described in detail in Table 5 and denoted as COTAG hereafter.
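The co-tagging statistic can be sketched as follows, assuming that the documents tagged with each tag are available as collections of document identifiers; the function name is hypothetical.

```python
def jaccard(docs_i, docs_j):
    """Jaccard coefficient between the document sets of two tags."""
    docs_i, docs_j = set(docs_i), set(docs_j)
    union = docs_i | docs_j
    if not union:
        return 0.0                      # neither tag annotates any document
    return len(docs_i & docs_j) / len(union)
```

For example, two tags that annotate three documents each and share two of them obtain a coefficient of 2/4 = 0.5.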

Finally, there are three variants of the proposed query expansion method, depending on the characteristics of the user profiles. For user profiles which only contain terms extracted from the documents, the algorithm is denoted as PQE_terms; for user profiles which only contain tags, it is denoted as PQE_tags. PQE_mix denotes the query expansion method which uses user profiles containing a mixture of terms and tags.

Table 5 Query expansion procedure based on co-tagging

5.4 Parameter setting

One important part of the experiment involved establishing the optimal values for the various parameters discussed above. The following section describes in detail how values for \( \mathcal{O} \) and δ (components of the user profile construction) and μ and γ (important elements in the query expansion process) were selected. While there are still many parameter settings in addition to these four in the algorithms used by this research, they are only briefly mentioned in the following section as they are deemed less important.

5.4.1 Setting the Tag-Topic modeling

The number of topics \( \mathcal{O} \) and the number of top terms per topic δ are used in the user profile creation process to control the association accuracy and the number of terms in the profile. The selection of the number of topics is illustrated in Fig. 4a: the best value of this parameter in terms of MAP was 5, with fewer topics preferred. This partially indicates the power of latent features in the framework. We did not test topic numbers exceeding 10 for two reasons: firstly, low numbers of topics tend to work better; secondly, although the dataset is large, the suitable number of topics for each user tends to be small because a user profile only contains a limited number of tags and document terms. Testing larger values would be highly desirable if the user profiles in the collection were enlarged, which is planned for future work. A number of runs were executed with a spread of settings from 10 to 100 for the parameter δ. As shown in Fig. 4b, the best values were obtained when the number of words per topic was 20. This further confirms the argument above that not all terms should be considered of equal importance when being added to the user profile, as was the case in previous research (Biancalana and Micarelli 2009). Accordingly, δ was set to 20 in all subsequent experiments.

Fig. 4 The impact of varying parameters \( \mathcal{O} \) (a) and \( \delta \) (b)

5.4.2 Setting the query expansion framework

Previous work on the regularization parameter \( \mu_{1} \) suggests that a larger value will give optimal performance (Zhou et al. 2004). To verify this finding, we conducted a number of runs against the training-set of users, with a spread of settings for the \( \mu_{1} \) parameter. As illustrated by Fig. 5a, the highest MAP scores were obtained when \( \mu_{1} = 0.9 \). In general, the number of expansion terms γ should be within a reasonable range in order to produce consistently good performance. Too many expansion terms not only consume more time in the retrieval process, but can also have adverse side-effects on retrieval performance. This effect is even stronger when choosing how many tags should be added to the query. We examined the performance of the query expansion using 5, 10, 15, and 20 expansion terms and 1–10 expansion tags on the training data. The results are shown in Fig. 5b for terms and Fig. 5c for tags. The best performance is obtained with around 10 terms and 1 tag. It is worth noting that the curve produced by tags is flatter than that produced by terms, and that it drops steadily after 1 tag. This was attributed to the fact that using tags alone for expanding queries is not sufficient to achieve optimal performance. It was decided to set \( \mu_{1} = 0.9 \), γ = 10 for terms and γ = 1 for tags in all subsequent experiments.

Fig. 5 The impact of varying parameters \( \mu_{1} \) (a), γ for terms (b) and γ for tags (c)

5.4.3 Setting other parameters

The number of top documents \( \mathcal{D}^{top} \) used in the query expansion framework was set to 10 empirically (from 100 retrieved documents), and the hyper-parameters of the topic model were set to α = 0.1 and β = 0.1. The parameters for BM25PRF were also set to 10 documents and 10 terms, as these gave the highest retrieval effectiveness. The BM25 parameters were left unchanged from their default values in the Terrier distribution. Finally, the numbers of expansion terms for LMCO and COTAG were set to 10 and 5, respectively, as these provided the best performance.

6 Experimental results

In this section we present our experimental results. The personalized model is first compared to the non-personalized model. We then compare the effectiveness of the proposed query expansion framework with personalized query expansion based on lexical matching, and compare user profiles built with the Tag-Topic model against those built using co-occurrence statistics. Finally, we compare the performance of the proposed model with expansion terms obtained by pure co-tagging activities. We conclude this section by detailing a comparison between different groups of users.

6.1 Personalization versus non-personalization

6.1.1 Overall performance

This set of experimental results describes the performance, over all test users, of the three personalized query expansion runs proposed in this paper together with the two non-personalized baselines; the results are shown in Table 6. Statistically significant differences are marked with † w.r.t. the BM25 baseline and with * w.r.t. the BM25PRF baseline.

Table 6 Performance of the six personalized query expansion runs and two non-personalized baselines

As illustrated by the results, the BM25PRF model was the lowest performer for all evaluation metrics. This result is not surprising because the evaluation described in this paper is based upon a personalized approach rather than the non-personalized evaluation model normally employed in the large evaluation campaigns. This further demonstrates that merely borrowing common techniques from traditional IR will not solve the personalized search problem. Pleasingly, the three personalized query expansion-based search models all outperform the simpler text retrieval model, with the highest improvement being 28.95% (for the PQE_mix method with the MRR metric when compared to BM25), which is statistically significant. It should be noted that the three personalized query expansion methods provide an average improvement of 40.74% over the traditional query expansion method, with the highest improvement at 61.15%, which is also statistically significant.

There were noticeable improvements in retrieval effectiveness when using user profiles which consisted of terms, or of a mixture of terms and tags, in query expansion, but a more modest increase for the user profiles which consisted of tags alone. This reinforces the earlier finding that using tags alone for expanding queries is not sufficient. Another exciting observation is that, in many cases, the personalized query expansion methods, even though tuned for MAP, can outperform the baselines on all the evaluation metrics, with statistically significant improvements in almost all runs. We delay the discussion of recall results (R@5) until later in this section.

6.1.2 Expansion terms differences

In addition to the overall performance evaluation, a side-by-side comparison of example expansion terms produced by the PQE_terms and BM25PRF models for selected queries is shown in Fig. 6. The two methods produced very different sets of expansion terms for a query. The personalized query expansion method presents many good personalized recommended terms closely resembling the user’s interests, while the BM25PRF model generates terms more closely aligned with the corpus statistics. For example, the suggested terms for the query “logo” produced by the personalized model are specifically about logo design for particular companies (expansion terms: “design”, “web”, “page”, “mobil”, “intel” (Footnote 7), etc.), while the terms extracted by the BM25PRF method are about more general company logo design (expansion terms: “busi”, “company”, “corpor”, “design”, etc.). Similar differences can be found in other cases. For example, for the source query “swap” the BM25PRF method selected terms which are about swapping hardware (expansion terms: “disk”, “mb”, “memori”, etc.), while the user is in fact interested in articles about swapping online games (expansion terms: “game”, “free”, “onlin”, “product”, etc.).

Fig. 6 Example expansion terms for the selected queries using PQE_terms and BM25PRF

6.1.3 Individual analysis

To further explain the success of the personalized query expansion methods at an individual user level, a separate figure is presented which depicts the performance of all the users in the DEL50 group. As shown in Fig. 7, personalized runs outperformed non-personalized runs on many users (around 62%). Only 12% of users have non-personalized baselines which outperformed the personalized approach. This indicates that for most users, it is beneficial to personalize their search results.

Fig. 7 Individual analysis

6.2 Personalized query expansion using lexical matching and co-occurrence statistics

The goal of the second set of experiments is to evaluate the performance of personalized query expansion using LMCO (Section 5.3.1), in comparison with the proposed framework using the same co-occurrence matrix (PQECO, Section 5.3.2). Experimental results are shown in Table 6 and visualized in Fig. 8, together with the non-personalized baseline BM25 and the results obtained by the PQE_mix method, which achieved the best performance in the previous subsection. Statistically significant differences in the table are marked with l w.r.t. the LMCO baseline and with p w.r.t. PQECO, for the PQE_mix method only.

Fig. 8 Comparison with personalized baselines

As we can see from the figure, query expansion based solely on co-occurrence statistics and lexical matching is unsatisfactory. Although its performance is better than the non-personalized baseline in terms of the MRR and P@5 metrics, it is even lower in terms of MAP. After examining the expanded terms in the LMCO model, it was found that, because of the nature of social tagging systems, many tags are freely chosen and differ from the terms stored in the user profiles, leaving a large number of queries un-expanded. Furthermore, the expanded terms are sometimes noisy, resulting in lower performance than the BM25 baseline.

However, using the same co-occurrence matrix as in LMCO, the PQECO method works much better, with performance just slightly lower than the PQE_mix method. In some metrics it appears to work better than PQE_terms. This shows the power of using pseudo-relevance feedback documents to enhance the word graphs. The effectiveness of the Tag-Topic model is also empirically confirmed, since PQE_mix works better than PQECO. It should be noted that the improvements achieved by PQE_mix in comparison to PQECO and LMCO are statistically significant.

6.3 Comparison with co-tagging

We now examine the performance of the proposed model with expansion terms obtained by pure co-tagging activities. The results are also reported in Table 6 (statistically significant differences are marked with c, for PQE_mix only) and in Fig. 8. Surprisingly, this approach outperforms the method that uses LMCO. As these two methods share some similarities (for example, if terms in the query are not found in the user profile, the query remains unexpanded in the final run), we can only conclude that tags are more effective in personalized query expansion. The results also confirm that the proposed personalized query expansion framework outperforms all the personalized query expansion baselines used in the experiments described in this paper, with the highest improvement reaching 31.38% in terms of MAP (PQE_mix against LMCO). The use of the Tag-Topic model in the framework leads to a highest improvement of 6.67% in terms of MAP over the use of the co-occurrence matrix in the framework.

6.4 Comparison between groups

Next, the performance differences between different groups of users are considered. Figure 9 shows a comparison of performance for the four groups in the context of user activities with the MAP metric (similar behavior is observed for the other metrics). As can be seen, the results are mixed. The personalized query expansion methods still outperform the non-personalized approaches, except in one case where PQE_tags performed worse than the baseline in the DEL100 group. However, the differences between the PQE_tags and PQE_terms methods vary among the groups of users: for example, PQE_tags achieves better MAP than PQE_terms in the DELgt500 group, but the same is not true for the other groups. The most effective method is PQE_mix, which is better than the other methods both for users who are more active on social systems and for those who are less active. Moreover, the improved performance for less active users demonstrates that this method is very effective even for users who do not have much data available in online social tagging systems.

Fig. 9 Comparison between different user groups

6.5 Recall results

In addition to the precision-based measurements, the personalized query expansion methods also showed significant improvements with regard to the recall-based metric R@5. These improvements are on a similar scale when compared to the non-personalized and various personalized baselines. This reveals the benefits of adopting this framework even in situations where recall is important. An illustration of these findings is supplied in Fig. 10, which gathers together recall-precision plots for the eight retrieval systems. As can be seen in this figure, the improvements of the proposed personalized methods are consistent. Generally speaking, the personalized approaches achieve better recall performance than the non-personalized approaches.

Fig. 10 Recall-precision plots for the eight retrieval systems

7 Conclusion and future work

In this article a novel query expansion framework was described which is based on individual user profiles mined from the annotations and resources the users bookmarked. The intuition behind the model is the prior assumption of term consistency: the most appropriate expansion terms for a query are likely to be associated with, and influenced by terms extracted from the documents ranked highly for the initial query. A Tag-Topic model was also introduced which simultaneously integrates the annotations and web documents through a statistical model in a latent space graph. The proposed personalized technique performed well on the social data crawled from the web, delivering statistically significant improvements over non-personalized and personalized representative baseline systems with improvements up to 61.15% and 31.38%, respectively. The effectiveness of the approach was also demonstrated for users with different levels of tagging activity and at an individual user level.

This research continues along several dimensions. The proposed framework is a general query expansion framework: it can utilize not only social media data, but also data obtained from web search logs. Future work is planned to apply this framework to a large, real-world web search log to test its effectiveness. Also, an optimization algorithm is being designed for incremental updates of the user profiles and for optimal graph generation. Latent semantic indexing (Deerwester et al. 1990) is an alternative to the topic models used to calculate the semantic information, and an evaluation of this approach will be included in future work to compare performance. Future work will also include the integration of query expansion and results re-ranking: information from the user profile would be used to re-rank feedback documents, and then a subset of the documents that are most relevant to the user could be selected and terms extracted from them to be added to the query.