1 Introduction

In the last few years there has been growing interest from the IR community in how the implicit feedback from users can be used to deliver a better search experience. For example, this feedback has been shown to allow a better understanding of users’ search needs and to help establish document relevance with a higher degree of accuracy than standard IR techniques. While the analysis of user interaction is potentially relevant for any IR system, most work in the literature concentrates on Web logs, because they are widely available in many contexts: they are stored by Web search engines, Internet providers and by many institutions and enterprises that manage an HTTP proxy as a gateway for their Intranets.

A common approach is to represent the user feedback as a query/document bi-partite graph (Craswell and Szummer 2007; Baeza-Yates et al. 2005b), where queries are connected to the documents selected by the users for them, and vice versa. The bi-partite graph has been shown to be useful in improving the ranking of pages, for example by providing additional results for a query selected among the relevant results of similar queries (Craswell and Szummer 2007). Using random walks over the bi-partite graph, it is also possible to determine groups of related entities, which can be used in several applications like document or query suggestion (Baluja et al. 2008).

Other authors focused on modeling the users’ query refinements (sequences of queries issued in a single search session) by building a query-flow graph (Boldi et al. 2008; Baeza-Yates et al. 2005b). Another related approach has been proposed in the context of modeling folksonomies, which can be represented as a tri-partite document-user-tag graph; this representation has been successfully applied to Web ranking tasks (Hotho et al. 2006).

In spite of the good results obtained by these approaches, these data representations do not fully capture the richness of the information available in the Web logs. In this paper, we start from the query-document bi-partite graph proposed in (Craswell and Szummer 2007), which is augmented to represent each single user as a separate node. User nodes are then linked to the queries they have issued and to the documents they have selected. Furthermore, we directly represent query refinements performed by the users as separate transitions between the corresponding query nodes in the graph, similarly to (Boldi et al. 2008). The resulting data structure is a much richer representation of the collective search sessions, since it can compactly and consistently represent all the fundamental actions performed by the users, such as issuing queries, selecting documents, and refining queries. We present two general learning frameworks to process this data structure in supervised and unsupervised Web mining tasks. The frameworks scale up to graphs with billions of nodes, which are the norm in the Web context. The experimental results evaluate the proposed learning frameworks and confirm that this extended representation can be successfully used in both clustering and classification tasks. In particular, we report experiments on discovering semantically relevant query suggestions and on text categorization by topic.

The outline of the paper is the following: Sect. 2 shows how the graphical representation of the Web logs is built. Section 3 introduces some applications that can potentially be addressed using the proposed data structure. Section 4 presents the unsupervised algorithmic solutions that have been devised to discover related entities over the graph, while Sect. 5 presents the approach adopted to perform entity classification over the graph. Section 6 reports experiments showing the effectiveness of the proposed solution on the tasks of discovering query suggestions and of Web page categorization by topic. Finally, Sect. 7 draws some conclusions (Fig. 1).

2 A complete graphical representation of web logs

Three classes of entities emerge as the fundamental actors when analyzing Web sessions: users, queries and documents. These entities feature specific relationships among them: a user issues a query for which a set of documents is returned. The user then either selects a document, or refines the previous query, or ends the search session.

Most work in the literature has so far focused on creating a graphical model of a subset of the information available in the logs by using either:

  • the query-document bi-partite graph (Craswell and Szummer 2007), connecting each query to the documents that have been selected;

  • the query-flow graph representing query refinements (Jones et al. 2006; Boldi et al. 2008);

  • the document-user-tag tri-partite graph to represent folksonomies (Hotho et al. 2006).

However, richer models and algorithms that are able to represent and take into account all the users’ actions composing the search sessions could provide a significant advantage in many applications. This section attempts to define this more representative data structure.

Let G be a graph formed over a set of nodes, each of which corresponds to an entity in the logs (either a user, a query or a document). As explained in the following sections, the nodes are connected based on relationships among the corresponding entities. Let \(\mathcal{Q},\;\mathcal{D},\;\mathcal{U}\) be the set of nodes representing the queries, the documents and the users, respectively. It holds that each node in G belongs to exactly one of \(\mathcal{Q},\;\mathcal{D},\; \mathcal{U}\). Thus, \(\mathcal{Q} \cap \mathcal{D} = \mathcal{Q} \cap \mathcal{U} = \mathcal{D} \cap \mathcal{U} = \emptyset\).

Each edge connecting two nodes in G is assigned a weight modeling the strength of the connection. In the following paragraphs, we explain how the connections are formed and how the corresponding weights are determined. While the employed heuristics are natural and sound, they are by no means the best conceivable, and other approaches could be pursued.

2.1 Queries and documents

This portion of the graph has been extensively studied in the previous literature. We followed exactly the approach proposed in (Craswell and Szummer 2007; Szummer and Craswell 2008) to establish the connections and set the weights between query and document nodes. Let \(s(q,d)\) be the number of times that the document d has been selected by a user after having issued the query q. Then,

$$ w(q,d) = \alpha_{QD} {\frac{s(q,d)} {S_{QD} + \sum_{i\in {\mathcal{D}}} s(q,i)}} $$

where \(S_{QD}\) is a smoothing factor which penalizes the connections departing from nodes that have been observed only a few times in the available logs and for which the weights cannot be estimated with high confidence (relations for which little evidence is available tend to be assigned a smaller strength weight; this is a common assumption made in Bayesian parameter estimation). \(\alpha_{QD}\) is a parameter which determines the limit for the total sum of the query-document edge weights exiting from any query node q. If \(\sum_{i\in \mathcal{D}} s(q,i) \gg S_{QD}\), then \(\sum_{i\in \mathcal{D}} w(q,i) \approx \alpha_{QD}\). In our implementation, \(\alpha_{QD}\) is fixed for all query nodes.

A document is connected to the set of queries for which it has been selected. The weight associated to each connection is

$$ w(d,q) = \alpha_{DQ} {\frac{s(q,d)} {S_{DQ} + \sum_{i\in {\mathcal{Q}}} s(i,d)}} $$

where \(S_{DQ}\) is the smoothing factor for this set of edge weights and \(\alpha_{DQ}\) is a parameter determining the limit for the total sum of the document-query weights for each document node.

Please note that document-to-query and query-to-document connections are symmetric, but their associated weights are not. This asymmetry is needed so that the sums of the outgoing weights retain a probabilistic interpretation.
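
As a concrete illustration, the following Python sketch computes the smoothed query-document and document-query weights from raw selection counts. The function and variable names (s_counts, alpha_qd, S_qd, etc.) are hypothetical and not taken from the paper; the sketch simply instantiates the two formulas above.

```python
from collections import defaultdict

def query_document_weights(s_counts, alpha_qd, S_qd, alpha_dq, S_dq):
    """Sketch of the smoothed weights of Sect. 2.1.

    s_counts: dict mapping (query, document) pairs to the number of times
    the document was selected after the query, i.e. the s(q, d) counts.
    Returns two dicts holding w(q, d) and w(d, q).
    """
    # Total selections per query and per document.
    per_query = defaultdict(float)
    per_doc = defaultdict(float)
    for (q, d), c in s_counts.items():
        per_query[q] += c
        per_doc[d] += c

    w_qd, w_dq = {}, {}
    for (q, d), c in s_counts.items():
        # w(q, d) = alpha_QD * s(q, d) / (S_QD + sum_i s(q, i))
        w_qd[(q, d)] = alpha_qd * c / (S_qd + per_query[q])
        # w(d, q) = alpha_DQ * s(q, d) / (S_DQ + sum_i s(i, d))
        w_dq[(d, q)] = alpha_dq * c / (S_dq + per_doc[d])
    return w_qd, w_dq
```

The same pattern, with the visit counts v(u, d) and the issue counts n(q, u) in place of s(q, d), yields the user-document and user-query weights of Sects. 2.2 and 2.3.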

2.2 Users and documents

Every time a user visits a document, he/she is expressing a preference for that document to some extent. Let \(v(u,d)\) be the number of times a user u visited a document d; the weight of the \((u,d)\) connection is then set equal to

$$ w(u,d) = \alpha_{UD} {\frac{v(u,d)} {S_{UD} + \sum_{i\in {\mathcal{D}}} v(u,i)}} $$

where \(S_{UD}\) is the smoothing factor used for this category of weights and \(\alpha_{UD}\) determines the total strength of the user-document connections out of any user u.

Similarly, the weight of the connection from a document to a user is determined by

$$ w(d,u) = \alpha_{DU} {\frac{v(u,d)} {S_{DU} + \sum_{i\in {\mathcal{U}}} v(i,d)}} $$

where \(S_{DU}\) is another smoothing factor and \(\alpha_{DU}\) determines the total strength of the document-user connections for each single document node d.

Please note that the definition of visit has been left very general on purpose: a user can visit a document after having issued a query to a search engine or, in general, as the result of his/her browsing activity. Search engines can track only the user visits that follow a search, although the entire browsing session may occasionally be available via toolbar logs. An Internet provider or an HTTP proxy for an Intranet can always track all user visits, creating a larger number of connections. Our model can deal with both cases, even if only visits following searches have been used in the experimental section.

2.3 Users and queries

Let \(n(q,u)\) be the number of times a query q has been issued by a user u. Using the same approach as for the previous entity types,

$$ w(q,u) = \alpha_{QU} {\frac{n(q,u)} {S_{QU} + \sum_{i\in {\mathcal{U}}} n(q,i)}} $$

and

$$ w(u,q) = \alpha_{UQ} {\frac{n(q,u)} {S_{UQ} + \sum_{i\in {\mathcal{Q}}} n(i,u)}} $$

where \(S_{UQ}\) and \(S_{QU}\) are the smoothing factors, while \(\alpha_{UQ}\) and \(\alpha_{QU}\) determine the upper limit of the total sums of the user-to-query and query-to-user edge weights, respectively.

2.4 Query refinements

Query-query connections model query refinements. A query refinement is a pair of queries that are often issued in sequence by the users and that share a common or closely related search goal. Those queries should be treated as part of a single search session. This class of node connections is harder to model starting from the logs, because it is often not clear whether a query refinement has happened or there has been a shift in the goal of the search activity (in the latter case the queries do not belong to the same session and should not be connected). Determining whether two queries are part of the same logical session is still an open research problem (He et al. 2002; Huang et al. 2004) and the commonly employed techniques rely on simple heuristics. In our implementation, we consider two queries as part of the same session if they are consecutive and they are issued within 300 s of each other. While this definition triggers many false positives, the weight computation described in the rest of this section keeps the resulting noise under control.

Let |q| be the number of times that a query q has been issued. The prior probability that a user will issue q as his/her next query is \(p(q) = {\frac{|q|} {\sum_{v \in \mathcal{Q}} |v|}}\). Two queries are connected by a link whose weight is

$$ w(q,q^\prime) = \max \left(\alpha_{QQ} \left( p(q^\prime | q) - p(q^\prime) \right), 0\right) $$
(1)

where \(\alpha_{QQ}\) is a constant parameter setting the upper limit for the total sum of the refinement weights for any query node.

As shown by (1), the weight of the link connecting a query q to one of its refinements \(q^\prime\) is strictly positive if and only if the probability that a user will issue \(q^\prime\) as a refinement of q is larger than the prior probability of \(q^\prime\). If a query does not have any refinement that occurs more often than its prior probability, no refinement connections are established.

Let \(r(q,q^{\prime})\) be the number of times that a user refined the query q with \(q^\prime\). \(p(q^\prime | q)\) is estimated by \({\frac{r(q,q^\prime)} {S_{QQ} + |q|}}\), where \(S_{QQ}\) is a smoothing factor that limits the noise introduced by refinements that have been observed only a few times.

Substituting in (1) the estimates for \(p(q^\prime | q)\) and \(p(q^\prime)\), yields

$$ w(q,q^\prime) = \max \left(\alpha_{QQ} \left( {\frac{r(q,q^\prime)} {S_{QQ} + |q|}} - {\frac{|q^\prime|} {\sum_{v \in {\mathcal{Q}}} |v|}} \right), 0\right) $$
(2)
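
As a sketch of how (2) could be implemented, the fragment below computes the refinement weights from session-extracted refinement counts. All names are hypothetical, and the input counts are assumed to have been mined from the logs with the 300 s session heuristic described above.

```python
def refinement_weights(r_counts, query_counts, alpha_qq, S_qq):
    """Sketch of the query-refinement weights of Eq. (2).

    r_counts: dict mapping (q, q') to the number of times q was refined
    into q' within a session; query_counts: dict mapping q to |q|.
    """
    total = float(sum(query_counts.values()))  # sum of |q| over all queries
    w_qq = {}
    for (q, q_next), r in r_counts.items():
        p_cond = r / (S_qq + query_counts[q])    # smoothed estimate of p(q'|q)
        p_prior = query_counts[q_next] / total   # prior p(q')
        w = alpha_qq * (p_cond - p_prior)
        if w > 0.0:                              # keep only strictly positive weights
            w_qq[(q, q_next)] = w
    return w_qq
```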

2.5 Building the complete graph

As explained in the next sections, some Web mining applications require performing random walks over the graph representation of the logs. In order to have a parameter controlling the speed of convergence of the walk, it is useful to add a self-connection to each node. We indicate with s(i) the weight of the self-connection for node i. Larger self-connection weights increase the expected number of iterations needed to move away from any starting point during the random walk.

In our setting, all the self-connection weights are initially assigned the same value s; in the rest of this section we explain how they are later adjusted for each single node.

The graph G is built by considering the self-connections together with all the connections defined in Sects. 2.1, 2.2, 2.3 and 2.4. In order to be able to perform random walks over the generated graph, the weights of the connections exiting each node should sum up to one. Therefore, we impose that \(s + \alpha_{QU} + \alpha_{QD} + \alpha_{QQ} = 1\) for each node corresponding to a query, \(s + \alpha_{DQ} + \alpha_{DU} = 1\) for each document node and, finally, \(s + \alpha_{UD} + \alpha_{UQ} = 1\) for each user node.

Unfortunately, the smoothing factors used for all the transition weights prevent the weights out of a node from summing up to one. Consider for example the query-document weights: for any node q, it holds that \(\sum_{d\in \mathcal{D}} {\frac{s(q,d)} {S_{QD} + \sum_{i\in \mathcal{D}} s(q,i)}} < 1\). Thus, \(\sum_{d\in \mathcal{D}} w(q,d) < \alpha_{QD}\).

For each query node q, we introduce the set of differences that measure the amount needed to fulfill the probabilistic constraint

$$ \begin{aligned} \lambda_{QD}(q) &= \alpha_{QD} - \sum_{d \in {\mathcal{D}}} w(q,d)\\ \lambda_{QU}(q) &= \alpha_{QU} - \sum_{u \in {\mathcal{U}}} w(q,u)\\ \lambda_{QQ}(q) &= \alpha_{QQ} - \sum_{q^\prime \in {\mathcal{Q}}} w(q,q^\prime) \end{aligned} $$

Similarly for each document node d,

$$ \begin{aligned} \lambda_{DQ}(d) &= \alpha_{DQ} - \sum_{q \in {\mathcal{Q}}} w(d,q)\\ \lambda_{DU}(d) &= \alpha_{DU} - \sum_{u \in {\mathcal{U}}} w(d,u) \end{aligned} $$

And, finally, for each user node u,

$$ \begin{aligned} \lambda_{UD}(u) &= \alpha_{UD} - \sum_{d \in {\mathcal{D}}} w(u,d)\\ \lambda_{UQ}(u) &= \alpha_{UQ} - \sum_{q \in {\mathcal{Q}}} w(u,q) \end{aligned} $$

These quantities can significantly differ over the nodes.

For each node i, the overall amount missing to achieve stochasticity is

$$ \lambda(i) = \begin{cases} \lambda_{QD}(i) + \lambda_{QU}(i) + \lambda_{QQ}(i) & \text{if}\; i \in \mathcal{Q} \\ \lambda_{DQ}(i) + \lambda_{DU}(i) & \text{if}\; i \in \mathcal{D} \\ \lambda_{UQ}(i) + \lambda_{UD}(i) & \text{if}\; i \in \mathcal{U} \end{cases} $$

The self-connection weight can be used to fulfill the probabilistic normalization for each node i by adding λ(i) to the initial value s

$$ s(i) = s + \lambda(i) $$

This choice models the fact that the less evidence is globally available at a node, the less reliably its connections can be established. In the context of random walks, this increases the probability of “conservatively” remaining in the current state (node).
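
The normalization described in this section can be summarized by a short sketch: assuming the weights of Sects. 2.1–2.4 have already been collected into a sparse matrix W (with a zero diagonal), every row falls short of (1 − s) by exactly λ(i), and the self-connection absorbs the missing mass. The code below is a minimal illustration under these assumptions, not the authors’ implementation.

```python
import numpy as np
from scipy import sparse

def add_self_connections(W, s):
    """Sketch of Sect. 2.5: W is a scipy CSR matrix whose rows sum to at
    most (1 - s) because of the smoothing factors. A self-connection
    s(i) = s + lambda(i) is added to each node so that every row of the
    resulting matrix sums to one (i.e. the matrix becomes stochastic).
    """
    row_sums = np.asarray(W.sum(axis=1)).ravel()
    # lambda(i) is the mass missing to reach (1 - s) on row i.
    lam = (1.0 - s) - row_sums
    self_weights = s + lam
    W_full = W + sparse.diags(self_weights)
    return W_full.tocsr()
```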

3 Applications

The proposed graphical representation of Web logs has many potential applications, which can be approached within two different frameworks. Unsupervised algorithms can be applied to group the nodes of the graph according to some similarity criterion, for instance to find related queries, users or documents. On the other hand, supervised multi-class classification requires determining whether each node of the graph belongs to any of a set of predefined categories. This is commonly performed by assigning a score for each category to each node. The scores are learned starting from a set of labeled nodes with supervised scores. A non-exhaustive list of applications falling in the first group is the following:

  • Query recommendation. This problem can be formulated as a semantic clustering problem: given a query, find other queries that could help the user in finding what he/she is looking for (Beeferman and Berger 2000; Wen et al. 2001; Baeza-Yates et al. 2005a; Zhang and Nasraoui 2006; Baeza-Yates and Tiberi 2007; Donato and Gionis 2010). This task is a precious help to users in the context of informational queries (Broder 2002), for which the user does not have an exact goal in mind when starting his/her search, or whenever the user has limited knowledge about the topic and the collective knowledge of the community can help him/her formulate the query correctly. The importance of this task is shown by the fact that a large portion of queries are reformulated by Web users after having evaluated the search results (Jansen et al. 2007).

  • Related document suggestion. This is also a semantic clustering problem: given a document that is relevant to the user, find other related documents that the user may like. This task has been extensively studied in the context of semantic clustering of search results (Zeng et al. 2004; Ferragina and Gulli 2008; Carpineto et al. 2009), classical content-based IR (Broder et al. 1997; Zamir and Etzioni 1998; Cooper et al. 2002; Berry and Castellanos 2007), or using the hyperlink structure of the Web graph (He et al. 2002; Wang and Kitsuregawa 2002; Flake et al. 2002).

  • User clustering and personalization. Given a user, find other users that share common interests. This is commonly done to provide personalized result rankings to a user, based on the search behavior of other similar users (Pierrakos et al. 2003; Adomavicius and Tuzhilin 2005).

A few examples of tasks falling in the second category are:

  • Document ranking. Given a query, order the documents by their relevance. This is commonly approached by learning a ranking function from a set of (query-perfect_rank) examples (Cohen et al. 1999; Radlinski and Joachims 2005; Burges et al. 2005; Liu 2009).

  • Document categorization. The goal of this task is to classify each document as belonging or not to a given category (Joachims 1998; Sebastiani 2002). When multiple categories are given, it is possible to build one classifier for each category. For example, porn page filtering can be formulated as a classification-by-topic task that is particularly relevant for many Web applications. Most porn detectors make decisions based on both the visual information and the text on the page. However, text-based methods can be easily fooled by malicious Web masters (spammers), who have full control of the page content, or even by legitimate pages with little textual content.

3.1 Target applications

The experimental section of this paper is focused on the following two applications: discovering query suggestions (unsupervised Web mining task) and classification-by-topic (supervised Web mining task).

The problem of finding good query suggestions has been extensively studied in the literature. Most of the studies concentrated on using query refinements (Baeza-Yates et al. 2004; Boldi et al. 2008), the intersection among the result sets using the query-document bi-partite graph (He et al. 2002), or term-based statistics (Collins-Thompson and Callan 2005). In this paper we employ the Web logs to extract semantically relevant suggestions by observing that similar queries tend to be connected to similar sets of selected documents and tend to be issued by similar sets of users. Furthermore, related queries are often issued sequentially in the same search sessions. The Web log graph proposed in this paper captures the information needed to take advantage of all these properties at the same time and allows using unified processing methodologies, as explained in Sect. 4.

Document classification-by-topic is a very well studied problem. Most of the work in the literature concentrates on classification using the textual content (Sebastiani 2002). A few papers do not consider documents as separate entities, but embed them into a network of connections. For example, (Chakrabarti et al. 2001; Fürnkranz 1999) exploit the HTML links among Web pages to improve the performance of a classifier. However, we think that the information stored in the logs is more direct and valuable than that provided by the HTML links. We propose to use a diffusion algorithm similar to the one proposed by (Zhou et al. 2004, 2005; Zhou and Scholkopf 2004). As described in Sect. 5, the diffusion algorithm starts from a small set of example documents on a topic T to discover new documents belonging to the same topic. Indeed, documents belonging to the same topic tend to attract visits from users sharing a common interest in the topic. It is therefore likely that a user who visited an on-topic document also visited other similar ones. The same reasoning holds for query-document connections: a specific query determines the topic of the documents that are selected by the users after having issued it. So, queries and users can be used as a gateway to extend our supervised knowledge to other documents. Since the connections are built from the contextual behavior of the users, who can be assumed to semantically understand the content of the pages, the proposed methodology should be more accurate on pages mainly containing images (and no text) and more robust with respect to spammers than purely content-based approaches. This methodology is therefore a perfect candidate to complement content-based approaches for Web text categorization applications.

4 Markovian node clustering

The general framework described in this section can be used to detect related or similar entities on the graph; it is therefore suited to tackle any application involving unsupervised learning.

Please note that the matrix W collecting the weights w(i,j) defined in the previous section is stochastic by construction. Consider now the Markov process whose transition matrix is W: we indicate with \(w_n(i,j)\) the \((i,j)\)-th element of \({\varvec W}^n\), representing the probability that a surfer starting in node i will end up in node j after exactly n steps. Let us assume that i and j are nodes representing queries. In the simplified case where W represents a bi-partite graph containing only the query-document connections, the value \(w_2(i,j)\), obtained by iterating the Markov model for two steps, is non-null and grows with the size of the intersection between the two sets of documents selected for the queries i and j. The proposed model extends this concept by also considering the transitions between other node types. Therefore, \(w_2(i,j)\) can increase also if i and j tend to be issued by a similar set of users and/or if j is a common refinement of i. By computing higher powers \({\varvec W}^n\), n > 2, the clusters are grown recursively and objects representing more distant semantic concepts are merged. Generally speaking, \(w_n(i,j)\) indicates how semantically similar the element j is to the element i according to the users’ search sessions. Please note that since the initial matrix is stochastic, any power of the matrix remains stochastic. This means that the value of any element (i,j) remains in the [0, 1] range.

For the i-th entity in the graph, the most related entities are discovered by starting the random walk from the i-th node and letting it run for n steps. The similarity degree between entities i and j is proportional to the probability of ending the walk in node j. Please note that once \({\varvec W}^n\) has been computed, it is not necessary to run a separate walk for each entity, as the i-th row of \({\varvec W}^n\) already contains the probability distribution over the entities of the graph for a walk starting in i. This means that it is possible to rank the entities by their similarity with the entity i by sorting them according to their corresponding values on the i-th row of \({\varvec W}^n\). A threshold can then be defined to select the top entities to be returned for the target application. This clustering algorithm is similar to other clustering algorithms based on Markov processes, like MCL (Enright et al. 2002).

The optimal value for n cannot be determined a priori and depends on the application, which defines the desired semantic homogeneity of the elements. In particular, larger values of n will tend to create similarity relations that are less semantically homogeneous and should be used when higher recall is desired. For most semantic clustering applications on the Web, n ∈ [2, 5] should be appropriate. In the experimental section, the impact of n on the accuracy of the discovered similarity relations is studied in more detail. The self-connection weight s also has an influence on the selected value for n, since larger values slow down the diffusion of the information.

From an implementation point of view, W can be huge, containing billions of rows. However, W is typically very sparse and, as long as n is low, \({\varvec W}^n\) remains sparse in most applications. This allows us to take advantage of the fact that products between sparse matrices can be computed efficiently, and \({\varvec W}^n\) can be computed iteratively as \({\varvec W}({\varvec W}({\varvec W}(\ldots({\varvec W}{\varvec W}))))\). When working at a very large scale, the computation can be parallelized using a distributed computational schema like MapReduce (Dean and Ghemawat 2008), which has been successfully applied to matrix multiplication problems with billions of nodes (Papadimitriou and Sun 2008; Cohen 2009).
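
For illustration, a minimal single-source version of the walk can be written as repeated sparse matrix-vector products, which avoids forming \({\varvec W}^n\) explicitly when only a few source entities are of interest. The sketch below assumes a SciPy sparse, row-stochastic transition matrix; names, the default number of steps and the threshold are illustrative only.

```python
import numpy as np
from scipy import sparse

def top_related(W, node_index, n_steps=3, top_k=10, threshold=0.01):
    """Sketch of the Markovian clustering of Sect. 4: run an n-step random
    walk from a single node and rank the other nodes by the probability of
    ending the walk there. W is the sparse, row-stochastic transition
    matrix; node_index is the row of the entity of interest.
    """
    # Start with all the probability mass on the chosen node.
    p = np.zeros(W.shape[0])
    p[node_index] = 1.0
    for _ in range(n_steps):
        # W.T @ p is the row-vector product p W, i.e. one step of the walk.
        p = W.T @ p
    p[node_index] = 0.0  # drop the trivial self-similarity
    ranked = np.argsort(-p)[:top_k]
    return [(int(j), float(p[j])) for j in ranked if p[j] > threshold]
```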

5 Regularization in discrete domains for supervised tasks

Given a graph representing similarity relationships among a set of entities, it is possible to approach a classification task for the same entities using a transductive schema. In particular, we consider a general formulation of transductive node classification in discrete domains based on a regularization principle (Zhou and Scholkopf 2004; Zhou et al. 2004). This class of algorithms exploits a set of examples (labeled nodes) and the connections (graph edges) among the objects. Entity classification is performed by computing a function that is defined over the graph nodes (each node representing an object to be classified) and that is sufficiently smooth when considering nearby nodes. Basically, the labels assigned to a small set of supervised nodes are generalized to the other nodes by exploiting the graph topology. This approach is particularly interesting for the considered task, since it may be difficult to provide useful features to describe some of the considered entities (e.g., users) from query logs, whereas a set of similarity relationships can be easily computed as shown in Sect. 2.

Discrete domain regularization infers a function assigning a classification score to each node of the graph, starting from a small set of labeled nodes which are assigned a given target score. Typically, a score equal to +1 (−1) is used to indicate a page belonging (not belonging) to the target category. The values in the range (−1, 1) can be used to indicate the confidence of the classification of the node into the corresponding category, assuming the value 0 as the threshold for the decision. This approach allows us to define a binary classification scheme for each considered class (i.e., we decide whether to attach a given class label to an item without considering the results for the other classes). When facing a multiclass classification problem with mutually exclusive classes, the procedure is applied for each class and the final classification is performed by selecting the class yielding the maximum value (possibly rejecting the item if the maximum is below a given confidence threshold).
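
For clarity, the multiclass decision rule just described amounts to a simple argmax with an optional rejection threshold, as in the following hypothetical fragment.

```python
import numpy as np

def multiclass_decision(scores, reject_threshold=0.0):
    """Sketch of the multiclass rule: `scores` holds the per-class scores
    computed for one node; return the index of the highest-scoring class,
    or None (rejection) if that maximum is not confident enough.
    """
    best = int(np.argmax(scores))
    if scores[best] <= reject_threshold:
        return None  # reject: no class is confident enough
    return best
```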

More formally, let us suppose we are given a graph G and let \(V_{\varvec G}\) be the set of nodes in G. Each edge of the graph is assigned a weight representing the strength of the connection between the corresponding nodes. In particular, given two nodes \(u, v \in V_{\varvec G}\), \(w_{uv} \ge 0\) indicates the weight of the connection between u and v; \(w_{uv} = 0\) is equivalent to not having a connection between the nodes. In the considered framework, the connection weights define the strength of the nodes’ similarity (i.e., a higher weight value should imply a higher similarity between the two connected nodes). In fact, if the weights are computed using the procedure proposed in Sect. 2, their value is related to the observed correlation between the considered nodes.

The graph G can be represented via its adjacency matrix W, whose \((u,v)\)-th element is equal to \(w_{uv}\). Let y be a vector of size \(|V_{\varvec G}|\) having in the u-th position the target classification score of node u if it is labeled and 0 otherwise. Discrete domain regularization estimates the function \(f(v), v \in V_{\varvec G}\), by computing a score value for each node in \(V_{\varvec G}\), in order to yield a good fitting of the target vector on the labeled nodes with “smooth” variations over the graph connections. In particular, the learning problem determines the vector \({\varvec f}^\star\) minimizing the following cost functional,

$$ {\varvec C}_{{\varvec G}}[{\varvec f}] = {\frac{1} {2}} {\big\| {\varvec f} - {\varvec y} \big\|}^2 + \lambda {{\varvec f}}^T {\varvec R}_{{\varvec G}} {\varvec f} $$
(3)

where \({\varvec R}_{{\varvec G}}\) is a regularization matrix defined to penalize non-smooth solutions, y is the vector of target scores, whose non-supervised entries are set to 0, and \(0 \le \lambda \le 1\) is a constant determining the trade-off between regularization and error over the training nodes.

The optimal solution minimizing (3) can be computed by finding its stationary points, obtained by solving

$$ \nabla_{{\varvec f} } {\varvec C}_{{\varvec G}}[{\varvec f}] = ({\varvec f} - {\varvec y}) + \lambda {\varvec R}_{{\varvec G}} {\varvec f} = 0 . $$

If \(({\varvec I} + \lambda {\varvec R}_{{\varvec G}})\) is invertible, \({\varvec f}^\star\) exists, is unique and is equal to

$$ {\varvec f}^\star = ({\varvec I} + \lambda {\varvec R}_{{\varvec G}})^{-1} {\varvec y}. $$
(4)

In order to provide a meaningful definition of \({\varvec R}_{{\varvec G}}\), we start from a regularization functional \({\varvec C}^R_{{\varvec G}}[{\varvec f}]\) that penalizes the distance between the values computed for pairs of connected nodes, weighted by the strength of the connection,

$$ {\varvec C}^R_{{\varvec G}}[{\varvec f}] = {\frac{1} {2}} \sum_{u=1}^{|V_{{\varvec G}}|} \sum_{v=1}^{|V_{{\varvec G}}|} w_{uv} (f_u - f_v)^2 $$
(5)

where \(f_u\) is the value of f for the u-th node. This functional favors functions assuming close values on nodes that are strongly connected.

Equation 5 can be rearranged as

$$ {\varvec C}^R_{{\varvec G}}[{\varvec f}] = {\frac{1} {2}} \sum_{u=1}^{|V_{{\varvec G}}|} \sum_{v=1}^{|V_{{\varvec G}}|} {\frac{(w_{uv} + w_{vu})} {2}} (f_u - f_v)^2 . $$
(6)

If we define \(\bar{w}_{uv} = {\frac{(w_{uv} + w_{vu})} {2}}\), (6) becomes

$$ \begin{aligned} {\varvec C}^R_{{\varvec G}}[{\varvec f}] &= {\frac{1} {2}} \sum_{u=1}^{|V_{{\varvec G}}|} \sum_{v=1}^{|V_{{\varvec G}}|} \bar{w}_{uv} (f_u - f_v)^2\\ &= {\frac{1} {2}} \sum_{u=1}^{|V_{{\varvec G}}|} \sum_{v=1}^{|V_{{\varvec G}}|} \bar{w}_{uv} f_u^2 + {\frac{1} {2}} \sum_{u=1}^{|V_{{\varvec G}}|} \sum_{v=1}^{|V_{{\varvec G}}|} \bar{w}_{uv} f_v^2 - \sum_{u=1}^{|V_{{\varvec G}}|} \sum_{v=1}^{|V_{{\varvec G}}|} \bar{w}_{uv} f_u f_v. \end{aligned} $$
(7)

The weights \(\bar{w}_{uv}\) are symmetric (\(\bar{w}_{uv} = \bar{w}_{vu} ~ \forall u,v\)), therefore it holds that

$$ \sum_{u=1}^{|V_{{\varvec G}}|} \sum_{v=1}^{|V_{{\varvec G}}|} \bar{w}_{uv} f_u^2 = \sum_{u=1}^{|V_{{\varvec G}}|} \sum_{v=1}^{|V_{{\varvec G}}|} \bar{w}_{uv} f_v^2. $$

Thus, we obtain

$$ {\varvec C}^R_{{\varvec G}}[{\varvec f}] = \sum_{u=1}^{|V_{{\varvec G}}|} \sum_{v=1}^{|V_{{\varvec G}}|} \bar{w}_{uv} f_u^2 - \sum_{u=1}^{|V_{{\varvec G}}|} \sum_{v=1}^{|V_{{\varvec G}}|} \bar{w}_{uv} f_u f_v $$
(8)

Let \(\bar{{\varvec W}}\) be a symmetric square matrix having \(\bar{w}_{uv}\) as its \((u,v)\)-th element and D be a diagonal matrix with its u-th element \(d_u\) equal to \(\sum_{v=1}^{|V_{{\varvec G}}|} \bar{w}_{uv}\); then the first and second terms of Eq. 8 can be expressed as \({\varvec f}^T {\varvec D} {\varvec f}\) and \({\varvec f}^T \bar{{\varvec W}} {\varvec f}\), respectively.

Therefore, \({\varvec C}^R_{{\varvec G}}[{\varvec f}]\) can be compactly rewritten as

$$ {\varvec C}^R_{{\varvec G}}[{\varvec f}] = {\varvec f}^T ({\varvec D} - \bar{{\varvec W}}) {\varvec f} $$

which expresses the regularizer in the form required by (3). Now, setting \({\varvec R}_{{\varvec G}} = {\varvec D} - \bar{{\varvec W}}\) into (4) allows us to compute the optimal score vector as

$$ {\varvec f}^\star = \left({\varvec I} + \lambda {\varvec D} - \lambda \bar{{\varvec W}} \right)^{-1} {\varvec y} . $$
(9)

Since \({\varvec{D}} - \bar{{\varvec W}}\) is diagonally dominant, \({\varvec I} + \lambda {\varvec D} - \lambda \bar{{\varvec W}}\) is also diagonally dominant and, therefore, invertible. Thus, the optimal solution \({\varvec f}^\star\) exists and is uniquely defined by the graph and the supervised vector of target scores.

Equation 9 requires the inversion of a square matrix whose size is equal to the number of nodes in the input graph. This graph can have billions of nodes in Web applications, and direct inversion can be intractable. However, if the largest eigenvalue of \(\lambda ({\varvec D} - \bar{{\varvec W}})\) lies inside the unit circle and \(\bar{{\varvec W}}\) is sparse, the solution can be efficiently found by solving the following iterative equation,

$$ {\varvec f} (t+1) = {\varvec y} + \lambda (\bar{{\varvec W}} - {\varvec D}) {\varvec f} (t). $$
(10)

Interestingly, this iterative equation represents a diffusion process of the labels through the graph.
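
A minimal sketch of this diffusion process, assuming a SciPy sparse weight matrix W and a target vector y built as described above, is the following (the names and the default value of λ are illustrative, not taken from the paper).

```python
import numpy as np
from scipy import sparse

def propagate_labels(W, y, lam=0.1, n_iter=5):
    """Sketch of the iterative diffusion of Eq. (10):
        f(t+1) = y + lam * (W_bar - D) f(t)
    W is the sparse weight matrix of the graph; y is the target-score
    vector with +1/-1 on the labeled nodes and 0 elsewhere.
    """
    # Symmetrize the weights: w_bar_uv = (w_uv + w_vu) / 2.
    W_bar = (W + W.T) * 0.5
    # D is diagonal with d_u = sum_v w_bar_uv; keep only its diagonal.
    d = np.asarray(W_bar.sum(axis=1)).ravel()
    f = y.astype(float).copy()
    for _ in range(n_iter):
        # (W_bar - D) f = W_bar f - d * f (elementwise product with d).
        f = y + lam * (W_bar @ f - d * f)
    return f
```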

6 Experimental results

The experiments have been carried out on the user logs released by AOL in 2006. The dataset is a sample of the search activity of 658,000 anonymized US-based users over a three-month period (March to May 2006). This has been estimated to be a sample of approximately 1.5% of the overall AOL users in the considered period. The dataset contains 4.8 million queries and 1.8 million URLs. Since we are aware of the privacy concerns raised by this dataset, the graph has been pruned to remove queries and documents that have been issued or selected by fewer than four users. This should remove personal queries and documents that could allow associating an anonymous user id with a real person. Furthermore, the graph has been pruned by removing all users that had issued fewer than three queries. The final graph contains 982,354 nodes, of which 646,603 represent queries, 130,502 documents and 205,249 users. Please note that this dataset is specific to a single language and country. When working with locale-heterogeneous Web logs, it may be more effective to create separate graphs for each locale and iteratively apply the proposed approach.

6.1 Discovering semantic query suggestions

This set of experiments aims at showing how the augmented data structure can be used to extract high quality query suggestions.

The Web logs have been used to build the following five graphs, where the connections are set using the methodology explained in Sect. 2. The first graph \({\varvec G}_{qq}\) considers only query refinements, i.e., \(\alpha_{QQ} = 1 - s\) while all the other α parameters are set to zero. This is equivalent to considering only the query nodes and the edges directly connecting them. The second graph \({\varvec G}_{qd}\) considers only document-query and query-document transitions: \(\alpha_{QD} = \alpha_{DQ} = 1 - s\), while \(\alpha_{QQ} = \alpha_{DU} = \alpha_{UD} = \alpha_{UQ} = \alpha_{QU} = 0\). The third graph \({\varvec G}_{qu}\) considers only query-user and user-query transitions: \(\alpha_{QU} = \alpha_{UQ} = 1 - s\), with the other α parameters set to zero. The fourth graph \({\varvec G}_{qqu}\) considers query-query, query-document and document-query connections: \(\alpha_{QQ} = \alpha_{QD} = {\frac{1-s} {2}}\), \(\alpha_{DQ} = 1 - s\) and the other α parameters set to zero. Finally, the fifth graph \({\varvec G}_{all}\) equally weights all the connections across different node types: \(\alpha_{QU} = \alpha_{QD} = \alpha_{QQ} = {\frac{1-s} {3}}\), \(\alpha_{DQ} = \alpha_{DU} = {\frac{1-s} {2}}\) and \(\alpha_{UQ} = \alpha_{UD} = {\frac{1-s} {2}}\). The self-connection parameter s has been set to 0.5 for all the graphs.
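
For reference, the five configurations can be summarized as the following (hypothetical) parameter table; α parameters not listed for a graph are set to zero.

```python
# Summary of the alpha settings of Sect. 6.1, with s = 0.5.
s = 0.5
graph_configs = {
    "G_qq":  {"QQ": 1 - s},
    "G_qd":  {"QD": 1 - s, "DQ": 1 - s},
    "G_qu":  {"QU": 1 - s, "UQ": 1 - s},
    "G_qqu": {"QQ": (1 - s) / 2, "QD": (1 - s) / 2, "DQ": 1 - s},
    "G_all": {"QU": (1 - s) / 3, "QD": (1 - s) / 3, "QQ": (1 - s) / 3,
              "DQ": (1 - s) / 2, "DU": (1 - s) / 2,
              "UQ": (1 - s) / 2, "UD": (1 - s) / 2},
}
```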

Let \({\varvec W}_{qq}, {\varvec W}_{qd}, {\varvec W}_{qu}, {\varvec W}_{qqu}, {\varvec W}_{all}\) indicate the adjacency matrices of \({\varvec G}_{qq}, {\varvec G}_{qd}, {\varvec G}_{qu}, {\varvec G}_{qqu}, {\varvec G}_{all},\) respectively.

Analysis of the accuracy provided by the graph sub-portions. Three propagation steps have been run on each sub-graph, yielding the matrices \({\varvec W}^3_{qq}, {\varvec W}^3_{qd}, {\varvec W}^3_{qu}, {\varvec W}^3_{qqu}, {\varvec W}^3_{all}\). The choice of only three iterations for the propagation step is motivated by the results reported in the following.

We randomly sampled 125 original queries that have been discovered to be similar to at least one other query with a score higher than 0.01 in at least one of the above matrices. For each selected query, the query itself and up to the 10 most similar queries according to each of \({\varvec W}^3_{qq}, {\varvec W}^3_{qd}, {\varvec W}^3_{qu}, {\varvec W}^3_{qqu}, {\varvec W}^3_{all}\) have been sent to a set of 4 human raters, who had been previously instructed on the desired rating policies and whose scores had been calibrated on validation data. The raters marked each query in each set as relevant or irrelevant according to the supposed search goal of the user. Given the high number of (original query, suggested query) pairs to score, each pair was scored by a single rater. However, a supervisor later scanned all the ratings and corrected them where needed (the rate of corrections was around 4%).

When considering only the sub-graph containing the query-document connections, the query discovery approach is similar to the one proposed in (Craswell and Szummer 2007). Therefore, we consider this result as a baseline.

Table 1 reports the precision scores when considering the top N ∈ {1, 3, 5, 10} query suggestions for each experiment. Since the set of all relevant suggestions of a query is unknown, it is not possible to measure recall directly. We instead used a pseudo-recall metric, which measures the percentage of queries for which N query suggestions are available. Table 2 reports the obtained values for this pseudo-recall metric for N ∈ {1, 3, 5} for each experiment.

Table 1 Precision@N for the query recommendations computed by performing a 3-step propagation for the different graph configurations
Table 2 Percentage of queries for which N query suggestions are available

The experimental results show that the query-query connections are the most powerful, closely followed by the query-document ones. User-query transitions are less useful in discovering relevant query suggestions, but they still help in finding some. As expected, most of the gains come from the combination of the query-query and query-document connections. This combination significantly improves what can be achieved by considering a single set of connections. However, the precision and recall of the diffusion schema can be further improved by using the complete graph, which includes all the available information at the same time. This shows that the single portions of the graph are partially orthogonal in terms of the discoverable suggestions. Therefore, even the query-user connections, which are weaker when considered in isolation, play an important role in increasing the precision and recall of the overall results. This result is particularly interesting, since previous work in the literature focused on a few single sub-portions of the Web log graph. According to these results, only query refinements appear to be a very strong signal in isolation. Every other portion of the graph provides a weaker signal, which can still be successfully exploited by cross-reinforcing it with the signals coming from the other portions.

Table 3 reports the query recommendations assigned a score above 0.01 in any of the different sub-graphs for a small set of user queries. This small set of examples shows how the proposed methodology is able to extract relevant suggestions. In particular, some recommendations show a strong semantic relation with the initial query and it would be hard to extract them without using the collective feedback of the users. Most of the highly semantically related queries come from the query-query portion. Query-document connections provide other relevant recommendations, which are often different from those provided by the query-query connections. As shown by the “mapquest” example query, performing graph propagation over the entire graph is not equivalent to an a-posteriori merging of the results of the propagation over the single sub-portions. A recommendation can be reinforced by signal propagation across paths crossing multiple different sub-portions, making the process highly non-linear. This means that a recommendation can get a small score when considering any single connection type, while it gets a higher positive score when propagating over the entire graph.

Table 3 Query recommendations that have been scored above 0.01 by the diffusion algorithm applied over the complete graph or over one or more of the subgraphs

Analysis of the obtained accuracy when varying the propagation steps. A second set of experiments studies how the precision and recall provided by the system vary when changing the number of propagation steps. This set of experiments was carried out on the complete graph containing all the available information. First, the matrices \({\varvec W}^n_{all}\) with n = 2, 3, 4, 6 have been computed. A set of randomly sampled queries and the corresponding extracted similarity ranks have been sent to a set of human raters (using the same rating methodology previously explained). These scores have been used to determine the precision of the obtained query suggestions. Recall was measured using the same methodology explained in the previous paragraph.

Tables 4 and 5 show the Precision@N and the percentage of queries for which N suggestions are available for different numbers of propagation steps. As expected, a 2-step propagation provides high precision but a relatively lower recall. On the other hand, the precision scores steeply decrease when propagating for 4 or more steps. Employing a 3-step propagation provides a very good trade-off between precision and recall. These results suggest that a limited number of steps should be employed for all clustering applications.

Table 4 Precision@N scores of the discovered query suggestions for a different number of propagation steps over the complete graph
Table 5 The percentage of queries for which N query suggestions are available for a different number of propagation steps

Query Suggestion Categories. Query suggestions can be subdivided into the following categories:

  • Generalization. The suggestion denotes a more general concept than the original query. In this case, the query is a hyponym of the suggestion. For example “walmart” → “department store” is a generalization.

  • Specification. The suggestion denotes a more specific concept than the original query. In this case, the query is a hypernym of the suggestion. For example, “walmart” → “walmart special sales” is a specification.

  • Related. The suggestion delivers a concept that is related to the original query and has the same level of generality. For example, “walmart” → “target”.

  • Stemming-based suggestions. The suggestion is a different linguistic morphology of the same canonical form. For example, “table” → “tables” or “tables” → “table”.

  • Equivalence. The suggestion delivers the same semantic meaning of the original query and it is only another way to write the same concept delivered by the original query. For single terms, these rewrites are called synonyms. For example, “car” → “auto” can be assumed to be two equivalent queries.

  • Spell-correction. The suggestion is a spell-corrected form of the original query. For example, “disneychannle” → “disneychannel” belongs to this category.

Multiple suggestion categories can apply to a single suggestion. For example, “ups tracking” → “fedex” combines a generalization with a related query, while “ups traking” → “fedex tracking” is a spell correction of a related query.

All the above suggestion types can be useful and relevant depending on the context. In particular, suggestions in the “related” category are often regarded as particularly interesting. Indeed, suggestions in this category are often very useful for the users, but also very hard to discover, since they require understanding the semantics of the query and of the associated suggestions.

Figure 2 shows the distribution by category of the relevant suggestions that are discovered when propagating over each single sub-portion of the graph and when considering all the available data at the same time (all). Suggestions in the equivalence class are quite common for all methods, followed by suggestions in the specification and spell-correction classes. Stemming-based rewrites appear to be rarer in our data. It is interesting to note that the query-query connections are a great source of suggestions in the related category.

Fig. 1 A tri-partite graph with queries, documents and users: a) queries and documents are connected by document selections, b) users and queries by searches, c) users and documents by visits, d) a query points to another query if the latter is a user refinement of the first query

Fig. 2 The distribution by type of the relevant query suggestions when using the entire graph (all) or only query-query, query-document and query-user transitions

6.2 Web page categorization by topic

Documents on a given topic T tend to be connected to the users interested in T and to the queries issued for information requests about T. Therefore, assuming that the category of a few entities (nodes) on the graph is known, it should be possible to exploit the correlations among sets of connected nodes to extrapolate the category of other unlabeled entities. In this section, we show how the structured representation of the Web logs presented in the previous sections can be used to improve classical categorization by topic, which is commonly based purely on page content. Classification is performed using the methodology described in Sect. 5.

We are not aware of any supervised categorization dataset built on top of the AOL data; we thus built our own datasets using the following methodology, which aims at minimizing the amount of manual labeling of the data:

  • All the documents in the dataset have been downloaded from the Internet. Since the AOL dataset is from 2006, many pages were no longer available at the time of download (February 2010) and have been discarded, as well as any page containing fewer than 10 terms (for which categorization by content would be negatively biased, not being able to work with good accuracy). The final set of valid downloaded documents contained 118,605 pages.

  • A set of “unambiguous” keywords has been selected to describe various categories such as soccer, sport, porn, tennis, etc. By unambiguous, we mean that the presence of the keyword in the URL is almost a perfect discriminant (for the AOL data) that the page is about the topic. This means that matching the keywords against the URLs introduces no, or only a very small number of, false positives. For example, the keyword “soccer” has been selected for the category soccer, and “xxx” and “porn” for the category porn. The number of selected documents for a category is typically in the order of a few hundred. For example, the number of selected documents for the categories soccer, sport, porn and tennis is 185, 689, 314 and 97, respectively. A human expert manually scanned the positive examples and found around 5 false positives across all categories. This small number of false positives is due to a careful selection of the keywords representing the categories, which have been chosen trying to avoid semantic conflicts. The detected false positives have been removed from the dataset, even if their small number was not introducing any significant noise.

  • A sample of 1% of the overall number of documents has been selected as negative examples, and manually reviewed to discard the documents belonging to any of the selected categories. This is the only step of the dataset construction requiring some manual labeling.

  • A labeled dataset of pages has been compiled for each category by adding the documents containing at least one keyword for the category in the URL as positive examples, and the documents collected in the random sample as negative examples.

  • The URLs are discarded from the data representation of the documents and never used in the following evaluations.

  • The labeled dataset for each category is randomly split into a training and a test set of equal size. This step is repeated 20 times to get multiple datasets to perform random sub-sampling crossvalidation.

While we are aware that the labeling procedure described above can introduce some bias in the selection of the training data, it has the advantage of allowing us to build an arbitrarily large number of training and test sets with a limited amount of manual labeling.

Each labeled training dataset has been used to train an SVM classifier (Scholkopf and Smola 2001) based on a linear kernel and processing the bag-of-words representation of the pages, containing plain term frequency (tf) features. The bag-of-words representation has been extracted using the HTML parsing library that is part of the open source lynx browser. The SVM classifier has been implemented using the SVMlib software library. We decided to employ an SVM classifier in our experiments, since it has been shown to provide state-of-the-art results on text categorization tasks with little tuning (Joachims 1998).

The graph regularization scheme defined in Sect. 5 is applied over the stochastic matrix W, whose construction has been defined in Sect. 2.5. Since W is stochastic, its largest eigenvalue is equal to 1 (Seneta 2006). If \(\lambda \le 1\), the iterative equation (10) can be efficiently used to solve the minimization problem. This is indeed the optimization schema used to compute the experimental results presented in this section.

For each category and crossvalidation fold, a labeled vector y has been constructed starting from the corresponding training dataset. For each labeled vector, graph diffusion has been performed using (10) for 5 iterations. Even if 5 steps are usually not enough to reach convergence, no consistent accuracy gain was observed by increasing the number of iterations and, therefore, we interrupted the process early to save computational resources.

Given the representation of an input document x, let \(O_{SVM}({\varvec x})\) and \(O_{PROP}({\varvec x})\) be the output of the trained SVM classifier and of the graph propagation (PROP), respectively. The classification for the SVM+PROP classifier was trivially performed by tagging a page as belonging to a category if at least one of the classifiers returns a positive value: \((O_{SVM}({\varvec x}) > 0) \lor (O_{PROP}({\varvec x}) > 0)\).
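
In code, this combination rule is a one-liner (a hypothetical sketch, not the authors’ implementation):

```python
def combined_prediction(svm_score, prop_score):
    """SVM+PROP ensemble rule: a page is assigned to the category if
    either classifier returns a positive score."""
    return (svm_score > 0) or (prop_score > 0)
```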

Table 6 compares the classification accuracy for various text categories as an average over the 20 crossvalidation sub-samples. The approach integrating SVM and graph propagation consistently improves the baseline provided by the SVM classifier for all the tested categories. In particular, the accuracy gain of the PROP+SVM classifier versus the SVM classifier across the folds is statistically significant with at least 95% confidence for the categories “porn”, “soccer”, “movies”, “tennis”, “travel” and “casinos”. Table 7 reports the precision and recall for the positive class obtained when using the SVM+PROP and SVM classifiers. The SVM+PROP classifiers show an increase of the average recall over the SVM ones for all classes. In particular, the recall gains are statistically significant for the classes “porn”, “soccer” and “casinos”. The SVM+PROP classifiers also feature a small increase in precision for most of the classes, but it is never statistically significant.

Table 6 Comparison of the classification accuracy (in percentage), when using an SVM classifier based on the page content
Table 7 Comparison of the precision/recall values (in percentage) for the positive class, when using an SVM classifier based on the page content

The PROP classifiers provide a lower accuracy than the corresponding SVM classifiers. However, this is mainly due to their low recall, as shown by Table 7. On the other hand, their precision scores for the positive class are consistently very high, meaning that they very rarely tag a document as belonging to a class when the document does not really belong to that class. This explains why the ensemble classifier, merging the outputs of the SVM and PROP classifiers using a simple OR, outperforms the single underlying classifiers. Indeed, the ensemble classifier relies on the SVM classifier most of the time, but it is able to recover some mistakes by trusting the PROP classifier when it triggers.

Table 7 also provides some insights about the results shown in Table 6. It shows that the previously reported accuracy gains mainly come from an increase in recall provided by the SVM+PROP classifier over the SVM one. Indeed, the SVM+PROP approach is able to classify some pages with little or misleading textual content, which are common on the Web and for which the SVM classifier performs poorly.

7 Conclusions and future work

This paper presents a graphical representation of the collective search tasks performed by the users. The representation can be directly extracted from the Web logs of Intranet HTTP proxies, Internet providers, or search engines. This representation improves on what has previously been proposed in the literature because it models searches, user visits and query refinements in one comprehensive data structure. The paper studies how it is possible to process the graph in Web mining applications defining either a supervised or an unsupervised learning task. Unsupervised tasks can be approached using a diffusion algorithm based on a Markov process that allows us to detect groups of related entities (either queries, documents or users). Supervised learning tasks can be tackled using a regularization schema defined over discrete domains.

The experimental results presented in this paper concern the tasks of discovering semantically relevant query suggestions and Web document categorization. For the query suggestion task, the experiments measure the accuracy provided by the single sub-portions of the graph. In particular, the portion of the graph modeling query refinements (query-query connections) has proved to be the most powerful for the studied task. The portion including document-query links closely follows, while document-user connections are less useful for this application. Whereas previous work in the literature focused on a subset of the available information in the Web logs, the experiments presented in this paper show that it is possible to significantly improve the quality and coverage of the query suggestions by using all the available information at the same time.

For the document categorization task, the experimental results show how it is possible to improve content-based state-of-the-art categorization methods. In particular, this improvement can be obtained via a simple combination of the output of the content-based text categorizer with the output of the propagation schema defined by the graph regularization framework.

The data structure could also be enriched by integrating other information that, while external to the Web logs, is commonly available to search engines. For example, we plan to study the integration of the Web graph into the proposed Web log graph. The Web graph represents the known HTML links between pairs of pages and, while noisy, it could help spread the information available in the Web logs about popular resources (which are covered by the Web logs with high recall) to more exotic pages. We plan to study this extension in the future, together with other Web mining applications like document ranking and user/document clustering. Toolbar logs can track the behavior of the users during the entire browsing session and could also be directly integrated as user-to-document and/or document-to-document transitions. Another extension we plan to study is to integrate content-based information from classical IR, like query similarities based on edit distances or document pair-wise cosine similarities, as weighted connections on the graph. This would allow us to process content and behavioral information in a unified way.