Incorporating web browsing activities into anchor texts for web search

Zhou, Bo; Liu, Yiqun; Zhang, Min; Jin, Yijiang; Ma, Shaoping

doi:10.1007/s10791-010-9151-7

Incorporating web browsing activities into anchor texts for web search

Web Mining for Search
Published: 28 November 2010

Volume 14, pages 290–314, (2011)
Cite this article

Download PDF

Information Retrieval Aims and scope Submit manuscript

Incorporating web browsing activities into anchor texts for web search

Download PDF

Bo Zhou¹,
Yiqun Liu¹,
Min Zhang¹,
Yijiang Jin¹ &
…
Shaoping Ma¹

346 Accesses
4 Citations
Explore all metrics

Abstract

Anchor texts complement Web page content and have been used extensively in commercial Web search engines. Existing methods for anchor text weighting rely on the hyperlink information which is created by page content editors. Since anchor texts are created to help user browse the Web, browsing behavior of Web users may also provide useful or complementary information for anchor text weighting. In this paper, we discuss the possibility and effectiveness of incorporating browsing activities of Web users into anchor texts for Web search. We first make an analysis on the effectiveness of anchor texts with browsing activities. And then we propose two new anchor models which incorporate browsing activities. To deal with the data sparseness problem of user-clicked anchor texts, two features of user’s browsing behavior are explored and analyzed. Based on these features, a smoothing method for the new anchor models is proposed. Experimental results show that by incorporating browsing activities the new anchor models outperform the state-of-art anchor models which use only the hyperlink information. This study demonstrates the benefits of Web browsing activities to affect anchor text weighting.

Modeling user interests from web browsing activities

Article 01 November 2016

Fabio Gasparetti

Accessing Information with Tags: Search and Ranking

The Research on Webpage Ranking Algorithm Based on Topic-Expert Documents

1 Introduction

Since anchor texts are initially created to help users navigate from one Web page to another, they usually provide short complementary descriptions to their destination Web pages. On the other hand, anchor texts resemble real-world user queries in terms of vocabulary distribution and length (Eiron and McCurley 2003), for example, they are both short and descriptive, which makes anchor texts have a better chance to match user queries than the content words of a Web page. These two reasons have motivated the extensive application of anchor texts in commercial search engines (Brin and Page 1998). Experiments have also shown their capability of improving Web search effectiveness (Craswell et al. 2001; Dou et al. 2009; Eiron and McCurley 2003).

Although many methods have been proposed to use (weight) anchor texts in Web search, these methods usually exploit the hyperlink information, i.e., the information provided by Web page editors. For example, the number of an anchor text from different Web pages is considered as a relevance signal of the destination page with respect to a query matching the anchor text (Dou et al. 2009). Since anchor data are initially created to help users browse the Web, user’s browsing activities may provide useful or complementary information for anchor text weighting. For instance, the number of visits on an anchor text by different Web users may also be assumed as a relevance signal of the destination page with respect to a query matching the anchor text. This is based on the fact that user’s browsing behavior is affected by anchor texts. Users tend to click anchor texts able to navigate them to ideal destination pages during Web browsing. Misleading anchor texts may attract clicks from few users but are not likely to attract clicks from majority of Web users. Even if a misleading anchor text has been clicked by a Web user, the user is not likely to click it again.

In this paper, we attempt to address the possibility and effectiveness of incorporating Web browsing activities into anchor texts for Web search. To the best of our knowledge, this is the first such effort. We first make an analysis on the effectiveness of anchor texts with Web browsing activities. Next we propose two new anchor models incorporating browsing activities, i.e., the anchor model aggregating browsing activities from different pages and the anchor model which aggregates browsing activities from different sites. Similar to other anchor models, the new anchor models also suffer from the anchor text sparseness problem (Metzler et al. 2009). To deal with it, we explore and analyze two features of user’s Web browsing behavior on a Web page, i.e., the entropy of browsing users and the entropy of browsing anchors. Based on these features, a smoothing method for the new anchor models is proposed. In experiments, we compare the new anchor models with the state-of-art anchor models (Craswell et al. 2001; Dou et al. 2009; Metzler et al. 2009). Experimental results show that by incorporating browsing information the new anchor models outperform the baseline models using only the hyperlink information. This paper demonstrates the value of Web browsing activities to affect anchor texts weighting.

The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 describes some basic knowledge on the data used in this paper. In Sect. 4, we propose the new anchor models and the smoothing method. We report experimental results in Sect. 5 and conclude our work in Sect. 6.

2 Related work

Anchor text has been extensively explored and used in the area of Information Retrieval. Craswell et al. (2001) first propose the method of using anchor text for site finding. All anchor texts pointing to a destination Web page are collected as a descriptive anchor document. They discover that anchor document is especially useful for navigational queries. Eiron and McCurley (2003) make a statistical analysis on anchor texts by comparing anchor texts of a Web page with its title and content. They find that anchor texts are similar not only to titles of their destination pages but also to practical search queries. Westerveld et al. (2001) use a similar anchor document like Craswell et al. (2001) but they use language model (Ponte and Croft 1998) for anchor document retrieval instead of Okapi BM25 Model (Robertson et al. 1996). In these work, weight of each anchor text is defined as its frequency in an anchor document. Dou et al. (2009) define the weight of anchor text in an anchor document in a different way. They consider relationship between hyperlinks from different websites, which make use of the hyperlink structure. Metzler et al. (2009) build enriched anchor documents with anchor texts that have been aggregated across the hyperlink graph. Following these studies, we also use anchor documents for anchor text retrieval. Different from previous studies, Web browsing activities have been incorporated into anchor documents.

Anchor text has been used in other Web search related areas. In cross-language IR, anchor texts are considered as parallel texts for hyperlinks to the same Web page (Lu et al. 2004). For query classification, a method of using link distribution of anchor texts across destination pages is proposed in (Lee et al. 2005) to discriminate navigational queries from informational queries. This approach is further improved in (Fujii 2008) by using the distribution of query terms to deal with the situation in which the queries do not match the whole anchor texts. Anchor text has also been mined for query refinement in (Kraft and Zien 2004). They also find that anchor text can augment query refinement result produced by query log. Amitay and Paris (2000) utilize the structure of hypertext and the way people describe information in it to summarize websites automatically.

Browsing activities of Web users have been collected initially for user behavior analysis and prediction (Sarukkai 2000), and it has been used further in recommender algorithms for e-commerce (Sarwar et al. 2000). Recently, there are several studies on the utilization of browsing activities in Web search. Bilenko and White (2008) identify relevant information sources from the history of combined searching and browsing behavior of many Web users. Liu et al. (2008a, b) show that BrowseRank algorithm based on browsing information works better than hyperlink analysis algorithm such as PageRank. Although these studies use browsing behavior to improve retrieval, (Bilenko and White 2008) focus on combining post-search browsing behavior with user’s interaction with search engines, and Liu et al. (2008a, b) use Web browsing behavior to estimate Web page importance.

However, all these studies above haven’t taken browsing activities into account when using associated anchor texts to improve Web search. When users browse the Web by clicking anchor texts, they tend to click “better” anchor texts able to navigate them to ideal destination pages. This makes anchor texts with browsing activities differ from anchor texts with only the hyperlink information. The latter is easy to be influenced by websites functions and commercial intentions (Dou et al. 2009; Gyöngyi and Garcia-Molina 2005). In this paper, we propose methods to incorporate browsing activities into anchor text weighting for Web search.

3 The data

In this section, we introduce some basic knowledge and details about the data used in this paper.

3.1 Anchor data

A hyperlink is a relationship between two Web pages or two parts of the same Web page. The source Web page is the one containing the link. For instance, the source Web page www.google.com would contain text such as:

$$ < {\text{a}}\,{\text{href}} = \hbox{``}{\text{http}}://{\text{images}}.{\text{google}} . {\text{com}}\hbox{''} > {\text{Images}} < /{\text{a}} > $$

in its HTML script. The destination Web page is the one referred to by the link, in this case http://www.image.google.com. The link’s anchor appears to the user in the source page with anchor text: Images. If the user selects the anchor, their browser will display the destination page. A source page usually links to one or more different destination pages using different anchor texts and a destination page can also be linked by several source pages using different anchor texts. If a Web page links to a destination page with the same anchor text more than once, it counts only once in this paper.

The anchor data are obtained from a large scale of Web page collection, which contains 135.4 million Web pages from 5.3 million Chinese Web sites, and the total storage size is 5.0 TBytes. This Web page collection is large enough for us to study the problem in a more realistic setting and to make our experimental result applicable for practical Web search. This Web page collection is completely open online.^{Footnote 1}

We obtain 6.9 billion anchors from the Web page collection. Since there exists a large amount of noise such as “the previous one” “next”, “and more” etc. in original anchor texts, we ran a preprocessing process to filter out these less informative anchor texts. After preprocessing, we obtain 6.7 million distinct and more informative anchor texts. Note that one distinct anchor text usually links to many different destination pages.

3.2 Web browsing data

With the development of search engines, Web browser toolbars become more and more popular recently. In order to provide value-added services to users, most toolbar services collect anonymous click-through information from users’ browsing behavior. Previous work such as (Bilenko and White 2008) adopted this kind of click-through information to improve ranking performance. Our previous work (Liu et al. 2008a; Yiqun and Liyun 2006) proposed a Web spam identification algorithm based on this kind of user behavior data. In this paper, we also adopt Web browsing logs collected by search toolbar because this kind of data sources collect user behavior information at low cost without interrupting users’ browsing behavior. A browsing event recorded in these logs is shown in Table 1.

Table 1 A browsing event recorded in Web browsing logs

Full size table

If a user directly input a destination URL instead of clicking an anchor text on a source page, the “Source URL” item is set as “NULL”. From Table 1 we can see that no privacy information is included in the log data. The information shown can be easily recorded using browser toolbars by commercial search engine systems. Therefore it is practical and feasible to obtain these types of information.

In this paper we define a user session as a logical unit of user’s Web browsing. We use two common rules to segment user sessions (White et al. 2007; Liu et al. 2008a, b). First, if the time of the current event is 30 min behind that of the previous event, then the current event is regarded as the start of a new user session; otherwise, if the “Source URL” of the record is “NULL”, then we regard the current event as the start of a new user session.

With the help of a widely-used commercial Chinese search engine, Web browsing logs of 1 year were collected from the year 2008. Over 9.4 billion browsing events on 12.5 million Web sites were recorded in these logs.

3.3 Query logs

Query logs of search engines usually record not only users’ queries to search engines but also user’s interactions with search engines (e.g. URL clicked by users in search results). In this paper, we need query logs to analyze the effectiveness of anchor texts with browsing activities in Sect. 4.2.

We use query logs of a widely used Chinese commercial search engine. The query logs of 9 months from Jan. 1st, 2008 to Sep. 30th, 2008 are used in this paper. In total, we use over 4.3 million unique queries and over 45 million individual interactions with the search engine.

4 Anchor models

When anchor texts are used for Web search, building an anchor document^{Footnote 2} for the target Web page is reported to be effective (Craswell et al. 2001; Westerveld et al. 2001; Dou et al. 2009). We follow these studies to construct an anchor document for the target Web page. In this section, we first present general anchor document representation. Next we propose new anchor models incorporating Web browsing activities. Finally the smoothing method for the proposed anchor models is proposed.

4.1 General anchor document representation

We formalize anchor texts and their target Web pages as a weighted matrix shown in Fig. 1, in which, a _i represents an anchor text and d _j represents a target page (m and n equal to the total number of distinct anchor texts and Web pages respectively). w(a _i, d _j) represents weight assigned to a _i and d _j pair. The column vector <w(a ₁, d _j), w(a ₂, d _j), …, w(a _m, d _j)> in the matrix is regarded as the anchor document (representation) of the page d _j.

The definition of w(a _i, d _j) is important. Different anchor models have different definitions for w(a _i, d _j), which correspond to different types of anchor document representations. Craswell et al. (2001) and Westerveld et al. (2001) define w(a _i, d _j) as the count of hyperlinks to target page d _j with anchor text a _i. For example, if there are 1,000 different source pages containing hyperlinks to the destination page http://www.acm.org with the anchor text “ACM”, weight assigned to http://www.acm.org and “ACM” pair is 1,000. Dou et al. (2009) use hyperlink information at site level to define w(a _i, d _j), meanwhile the definition has taken account of the relationship among different sites. Two kinds of anchor document representations both make use of the hyperlink information. In order to improve ranking accuracy of anchor document representation, in this paper, we try to define a different kind of anchor document representation which makes a better estimation of w(a _i, d _j) by incorporating Web browsing activities. The anchor document representations proposed by Craswell et al. (2001) and Dou et al. in (2009) are used as baseline anchor models.

After an anchor document is constructed for its target page, different ranking methods, such as BM25 model (Robertson et al. 1996) and the language model (Ponte and Croft 1998), can be used to retrieve the anchor documents and to evaluate the relevance between user queries and anchor documents. In this paper, we use the same BM25 model as in (Craswell et al. 2001; Dou et al. 2009) to rank anchor documents.

Dou et al. (2009) has reported another effective ranking method for anchor document representations. When a search query q exactly matches an anchor text a _i, the transition probability

$$ P(d_{j} |a_{i} ) = {\frac{{w(a_{i} ,d_{j} )}}{{\sum\limits_{1 \le k \le n} {w(a_{i} ,d_{k} )} }}} $$

(1)

also reflects the relevance of query q to page d _j. In this paper, when a query exactly matches an anchor text, both the transition probability and BM25 score are used to evaluate relevance of a query to an anchor document; otherwise, only BM25 score is used.

4.2 Analysis on anchor texts with browsing activities

Eiron and McCurley (2003) conclude that the main reason for the effectiveness of anchor texts in Web search is that they resemble real-world search queries in terms of term distribution and length. Thus, if we can validate that anchor texts with browsing activities are more similar to search queries than other anchor texts, it’s reasonable to expect that incorporating browsing activities is able to provide better anchor texts weighting mechanism for Web search.

We first construct two kinds of anchor document for each Web page, i.e., an anchor document constructed using raw anchor data (all hyperlinks and associated anchor texts, RawAncDoc in short), and an anchor document constructed using anchor texts with browsing activities, ClkAncDoc in short (ClkAncDoc is RawAncDoc after eliminating those anchors without clicks). Next, we construct a query document for each Web page, which consists of search queries Web users submit to search engines and then click the Web page, QueryDoc in short (the search query and Web page pair are extracted from query logs described in Sect. 3.3). If the term frequency distribution of ClkAncDoc is more similar to the distribution of QueryDoc than RawAncDoc, we have reason to believe incorporating browsing activities is able to provide better anchor texts weighting. In this paper, we use Kullback–Leibler (KL) divergence with Laplace smoothing method to measure similarity between two term frequency distributions. The formula for the KL divergence is:

$$ {\text{KL}}(p_{q} ||p_{a} ) = \sum\limits_{{\omega \in V_{q} }} {p_{q} (\omega )\log {\frac{{p_{q} (\omega )}}{{p_{a} (\omega )}}}} $$

where p _q and p _a are the relative frequencies of term ω in the query and anchor document distributions respectively. V _q represents the vocabulary set of query document. In this ordering, the query distribution is considered to be the “true” distribution and the anchor document distribution is seen as an approximation to it. The relative frequency is calculated as follows:

$$ p_{q} (\omega ) = {\frac{{tf_{q} (\omega )}}{{\sum\limits_{{\omega \in V_{q} }} {tf_{q} (\omega )} }}} $$

where tf _q(ω) is the number of occurrences of term ω in the query document for a particular URL. Since the KL divergence becomes infinite whenever any term in the query document vocabulary but not in the anchor document vocabulary, we replace the relative frequency by a Laplace smoothed estimate p _a′(ω). By performing Laplace smoothing over all terms in the union of the vocabularies we guarantee that the KL divergence will be finite:

$$ p_{a}^{\prime } (\omega ) = {\frac{{tf_{a} (\omega )}}{{|V_{q} \cup V_{a} | + \sum\limits_{{\omega \in V_{a} }} {tf_{a} (\omega )} }}} $$

This smoothing method introduces no additional smoothing coefficient parameters.

Figure 2 shows the KL divergence between queries and anchor texts for 1,000 randomly sampled URLs; The URLs have been ordered from most similar to least. KL(p _q||p _ra) represents KL divergence between queries and raw anchor texts; KL(p _q||p _ca) represents KL divergence between queries and anchor texts with user clicks. This plot tells us how well the smoothed anchor document term distributions do at describing the term frequencies observed in the query data. Higher values on this graph indicate lower similarity between queries and anchor texts. The fact that KL(p _q||p _ra) series lies well above KL(p _q||p _ca) indicates that anchor texts with user clicks are more similar to search queries. Based on this fact, it’s reasonable to expect that incorporating browsing activities is able to provide better anchor texts weighting mechanism for Web search.

4.3 Incorporating browsing activities

In this section, we propose two new anchor models incorporating browsing activities. The first model aggregates user’s browsing activities from different pages and is abbreviated as UPM (U: User, P: Page, and M: Model); the second one aggregates browsing activities from different sites and is abbreviated as USM (S: Site). Two models have different definitions for w(a, d _t) in the general anchor document representation described in Sect. 4.1. The weight and transition probability in UPM are denoted as w _up(a, d _t) and p _up(d _t |a); The weight and transition probability in the USM are denoted as w _us(a, d _t) and p _us(d _t |a). For convenience, some basic definitions and symbols used in the remaining sections are pre-defined in Table 2.

Table 2 Basic definitions and symbols used in this paper

Full size table

4.3.1 Page-level aggregation (UPM)

To describe the anchor model incorporating browsing activities, we first build the Web browsing graph G _u(N, E _u) using Web browsing data mentioned in Sect. 3.2. The Web browsing graph is constructed according to the definition of G _u(N, E _u) in Table 2.

The weight w _up(a, d _t) in UPM between anchor text a and target page d _t is defined as follows.

$$ w_{up} (a,d_{t} ) = \sum\limits_{{d_{s} \in N}} {\sum\limits_{u \in U} {\tau ( < u,a,d_{s} ,d_{t} > \in E_{u} )} } $$

(2)

where, τ(cond.) is the indicative function defined in Table 2; <u, a, d _s, d _t> is a Web browsing event defined in Table 2; w _up(a, d _t) eventually equals to the total number of browsing events from different source pages to target page d _t with anchor text a. p _up(d _t|a) can be obtained accordingly based on the definition of (Eq. 1) described in Sect. 4.1.

Figure 3a illustrates an example of Web browsing data; five Web browsing events are listed in tetrads. For ease of explanation, only one distinct anchor text a is involved in this example. In Fig. 3b, each inverse triangle represents anchor text a on a source page; each circle node represents a target page. Meanwhile an arrow represents a hyperlink associated with anchor text a. The number (weight) associated with an arrow represents the count of browsing events navigated from a source page to a target page with anchor text a. For example, in two browsing events <u ₁, a, d _s1, d ₁> and <u ₂ , a, d _s1 , d ₁>, users click anchor text a on source page d _s1 and navigated to target page d ₁; and hence the weight of arrow <d _s1 , d ₁> equals to 2. Meanwhile, there is no user visit on the anchor text a on source page d _s2 and d _s3, so the corresponding weights of <d _s2 , d ₁> and <d _s3 , d ₁> all equal to 0. According to the definition of w _up(a, d _t), w _up(a, d ₁) = 2, w _up(a, d ₂) = 3, Fig. 3c shows the transition probability for different target pages given the anchor text a: p _up(d ₁|a) = 2/5 = 0.4; p _up(d ₂|a) = 3/5 = 0.6.

Note that, in this model, multiple Web browsing events initiated in one user session from the same source page to the same target page with the same anchor text are calculated only once. This strategy is used because of the fact that the multiple browsing events initiated in one user session on the same anchor text is not a strong indication of its importance for its target page.

UPM based anchor documents^{Footnote 3} can be regarded as being constructed by a double election process. The page authors first make a list of target page candidates for an anchor text, and then Web users vote for them unconsciously during Web browsing. Thus, to a certain extent, UPM based anchor documents reflect both the recommendation of page content editors and the preference of Web users.

4.3.2 Site-level aggregation (USM)

In UPM, w _up(a, d _t) is defined as the number of Web browsing events from different source pages to target page d _t with anchor text a. UPM doesn’t consider whether the browsing events are from source pages of only one website or from source pages of many websites. We assume that browsing events with source pages from more sites are better than those from fewer sites. Thus, in this section, we propose USM, which takes account of browsing events at site level. The weight w _us(a, d _t) in USM between anchor text a and target page d _t is defined as follows.

$$ w_{us} (a,d_{t} ) = \sum\limits_{{{\text{site}}_{i} \in S^{*} }} {\left( {{\frac{{\sum\limits_{{d_{s} \in {\text{site}}_{i} }} {\sum\limits_{u \in U} {\tau ( < u,a,d_{s} ,d_{t} > \in E_{u} )} } }}{{\sum\limits_{{d_{s} \in {\text{site}}_{i} }} {\tau ( < a,d_{s} ,d_{t} > \in E_{h} )} }}}} \right)} $$

(3)

where, $ \sum\nolimits_{{d_{s} \in {\text{site}}_{i} }} {\sum\nolimits_{u \in U} {\tau ( < u,a,d_{s} ,d_{t} > \in E_{u} )} } $ counts the total number of browsing events from source pages of site_i to target page d _t with anchor text a; E _h represents the edge set of the hyperlink graph G _h(N, E _h) defined in Table 2; $ \sum\nolimits_{{d_{s} \in {\text{site}}_{i} }} {\tau ( < a,d_{s} ,d_{t} > \in E_{h} )} $ counts the number of web pages in site_i with anchor text a to target d _t. $ \left( {{\frac{{\sum\nolimits_{{d_{s} \in {\text{site}}_{i} }} {\sum\nolimits_{u \in U} {\tau ( < u,a,d_{s} ,d_{t} > \in E_{u} )} } }}{{\sum\nolimits_{{d_{s} \in {\text{site}}_{i} }} {\tau ( < a,d_{s} ,d_{t} > \in E_{h} )} }}}} \right) $ equals to the average number of browsing events from site_i to target d _t with anchor text a. We have tried various methods for aggregating browsing events from different pages of the same site, such as maximum, minimum etc. Mean is finally adopted for its simplicity and effectiveness. S* represents all websites. w _us(a, d _t) actually counts the average number of browsing events from different websites to the target page d _t with the anchor text a.

In Fig. 3, page d _s1 and page d _s2 are from the same web site₁, meanwhile d _s3 and d _s4 are respectively from site₂ and site₃. According to the definition of USM, w _us(a, d₁) = (2 + 0)/2 + 0/1 = 1, w _us(a, d ₂) = 3/1 = 3, and p _us(d ₁|a) = 1/4 = 0.25, p _us(d ₂|a) = 3/4 = 0.75.

4.4 Smoothing

4.4.1 The smoothing method

Similar to other anchor models (Metzler et al. 2009), UPM and USM also suffer from the data sparseness problem. According to our statistics, 25.3% web pages have empty UPM/USM based anchor documents. For many other pages, UPM/USM based anchor documents are very short. We refer to this as the sparseness problem of user-clicked anchor texts. The major cause of this sparseness problem is the fact that not all anchor texts are clicked by users. Many anchor texts on Web pages receive no user click. Some of these non-clicked anchor texts are irrelevant to their target pages, while some of them are with high quality and relevance to the target page.

In UPM and USM, we assume that anchor texts with user clicks may be useful, and we only take account of these anchor texts. However, anchor texts with clicks are sparse. To deal with the problem, we introduce another assumption that anchor texts on qualified Web pages (explained later) are also useful. Thus, in our smoothing method, we consider both anchor texts with user clicks and anchor texts on qualified pages.

Our smoothing method is to enrich UPM/USM based anchor documents with anchor texts on qualified pages. Pages are qualified and selected into the heuristic page set N _H if they satisfy a certain condition.

$$ N_{H} = \left\{ {d|Criterion(d) > \delta ,d \in N} \right\} $$

where, Criterion(d) is the criterion function; δ is the smoothing parameter which controls the number of pages included in N _H, i.e., the size of N _H. All anchor texts on pages in N _H are added into UPM/USM based anchor documents of their target pages. The weight for an anchor text on a page in N _H is set as 1. In this way, the UPM/USM based anchor documents are expanded with useful anchor texts.

This smoothing method can be regarded as the process: due to various factors, the votes of Web users are incomplete; we use some strategies to select some page content editors to make up for the missing votes.

After smoothing, w _up(a, d _t) in UPM (Eq. 2 ) becomes w _up′ (a, d _t), which is calculated as follows.

$$ \sum\limits_{{d_{s} \in N}} {\sum\limits_{u \in U} {\tau ( < u,a,d_{s} ,d_{t} > \in E_{u} )} + \sum\limits_{{d_{s} \in N_{H} }} {\tau ( < a,d_{s} ,d_{t} > \in E_{h} )} } $$

After smoothing, w _us(a, d _t) in USM (Eq. 3 ) becomes w _us′(a, d _t), which is calculated as follows.

$$ \sum\limits_{{{\text{site}}_{i} \in S^{*} }} {\left( {{\frac{{\sum\limits_{{d_{s} \in {\text{site}}_{i} }} {\sum\limits_{u \in U} {\tau ( < u,a,d_{s} ,d_{t} > \in E_{u} ) + } \sum\limits_{{d_{s} \in {\text{site}}_{i} \cap N_{H} }} {\tau ( < a,d_{s} ,d_{t} > \in E_{h} )} } }}{{\sum\limits_{{d_{s} \in {\text{site}}_{i} }} {\tau ( < a,d_{s} ,d_{t} > \in E_{h} )} }}}} \right)} $$

In Fig. 3, the solid nodes represent the pages in N _H. After smoothing, w _up′(a, d ₁) = 2 + (0 + 1) + (0 + 1) = 4, w _up′(a, d ₂) = 3+1 = 4; w _us′(a, d ₁) = (2 + (0 + 1))/2 + (0 + 1)/1 = 2.5; w _us′(a, d ₂) = (3 + 1)/1 = 4.

Now the problem becomes how to define the criterion function Criterion(d). Since we expect the anchor texts on the qualified pages are effective for Web search (this makes UPM/USM based anchor documents enriched with highly effective anchor texts), the criterion function should score the effectiveness of anchor texts on a Web page. We explore two features which correlate with the effectiveness of anchor texts. Based on these features, we define the criterion function.

4.4.2 Entropy of browsing users (BUE)

The first feature is called as Entropy of Browsing Users (BUE):

$$ {\text{BUE}}(d_{s} ) = - \sum\limits_{u \in U} {P(u|d_{s} ){ \log }\,P(u|d_{s} )} $$

in which, U represents all user sessions; P(u|d _s) represents the probability that the page d _s is visited by user session u.

$$ P(u|d_{s} ) = {\frac{{\sum\limits_{a \in A} {\sum\limits_{{d_{t} \in N}} {\tau ( < u,a,d_{s} ,d_{t} > \in E_{u} )} } }}{{\sum\limits_{{u_{i} \in U}} {\sum\limits_{a \in A} {\sum\limits_{{d_{t} \in N}} {\tau ( < u_{i} ,a,d_{s} ,d_{t} > \in E_{u} )} } } }}} $$

BUE measures the degree to which a page is prone to be visited by different users. As we will show next, BUE correlates with the effectiveness of anchor texts.

To investigate the relation between BUE and effectiveness of anchor texts, we first divide all pages into two page sets: S1 and S2, and make sure that: $ \forall d_{i} \in S1 $, $ \forall d_{j} \in S2 $, BUE(d _i ) > BUE(d _j ). Then we use anchor texts on pages of S1 and S2 to construct anchor documents BUEGtDoc and BUELtDoc respectively. Again we compare the KL divergence between BUEDoc and QueryDoc for every page. The formula for KL divergence and the smoothing method are the same as those described in Sect. 4.2. Figure 4 shows the distribution of KL Divergence: KL(p _q||p _buegt) of 76% pages are smaller than 1.5, meanwhile KL(p _q||p _buelt) of 56% pages are smaller than 1.5; KL(p _q||p _buegt) of 24% pages are greater than 1.5, meanwhile KL(p _q||p _buelt) of 44% pages are greater than 1.5. KL(p _q||p _buegt) is averagely smaller than KL(p _q||p _buelt), which indicates that anchor texts on pages with more different visiting users are more similar to search queries. In other words, anchor texts on pages with higher BUE value are more effective for Web search than other anchor texts.

4.4.3 Entropy of browsing anchors (BAE)

A Web page usually contains many anchors on it. But not every anchor is clicked by Web users, such as “Copyright”, “Contact Us”, and some advertising anchors etc. Although BUE score reflects whether a page is prone to be visited by different users, it can’t reflect all the anchor texts on this page are preferable. For example, the pages containing only 1 or 2 useful anchors and a large number of anchors with commercial intentions often get high BUE value because many different users visit the useful anchors. If we use BUE directly as the criterion function and choose it, a large number of commercial intentional anchors will also be added into UPM/USM based anchor documents. These pages are very common in practice.

The criterion function is expected to select pages which are not only visited by many different users but with many user clicked anchors. So we explore another feature, i.e., Entropy of Browsing Anchors (BAE), to reflect how user clicks scatter on different anchors of a page. The definition of BAE is as follows.

$$ {\text{BAE}}(d_{s} ) = - \sum\limits_{a \in A} {P(a|d_{s} ){ \log }\,P(a|d_{s} )} $$

in which, A represents all anchor texts; P(a|d _s) represents the probability that anchor text a on source page d _s is clicked by different user sessions.

$$ P(a|d_{s} ) = {\frac{{\sum\limits_{u \in U} {\sum\limits_{{d_{t} \in N}} {\tau ( < u,a,d_{s} ,d_{t} > \in E_{u} )} } }}{{\sum\limits_{{a_{i} \in A}} {\sum\limits_{u \in U} {\sum\limits_{{d_{t} \in N}} {\tau ( < u,a_{i} ,d_{s} ,d_{t} > \in E_{u} )} } } }}} $$

BAE measures the degree to which user clicks on a source page are scattered on different outgoing anchors of this source page. It focuses on links from a page rather than links to a page.

BAE also correlates with the effectiveness of anchor texts. The approach for BAE analysis is similar to the one for BUE described in Sect. 4.4.2. Figure 5 shows that KL(p _q||p _baegt) is averagely smaller than KL(p _q||p _baelt). In other words, anchor texts on pages with higher BAE value are more effective for Web search than other anchor texts.

4.4.4 Criterion functions

As mentioned in Sects. 4.4.3 and 4.4.4, both BUE and BAE reflect the effectiveness of anchor texts on a page and they measure different aspects of user’s browsing behavior. In this section, based on BUE and BAE we define different criterion functions, which are shown in Table 3.

Table 3 Criterion functions

Full size table

CF1 and CF2 directly use BUE and BAE respectively. Since BUE and BAE reflect different aspects of user’s browsing behavior, the combination of them is expected to gain complementary effectiveness. In this paper, we use two common strategies for their combination, i.e., linear combination (CF3) and multiplication (CF4). Comparative experimental results are shown in Fig. 10 of Sect. 5.2.4 (CF4 is shown to be quite effective in practice). Since the focus of this paper is incorporating browsing activities, we leave the exploration of other features and criterion functions as future work.

5 Experiments

5.1 Settings and methodology

5.1.1 Anchor models

Our goal is to investigate the performance of different anchor models for Web search. To compare new anchor models with previous studies, we adopt some existing anchor models as baselines. The anchor models examined in our experiments are categorized into two types, i.e., the page-level models and the site-level models. The page-level models are as follows.

Baseline1: HPM (H: Hyperlink). This page-level hyperlink model has been adopted in (Craswell et al. 2001; Westerveld et al. 2001). In this model, w(a, d_t) is defined as the number of hyperlinks to the target page d_t with the anchor text a. HPM has been extensively described and evaluated in IR literature, and hence serves as a reproducible and common baseline.
Baseline2: HPAAM (AA: Aggregated Anchor). This model has been proposed in (Metzler et al. 2009). In this model, anchor document representations are enriched with anchor text that has been aggregated across the hyperlink graph. Metzler et al. (2009) has proposed six different weight aggregation schemes. In all our experiments, we use “Min” as the weight aggregation scheme, because it is shown to be the best (Metzler et al. 2009).
UPM: UPM aggregates browsing activities from different pages, which is described in Sect. 4.3.1.
UPM + SMTH: (SMTH: Smoothing). In this model, the UPM based document representations are smoothed using the smoothing method described in Sect. 4.4.

The site-level anchor models are as follows.

Baseline3: SRM: (R: Relationship). The site relationship model has been proposed by Dou et al. (2009). This model takes account of the relationships between Web sites (including the relationship between source site and destination site, and the relationship between different source sites). Dou has reported that SRM consistently outperforms HPM for different types of queries. Hence, SRM serves as a strong baseline.
USM: UPM aggregates browsing activities from different sites, which is described in Sect. 4.3.2.
USM + SMTH: In this model, the USM based document representations are smoothed using the smoothing method described in Sect. 4.4.

In our experiments, anchor data are obtained from the Web page collection mentioned in Sect. 3.1. This Web page collection contains 135.4 million Web pages from 5.3 million Chinese Web sites. We obtain 6.9 billion anchors from this Web page collection. Since there exists a large amount of noise such as “the previous one” “next”, “and more” etc. in original anchor texts, we ran a preprocessing process to filter out these less informative anchor texts. After preprocessing, we obtain 6.7 million distinct anchor texts. Note that one distinct anchor text usually links to many different destination pages.

For each page in this collection, different types of anchor documents are constructed respectively using different anchor models mentioned above. We index different types of anchor documents and ranking experiments are performed on them using different ranking methods. Besides, we also test the ranking performance of page content and the combination effectiveness of page content and anchor texts.

5.1.2 Ranking methods

We use the following ranking methods.

BM25 (and BM25F): Okapi BM25 model (Robertson et al. 1996) is one ranking model we adopt to retrieve the anchor documents constructed by different anchor models. Many previous studies have reported that this model is effective for anchor document retrieval (Craswell et al. 2001; Dou et al. 2009; Metzler et al. 2009). In our experiments, we use the same parameters (k1 = 2.0, b = 0.75) as the previous work (Craswell et al. 2001; Dou et al. 2009). When page contents and anchor texts are combined together, multi-field ranking model BM25F (Robertson et al. 2004) is used.
QAMatch + BM25: When a search query exactly matches an anchor text, the transition probability (Eq. 1) mentioned in Sect. 4.1 also signals the relevance of a query to a Web page. We call this ranking method as QAMatch. Dou et al. (2009) has reported the effectiveness of this method. In QAMatch + BM25 ranking method, both QAMatch score and BM25 score are normalized respectively and then combined linearly. The weight for QAMatch score and BM25 score is set to 0.5 and 0.5 in all our experiments. When a search query cannot exactly match an anchor text, only BM25 model makes a contribution to the final score.

5.1.3 Evaluation methodology

To evaluate the ranking performance, we need a dataset containing queries and their relevant answers. Our experiments make use of a dataset which contains 3,000 randomly sampled real search queries from query logs of a widely used commercial search engine. These queries are uniformly sampled from all unique queries received during Jan. 1st, 2008 to Sep. 30th, 2008. (In total, there are over 4.3 million unique queries. Also described in Sect. 3.3) For each query, we collect together top 20 search results produced by different anchor models and ranking methods. Each document has been judged by a human annotator and given a rating as to how relevant it is for the corresponding query. This rating is one of Perfect, Good, or Bad (both Perfect and Good are called “relevant” in our experiments). Furthermore, each query is manually labeled by a human annotator as one of the three types, i.e., navigational queries, informational queries, and transactional queries (Broder 2002). In the dataset, there are 1,152 (38.4%) navigational queries, 568 (18.9%) informational queries, and 1,280 (42.3%) transactional queries, and the proportions for each type of queries are in accordance with the statistics reported in (Broder 2002). The human annotators are all from a professional annotation team of the commercial search engine company.

We evaluate the ranking performance over a range of accepted information retrieval metrics as follows.

Precision at N (P@N): reports the fraction of Perfect and Good documents ranked in the top N results. The positions of relevant documents within the top N results are not considered in this measure. And this metric reflects overall user satisfaction with the top N results.
Mean Average Precision (MAP): Average precision emphasizes ranking relevant documents higher. It is the average of precisions computed at the point of each of the relevant documents in the ranked sequence. This metric has been extensively used in TREC for many years (Clarke et al. 2009).
NDCG at N (NDCG@N): NDCG is a measure devised specifically for Web search (Jarvelin and Kekalainen 2000). And it has been adopted in many studies for Web search (Dou et al. 2009). The premise of DCG is that highly relevant documents appearing lower in a search result list are penalized as the graded relevance value is reduced logarithmically proportional to the position of the result.

5.2 Results

In Sect. 4, we have stated that our goal is to generate more accurate weights for each anchor text and to improve the overall ranking results. In this section, we first examine whether our models can generate better weights. Next, we investigate the performance of different anchor models for different types of queries. And then, we evaluate the retrieval performance when page content and anchor text are combined together. Finally the analysis on the smoothing method is carried out to compare performance of different criterion functions and to tune the settings of the smoothing parameter.

5.2.1 Performance of different anchor models with different ranking methods

Table 4 shows the document ranking performance of different anchor models with different ranking methods. The results are divided into two parts, i.e., the upper part (from Row 1 to Row 8) and the bottom part (from Row 9 to Row 14). The results of the page-level anchor models are shown in the upper part; the results of the site-level anchor models are listed in the bottom part. The relative improvements of the new anchor models versus the baseline anchor models are also shown in this table. Both HPM (B1) and HPAAM (B2) are the page-level baseline models; meanwhile SRM (B3) is the site-level baseline model. The results of two-tailed t test are reflected by the up marks or the down marks. The best result among different anchor models with a certain ranking method is shown in bold. We can make the following observations:

Table 4 Comparison of results for different types of anchor document representations with different ranking methods

Full size table

1.
When the ranking method BM25 is used with the page-level anchor models (Row 1–4), UPM and UPM + SMTH outperform the baseline anchor models (HPM and HPAAM) on NDCG@1/5/10, AveNDCG and MAP. It means that UPM either before or after smoothing is more effective than HPM and HPAAM for Web Search. The performance of UPM + SMTH is much better than UPM (UPM + SMTH $ \succ $ ^{Footnote 4} UPM) and achieves the best performance, which indicates that the UPM based anchor documents are expanded by highly effective anchor texts instead of irrelevant noises.
2.
When the ranking method QAMatch + BM25 is used with the page-level anchor models (Row 5–8), UPM is still effective for improving NDCG@1 and the overall ranking performance is labeled with two up marks (AveNDCG and MAP). However, it fails to gain statistically significant improvements on NDCG@5/10. The results indicate that UPM is effective for improving the accuracy of the top 1 search result and fail to improve the other top 10 search results under the ranking method QAMatch + BM25. This can be explained by the fact that the ranking method QAMatch in essence estimates the transition probability which is inherently sensitive to the data sparseness problem of user-clicked anchor texts (mentioned in Sect. 4.4). UPM + SMTH still significantly outperforms the baselines (B1 and B2) and achieves the best performance.
3.
When BM25 is used with the site-level anchor models (Row 9–11), similar observations have been found when BM25 is used with the page-level models. The new anchor models (USM and USM + SMTH) gain statistically significant improvements versus the site-level baseline model (SRM). Meanwhile, USM + SMTH achieves the best performance across all evaluation metrics.
4.
When QAMatch + BM25 is used with the site-level anchor models (Row 12–14), USM improves the baseline (SRM) on NDCG@1/5 but fails to improve the overall performance (AveNDCG and MAP). It is due to the same data sparseness problem of user-clicked anchor texts as UPM. Moreover, USM + SMTH significantly outperforms the site-level baseline (SRM) and achieves the best performance.
5.
By comparing the performance of the new anchor models at page-level with the new models at site-level (Row3 vs. Row10; Row4 vs. Row11; Row7 vs. Row13; Row8 vs. Row14), we find that the site-level models $ \succ $ the corresponding page-level models. This observation indicates that average number of browsing activities from different sites is stronger than the absolute number of browsing activities from different pages in terms of indicating the relevance between an anchor text and its target page.
6.
By comparing the performance of different ranking methods (Row1 vs. Row5; Row 2 vs. Row6; Row 3 vs. Row7; Row4 vs. Row8; Row9 vs. Row12; Row10 vs. Row13; Row11 vs. Row14), we find that QAMatch + BM25 $ \succ $ BM25 consistently for anchor document retrieval.

From Table 4, we know that the site-level anchor models $ \succ $ the corresponding page-level models. To further illustrate the performance of the site-level anchor models, P@1–10 for different site-level anchor models are shown in Fig. 6. Similarly, when BM25 is used with the site-level models, USM + SMTH $ \succ $ USM $ \succ $ SRM (all at p < 0.001). When BM25 + QAMatch is used, USM only outperform SRM on top 2 results and USM + SMTH achieves the best performance.

To sum up, USM + SMTH with QAMatch + BM25 consistently and significantly outperforms the baselines (B1, B2, and B3) and achieves the best performance. In addition, the results in Table 4 are consistent with the results of comparative analysis in Sect. 4.2 (although QueryDoc contains some bias such as URL’s ranking position etc.).

5.2.2 Performance of different anchor models for different types of queries

As (Craswell et al. 2001; Dou et al. 2009) mentioned in their studies, anchor models may perform differently for different types of search queries. In this section, we design an experiment to examine the performance of the anchor models for different types of queries mentioned in Sect. 5.1.3. To compare with the anchor models, we also experiment with the page content. Figure 7 shows the performance of the site-level anchor models with the ranking method QAMatch + BM25 for different types of queries. From this figure, we can observe the following facts:

1.
When the search queries are navigational, we find that the new models (USM and USM + SMTH) significantly outperform the baseline model SRM. The statistically significant improvements are all at p < 0.001 on NDCG@1/5/10 and AveNDCG. It indicates that the new anchor models are effective for improving ranking performance of navigational queries. Since there are often only one or two answers for navigational queries, the observation that USM is especially effective for top five results (Row 13 in Table 4) explains why USM$ \succ $SRM for navigational queries.
2.
When the search queries are informational, the performance of USM improves the top five result (NDCG@1/5, p < 0.001) but fail to improve NDCG@10 and AveNDCG (with statistically significant decrease for NDCG@10 and AveNDCG at p < 0.001). Meanwhile, USM + SMTH$ \succ $SRM for all metrics at p < 0.001. The performance of page content $ \succ $ all anchor models on NDCG@10 and AveNDCG (all at p < 0.001).
3.
When the search queries are transactional, the performance of the new anchor models (USM and USM + SMTH) significantly outperforms the baseline SRM (all at p < 0.001). It implies that browsing activities are also very useful for improving ranking performance of transactional queries. The relevant answers of transactional queries are often several pages providing certain kinds of transactional services, such as software downloading, video online watching etc. The relevance of these pages to transactional queries largely depend on the user experience or service quality of these pages, such as downloading speed, video watching experience etc. Since browsing activities of Web users on an anchor text to some extent reflect user experience or service quality of the target page, it is understandable that anchor models with browsing activities are effective for transactional queries.

To illustrate how browsing activities of users reflect the quality of transactional services provided by target pages, we compare the transition probabilities of UPM and HPM when the same anchor text is given. Figure 8a illustrate p _hp (d|a) and p _up (d|a) for the anchor texts “map online”; where, p _hp (d|a) represents the transition probability in HPM; p _up (d|a) denotes the transition probability in UPM. The target pages have been ordered from most to least according to p _hp (d|a). For the transactional Web service—map online, in Chinese market map.google.com and map.sogou.com provide very good user experience and service quality. They are so popular that users often want to find them in the search results when they submit the query “map online” to search engines. However, Fig. 8a shows that they have very low p _hp(d|a) value. Meanwhile the pages (http://www.51ditu.com and http://www.52maps.com), which have high p _hp(d|a) value, provide worse user experience and service quality. According to our investigation, the pages http://www.51ditu.com and http://www.52maps.com have high p _hp (d|a) value mainly because they are linked by many different websites for the purpose of advertisement. By contrast, map.google.com and map.sogou.com have higher p _up(d|a) value than http://www.51ditu.com and http://www.52maps.com, which more objectively reflects the actual user experience and service quality. Similarly, in Fig. 8b, the pages with better user experience and service quality (http://www.jipiaotaobao.com, http://www.qunar.com, and http://www.ctrip.com) are highly ranked by UPM, but get lower ranks in HPM. Meanwhile, the pages with more commercial incoming links (http://www.airtofly.com and fly.piao.com.cn) are highly ranked by HPM.

5.2.3 Combining page content and anchor text

Since page content and anchor texts may reflect different aspects of a document, they are usually combined together in modern search engines. We also test the approach combining page content and anchor texts. Robertson et al. (2004) pointed out that a linear combination of BM25 scores is problematic and proposed a linear combination of term frequencies (BM25F). In this paper, we use the same term frequency combination method as in (Robertson et al. 2004). Suppose W _content(t, d) and W _anchor(t, d) are the term frequencies of term t in page content and anchor text fields of page d. The BM25 score is calculated based on the aggregated term frequency W _bm25f(t, d) over page content and anchor text fields as follows:

$$ W_{bm25f} (t,d) = \alpha \cdot W_{\text{content}} (t,d) + (1 - \alpha ) \cdot W_{\text{anchor}} (t,d) $$

(4)

where α is a combination parameter that decides the weight of page content and anchor text in the aggregated BM25 score. The combined document length is also calculated using the same method.

Since Table 4 shows that QAMatch is effective for ranking anchor document representations, we also incorporate the QAMatch score when page content and anchor texts are combined together. To incorporate the QAMatch score into BM25F score, we use the method proposed in (Agichtein et al. 2006). I _d represents the rank of page d obtained from the QAMatch score, and O _d is the rank of page d obtained from the BM25F score. The merged score is calculated as follows.

$$ S_{M} (d,I_{d} ,O_{d} ,\omega_{I} ) = \left\{ {\begin{array}{*{20}c} {\omega_{I} \cdot {\frac{1}{{I_{d} + 1}}} + {\frac{1}{{O_{d} + 1}}}} & {{\text{if}}\,{\text{exists}}\,{\text{QAmatch}}\,{\text{score}}\,{\text{ for}}\,d} \\ {{\frac{1}{{O_{d} + 1}}}} & {\text{otherwise}} \\ \end{array} } \right. $$

(5)

where ω _I represents the importance of QAMatch and equals to 2 in our experiments. From (Eq. 5 ), we know that QAMatch score actually reorders the rank produced by the BM25F score. Figure 9 shows the experimental results with different settings of the combination parameter α in (Eq. 4 ). From this figure, we can observe the following facts:

1.
Figure 9a shows that USM outperforms SRM on NDCG@1 and USM + SMTH achieves the best performance (all at p < 0.001). This suggests that both USM and USM + SMTH can improve retrieval effectiveness if only top 1 results is sought.
2.
Figure 9b shows that USM fail to obtain statistically significant improvement versus SRM on NDCG@10. This is consistent with the conclusions we derive in Table 4. However, USM + SMTH consistently outperforms SRM (p < 0.001).

5.2.4 Analysis on the smoothing method

In this section, we provide deeper analysis on the smoothing method. We compare the performance of different criterion functions (shown in Table 3) with different settings of the smoothing parameter. We compare the performance of different anchor documents produced by USM + SMTH with different criterion functions and different settings of the smoothing parameter. Meanwhile the ranking method QAMatch + BM25 is used in the experiments. In order to make the tuning process convenient, the smoothing parameter δ is normalized between 0 and 1. When the Norm(δ) equals to 0, anchor texts on all source pages are aggregated into the USM based anchor documents of their target pages. When the Norm(δ) equals to 1, no anchor text is added into the USM based anchor documents; As a result, USM + SMTH construct anchor documents equal to USM.

Figure 10a shows AveNDCG results of different criterion functions with different settings of the smoothing parameter. CF1 and CF2 achieve the best performance when the smoothing parameter equals to 0.8, while CF3 and CF4 achieve the best performance at δ = 0.9. Figure 10b shows NDCG@1/5/10 results of different criterion functions (the smoothing parameter equals to 0.8 for CF1 and CF2 and equals to 0.9 for CF3 and CF4). Both CF3 and CF4 outperform CF1 and CF2, which implies that combining BUE and BAE together is better than using them respectively. Moreover, CF4 consistently outperforms CF3 (all at p < 0.001). That’s why the results of CF4 are shown in all experiments in previous sections. According to our investigation, USM + SMTH consistently outperforms UPM + SMTH for different criterion functions. Thus, for clarity of Fig. 10, we only show the experimental results of USM + SMTH with different criterion functions. The results of UPM + SMTH with different criterion functions are similar to USM + SMTH, i.e., CF4 consistently achieves the best performance.

6 Conclusions

Existing methods of anchor text weighting mainly focus on using the hyperlink based information, which to some extent reflects the recommendation of page content editors. Browsing behavior of Web users reflects the crowd wisdom and provides complementary information for anchor text weighting. However, this information has not been taken into account when associated anchor texts are used to improve Web search.

In this paper, we discuss the possibility and effectiveness of incorporating browsing activities into anchor texts for Web search. We first discuss the effectiveness of anchor text with browsing activities. We find that anchor texts with browsing activities are more effective for Web search than other anchor texts. Then, we propose two new anchor models (UPM and USM) which aggregate browsing activities at page-level and at site-level. UPM/USM based anchor documents can be regarded as being constructed by a double election process. The page authors first make a list of target page candidates for an anchor text, and then Web users vote for them unconsciously during Web browsing. Thus, to a certain extent, UPM/USM based anchor documents reflect both the recommendation of page content editors and the preference of Web users. To further tackle the data sparseness problem of user-clicked anchor texts, the smoothing method has been proposed. This smoothing method is based on two features (BUE and BAE) reflecting user’s browsing behavior on a Web page.

In experiments, we compare the new anchor models incorporating browsing activities with existing anchor models. We find that the new anchor models after smoothing consistently and significantly outperforms the state-of-art anchor models which use only the hyperlink based information. This paper demonstrates the value of Web browsing activities to affect anchor text weighting.

Notes

http://www.sogou.com/labs/dl/t.html.
Generally an anchor document of a target Web page consists of all anchor texts on different source pages with reference to the target page.
In this paper, we call anchor document (representations) constructed according to the definition of an anchor model XXX as XXX based anchor document (representations).
The label “$ \succ $” means the ranking performance of the left hand side anchor model/ranking method is better than the right hand side; “$ \prec $” has the opposite meaning.

References

Agichtein, E., Brill, E., & Dumais, S. (2006). Improving web search ranking by incorporating user behavior information. In Proceedings of the ACM conference on research and development on information retrieval (SIGIR). New York, NY, USA: ACM.
Amitay, E., & Paris, C. (2000). Automatically summarising websites: Is there a way around it? In Proceeding of CIKM ‘00 (pp. 173–179). New York, NY, USA: ACM.
Bilenko, M., & White, R. W. (2008). Mining the search trails of surfing crowds: identifying relevant websites from user activity. In Proceeding of WWW ‘08 (pp. 51–60). New York, NY, USA: ACM.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Seventh international world-wide web conference (WWW 1998), April 14–18, 1998, Brisbane, Australia. New York, NY, USA: ACM.
Broder, A. (2002). A taxonomy of web search. SIGIR Forum, 36(2), 3–10. ACM.
Article Google Scholar
Clarke, C. L. A., Craswell, N., & Soboroff, I. (2009). Overview of the TREC 2009 webtrack. In Proceedings of the 18th text retrieval conference.
Craswell, N., Hawking, D., & Robertson, S. (2001). Effective site finding using link anchor information. In Proceeding of SIGIR ‘01 (pp. 250–257). New York, NY, USA: ACM.
Dou, Z., Song, R., Nie, J.-Y., & Wen, J.-R. (2009). Using anchor texts with their hyperlink structure for web search. In Proceeding of SIGIR ‘09 (pp. 227–234). New York, NY, USA: ACM.
Eiron, N., & McCurley, K. S. (2003). Analysis of anchor text for web search. In Proceeding of SIGIR ‘03 (pp. 459–460). New York, NY, USA: ACM.
Fujii, A. (2008). Modeling anchor text and classifying queries to enhance web document retrieval. In Proceeding of WWW’08 (pp. 337–346). New York, NY, USA: ACM.
Gyöngyi, Z., & Garcia-Molina, H. (2005). Web spam taxonomy. In the 1st international workshop on adversarial information retrieval on the web. AIRWeb ‘05. New York, USA: ACM.
Jarvelin, K., & Kekalainen, J. (2000). IR evaluation methods for retrieving highly relevant documents. In Proceedings of the ACM conference on research and development on information retrieval (SIGIR). New York, NY, USA: ACM.
Kraft, R., & Zien, J. (2004). Mining anchor text for query refinement. In Proceeding of WWW ‘04 (pp. 666–674). New York, NY, USA: ACM.
Lee, U., Liu, Z., & Cho, J. (2005). Automatic identification of user goals in web search. In Proceeding of WWW ‘05 (pp. 391–400). New York, NY, USA: ACM.
Liu, Y., Cen, R., Zhang, M., Ma, S., & Ru, L. (2008a). Identifying web spam with user behavior analysis. In the 4th international workshop on adversarial information retrieval on the web. AIRWeb ’08 (pp. 9–16). New York, NY: ACM.
Liu, Y., Gao, B., Liu, T.-Y., Zhang, Y., Ma, Z., He, S. et al. (2008b). BrowseRank: letting web users vote for page importance. In Proceeding of SIGIR’08 (pp. 451–458). New York, NY, USA: ACM.
Lu, W.-H., Chien, L.-F., & Lee, H.-J. (2004). Anchor text mining for translation of web queries: A transitive translation approach. ACM Transaction on Information System, 22(2), 242–269.
Article Google Scholar
Metzler, D., Novak, J., Cui, H., & Reddy, S. (2009). Building enriched document representations using aggregated anchor text. In Proceeding of SIGIR’09 (pp. 123–130). New York, NY, USA: ACM.
Ponte, J. M., & Croft, W. B. (1998). A language modeling approach to information retrieval. In Proceeding of SIGIR ‘98 (pp. 275–281). New York, NY, USA: ACM.
Robertson, S., Walker, S., Jones, S., Hancock-Beaulieu, M., & Gatford, M. (1996). Okapi at trec-3. In Proceedings of TREC–3 (pp. 109–126).
Robertson, S, Zaragoza, H., & Taylor, M. (2004). Simple bm25 extension to multiple weighted fields. In Proceedings of CIKM ‘04 (pp. 42–49). ACM.
Sarukkai, R. R. (2000). Link prediction and path analysis using Markov chains. Computer Networks, 33, 377–386.
Article Google Scholar
Sarwar, B. M., Karypis, G., Konstan, J. A., & Riedl, J. T. (2000). Analysis of recommender algorithms for e-commerce. In Proceedings of 2nd ACM Conference on electronic commerce (pp. 158–167). NewYork: ACM Press.
Westerveld, T., Kraaij, W., & Hiemstra, D. (2001). Retrieving web pages using content, links, urls and anchors. In Tenth text retrieval conference (pp. 663–672).
White, R. W., Bilenko, M., & Cucerzan, S. (2007). Studying the use of popular destinations to enhance web search interaction. In SIGIR ‘07 (pp. 159–166). New York, USA: ACM.
Yiqun, L., & Liyun Ru, S. M. (2006). Automatic query type identification based on click through information. In Proceeding of AIRS ‘06 (pp. 593–600).

Download references

Acknowledgments

Supported by Natural Science Foundation (60736044, 60903107) and Research Fund for the Doctoral Program of Higher Education of China (20090002120005).

Author information

Authors and Affiliations

State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, CS&T Department, Tsinghua University, 100084, Beijing, People’s Republic of China
Bo Zhou, Yiqun Liu, Min Zhang, Yijiang Jin & Shaoping Ma

Authors

Bo Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Yiqun Liu
View author publications
You can also search for this author in PubMed Google Scholar
Min Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Yijiang Jin
View author publications
You can also search for this author in PubMed Google Scholar
Shaoping Ma
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Bo Zhou.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhou, B., Liu, Y., Zhang, M. et al. Incorporating web browsing activities into anchor texts for web search. Inf Retrieval 14, 290–314 (2011). https://doi.org/10.1007/s10791-010-9151-7

Download citation

Received: 30 April 2010
Accepted: 21 October 2010
Published: 28 November 2010
Issue Date: June 2011
DOI: https://doi.org/10.1007/s10791-010-9151-7

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Incorporating web browsing activities into anchor texts for web search

Abstract

Similar content being viewed by others

Modeling user interests from web browsing activities

Accessing Information with Tags: Search and Ranking

The Research on Webpage Ranking Algorithm Based on Topic-Expert Documents

1 Introduction

2 Related work

3 The data

3.1 Anchor data

3.2 Web browsing data

3.3 Query logs

4 Anchor models

4.1 General anchor document representation

4.2 Analysis on anchor texts with browsing activities

4.3 Incorporating browsing activities

4.3.1 Page-level aggregation (UPM)

4.3.2 Site-level aggregation (USM)

4.4 Smoothing

4.4.1 The smoothing method

4.4.2 Entropy of browsing users (BUE)

4.4.3 Entropy of browsing anchors (BAE)

4.4.4 Criterion functions

5 Experiments

5.1 Settings and methodology

5.1.1 Anchor models

5.1.2 Ranking methods

5.1.3 Evaluation methodology

5.2 Results

5.2.1 Performance of different anchor models with different ranking methods

5.2.2 Performance of different anchor models for different types of queries

5.2.3 Combining page content and anchor text

5.2.4 Analysis on the smoothing method

6 Conclusions

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation