Elsevier

Computer Networks

Volume 51, Issue 1, 17 January 2007, Pages 177-189

On the peninsula phenomenon in web graph and its implications on web search

https://doi.org/10.1016/j.comnet.2006.04.016

Abstract

Webmasters usually place certain web pages, such as home pages and index pages, in front of others. Under such a design, one must pass through some pages to reach the destination pages, much as an inner town of a peninsula can only be reached through the towns at its edge. In this paper, we validate that peninsulas are a universal phenomenon on the World-Wide Web and clarify how this phenomenon can be used to enhance web search and to study web connectivity problems. For this purpose, we model the web as a directed graph and give a proper definition of peninsulas based on this graph. We also present an efficient algorithm to find web peninsulas. Using data collected from the Chinese web by the Tianwang search engine, we examine the distribution of peninsula sizes and their correlations with the PageRank values, outdegrees, and indegrees of their taches, the single nodes that tie them to outside vertices. The results show that the peninsula structure of a web graph can greatly expedite the computation of PageRank values, and that it significantly affects the link extraction capability and information coverage of web crawlers.

Introduction

The World-Wide Web has continued its remarkable and seemingly exponential growth since its inception. The rapid increase in the number of hosts, web pages and link relations has put a great deal of pressure on web information systems such as general search engines and web archives, which encourages us to find good tools and policies for efficiently managing the huge amount of internet information. The basic graph model of the web views web pages and their hyperlinks as nodes and arcs, respectively, so the web is in effect a directed graph. In recent years, many researchers have studied the link structure of web graphs, for example its shape and the distribution of the indegrees of web pages. These studies have played a key role in the design and implementation of web-based applications, and many useful tools have been built on them, such as web spiders that follow the arcs to download web pages, and the Google PageRank algorithm [13], which analyzes the relative importance of pages using the web's intrinsic link structure.

This paper discusses the phenomenon known as peninsula in web graphs, which we observed in our research on the Tianwang search engine [22] and the Web Infomall system [23]. A peninsula in a web graph is a set of vertices each of which is reachable from outside the set only through a common node in the set, called the tache, just as a geographical peninsula is accessible only from where it joins the mainland. We believe the universal existence of peninsulas has two main causes. First, many websites deliberately place content behind front pages so that people can only access it through those pages. For example, many BBS-style websites require a session identifier in the URL to visit inner pages, and many sites use a fragment of JavaScript to jump from their home pages to the content inside; it is usually too hard for spiders to parse such script-generated URLs. Second, web resources are usually organized in a tree-like hierarchical directory, and most web pages outside the directory link only to its index page rather than to the contents inside.

In our research on search engines, we studied the peninsula phenomenon for the following two reasons:

  • 1.

    Information coverage: One of the important metrics of a search engine is information coverage, so it is important to understand how a loss of link extraction capability, e.g., an unparsable URL, eventually harms coverage. The peninsula phenomenon is the key to answering this question: if a lost page happens to be the tache of a peninsula, all other vertices in the peninsula are lost too.

  • 2.

    PageRank computation: Peninsulas can be viewed as relatively isolated sets of nodes. Both the PageRank algorithm and the HITS algorithm [9] are based on the link structure of a web graph, and both are computed by power iteration on the corresponding link matrix (a minimal sketch of this iteration follows the list). Based on the peninsula concept, a web graph can be grouped into natural blocks, so the power iteration can be greatly expedited when the scale of the graph is huge.
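To make the cost concrete, here is a minimal sketch of the standard PageRank power iteration in Python (our illustration, not the authors' implementation; the graph, the damping factor d = 0.85, and the four-page example are assumptions for the demo):

    import numpy as np

    def pagerank(out_links, n, d=0.85, tol=1e-8, max_iter=100):
        # Power iteration for PageRank on a graph with n nodes.
        # out_links maps node -> list of successors; dangling nodes
        # (no out-links) spread their rank uniformly over all pages.
        r = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            r_new = np.zeros(n)
            dangling = 0.0
            for u in range(n):
                succs = out_links.get(u, [])
                if succs:
                    for v in succs:
                        r_new[v] += r[u] / len(succs)
                else:
                    dangling += r[u]
            r_new = d * (r_new + dangling / n) + (1 - d) / n
            if np.abs(r_new - r).sum() < tol:
                break
            r = r_new
        return r_new

    # Hypothetical four-page graph; page 3 is a dangling node.
    print(pagerank({0: [1, 2], 1: [2], 2: [0, 3]}, 4))

Each iteration touches every link once, which is why partitioning the graph into nearly independent blocks, such as peninsulas, can cut the cost substantially.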

The rest of this paper is structured as follows. In Section 2, we give a precise definition of peninsulas in web graphs and present an efficient method for finding the peninsula with a given base node. In Section 3, we discuss experimental results on the distribution of peninsula sizes and their relationships with three link attributes of taches (PageRank values, outdegrees and indegrees); the experiment is based on 251 million web pages collected by the Tianwang search engine. In Section 4, we explain the applications of peninsulas in web search. In Section 5, we conclude and discuss future work.

The structure of web graphs has been deeply analyzed both theoretically and experimentally. Levene et al. [14] presented a stochastic model for the evolution of the web and proved that both the indegrees and the outdegrees of web pages obey a power law, f_i = C·i^(−(1+ρ)), where f_i is the number of pages with degree i. Another similar theoretical work was completed in [11], which used a rate-equation method to prove the power law. Several authors have reported experimental results on the power law based on web crawling: Kumar et al. [10] examined a web crawl of about 40 million pages from 1997 and confirmed the power law of indegrees and outdegrees. Similar work has been done by Barabási and Albert [3]. Our earlier work [16] also showed experimentally that many link attributes, such as PageRank values and indegrees, obey the power law. The experiments reported in this paper show that the sizes of peninsulas obey the power law as well.
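Since power-law claims recur throughout the paper, it is worth showing one common way to check them on crawl data: a least-squares fit of the empirical frequency histogram on a log-log scale. The sketch below is our illustration, not the authors' procedure (maximum-likelihood estimators are generally more robust):

    import numpy as np

    def powerlaw_exponent(samples):
        # Estimate alpha in f_i ~ C * i^(-alpha) from observed integer
        # values (e.g. indegrees or peninsula sizes) by regressing
        # log-frequency on log-value.
        values, counts = np.unique(np.asarray(samples), return_counts=True)
        slope, _ = np.polyfit(np.log(values), np.log(counts / counts.sum()), 1)
        return -slope

    # Synthetic sample roughly following a power law with alpha ~ 2.1.
    rng = np.random.default_rng(0)
    print(powerlaw_exponent(np.floor(rng.pareto(1.1, 100_000) + 1).astype(int)))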

The shape of a web graph was studied in [1], which proposed the famous “bow tie” structure depicted in Fig. 1. In this work, the web is divided into five parts: SCC (the strongly connected component), IN (pages that can reach the SCC but cannot be reached from it), OUT (pages that are reachable from the SCC but do not link back to it), TENDRILS (pages that can neither reach the SCC nor be reached from it), and DISC (disconnected components). Their proportions are 27.7%, 21.3%, 21.3%, 21.5% and 8.2%, respectively. This means that a crawler following links can reach at most about 70% of all pages. The bow-tie structure is also used to speed up the power iteration in PageRank computation in [2]. Kamvar et al. [20] presented a method to divide the web graph into host blocks according to the strong local link relations inside websites, which expedites the iteration as well. This paper introduces a method that outperforms both of the above.
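The decomposition itself is straightforward to compute once the largest strongly connected component is known. The following sketch uses the networkx library on a made-up toy graph (our illustration; the original study of course worked at web scale, and we lump TENDRILS and DISC together for brevity):

    import networkx as nx

    def bow_tie(G):
        # Split a directed graph into bow-tie parts: the largest SCC,
        # IN (reaches the SCC), OUT (reached from the SCC), and OTHER
        # (tendrils and disconnected components, lumped together).
        scc = max(nx.strongly_connected_components(G), key=len)
        pivot = next(iter(scc))
        in_side = nx.ancestors(G, pivot) - scc
        out_side = nx.descendants(G, pivot) - scc
        other = set(G) - scc - in_side - out_side
        return scc, in_side, out_side, other

    # Toy graph: a 3-node core, one IN page, one OUT page, one
    # disconnected pair.
    G = nx.DiGraph([(1, 2), (2, 3), (3, 1), (0, 1), (3, 4), (5, 6)])
    for name, part in zip(("SCC", "IN", "OUT", "OTHER"), bow_tie(G)):
        print(name, sorted(part))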

In our previous work [15], we proposed a model for evaluating the information coverage of search engines, defining quantity coverage for ordinary pages and quality coverage for important pages. Cho and Garcia-Molina [5] explained the relationship between crawling modes and information coverage. They suggested that in a firewall mode, where URLs are not transparent to servers and where only a small number of clusters run in parallel, information coverage is not significantly affected by the number of seeds (i.e., the URLs from which the crawling starts). This result is incomplete, since it used only 40 million out of the total 2 billion static pages [7] and did not directly consider link extraction capability.

In this paper, we report a large scale experiment on peninsulas based on the web data collected by our crawlers. We achieve the following results:

  • 1.

    We discover the universal peninsula phenomenon in web graphs and give it a proper definition. We propose an efficient algorithm to find the peninsula with a given node as the tache.

  • 2.

    Based on about 251 million pages and the hyperlinks among them, we study the distribution of peninsula sizes and find that it follows the power law. We also perform experiments on the relationships between peninsula sizes and the indegrees, outdegrees, and PageRank values of their taches.

  • 3.

    For peninsulas of average size, we find that a slight loss in link extraction capability significantly reduces the final information coverage; the higher the PageRank values of the pages, the less significant the effect on their coverage.

  • 4.

    The locality of link relations in peninsulas can be used to greatly expedite the PageRank computation. Our analysis shows that peninsulas can be used to divide web graphs into blocks and complete the power iteration more efficiently than previous work.

Section snippets

Modeling the peninsula structure of the web

The definition of a web peninsula differs from the geographical one in two ways: first, since a web graph is directed, the pages of a peninsula are only reachable from the tache, irrespective of where their own out-links reach; second, we limit the connection between a peninsula and outside vertices to the tache, a single node, whereas the border of a real peninsula is usually a curve.
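The snippet above only summarizes the algorithm, so the following is a hedged reconstruction of one plausible linear-time approach, not the paper's own code: take everything reachable from the tache, then repeatedly evict any node with an in-link from outside the candidate set, since such a node is reachable without passing through the tache. All names here are ours.

    from collections import deque

    def peninsula(succ, pred, tache):
        # succ / pred map each node to its sets of out- / in-neighbours.
        # Step 1: candidate set = the tache plus everything reachable
        # from it.
        cand, queue = {tache}, deque([tache])
        while queue:
            for v in succ.get(queue.popleft(), ()):
                if v not in cand:
                    cand.add(v)
                    queue.append(v)
        # Step 2: evict nodes with an in-edge from outside the candidate
        # set, and propagate: successors of an evicted node are now also
        # reachable from outside without passing through the tache.
        dirty = deque(v for v in cand - {tache}
                      if any(p not in cand for p in pred.get(v, ())))
        while dirty:
            v = dirty.popleft()
            if v in cand:
                cand.discard(v)
                dirty.extend(w for w in succ.get(v, ())
                             if w in cand and w != tache)
        return cand

    # Toy graph with tache 1: the external link 4 -> 3 makes the branch
    # {3, 2} reachable around the tache, so both are evicted.
    succ = {0: {1}, 1: {2, 3}, 3: {2}, 4: {3}}
    pred = {1: {0}, 2: {1, 3}, 3: {1, 4}}
    print(peninsula(succ, pred, 1))   # -> {1}

Both passes touch each edge a constant number of times, so the whole search runs in time linear in the size of the graph.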

Web experiment setup and results

We reconstructed the original web graph with the help of crawlers. In this section, we first introduce the crawlers we use and analyze the dataset sampled from the Chinese web, then present our methodology for studying peninsulas and, finally, the results.

Implications on web search

The universal existence of peninsulas has two direct applications: first, we can quantify the loss in information coverage caused by a given loss in link extraction capability; second, the local link structure of peninsulas can expedite the computation of PageRank values.
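For the first application, the effect is easy to probe by simulation: crawl from a set of seeds, but fail to extract links from a random fraction of visited pages; whenever such a page is the tache of a peninsula, the entire peninsula disappears from the crawl. The sketch below is our own hypothetical setup, not the paper's experiment:

    import random

    def coverage_after_loss(succ, seeds, loss_rate, rng=None):
        # Crawl from the seeds, but fail to parse links on a random
        # fraction `loss_rate` of visited pages; return the fraction
        # of all known nodes the crawl still reaches.
        rng = rng or random.Random(0)
        seen, stack = set(seeds), list(seeds)
        while stack:
            u = stack.pop()
            if rng.random() < loss_rate:
                continue   # page fetched, but its links were not extracted
            for v in succ.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        nodes = set(succ) | {v for vs in succ.values() for v in vs}
        return len(seen) / len(nodes)

Averaging this over many trials and loss rates on a graph with a heavy-tailed peninsula-size distribution would reproduce, qualitatively, the disproportionate coverage loss discussed above.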

Conclusion and future work

In this paper, we perform a comprehensive study of peninsulas in web graphs, a non-obvious but universal phenomenon. We first analyze the reasons for the existence of peninsulas: some site administrators place content behind home pages so that no outside links reach it, and web resources are usually organized in a directory tree structure. Then we present the definition of peninsulas, describe some of their characteristics, and propose two searching algorithms: one can find

Acknowledgements

We thank Professor Xiaoming Li, Dr. Weihong Wang, Dr. Bo Peng, Zhifeng Yang, Lianen Huang, Jiaji Zhu, Jiajing Li, Bihong Gong, and Xiubo Zhao from Peking University, and Qu Li from China University of Geosciences for their comments. Thanks also to Jing Zhao from Hong Kong University of Science and Technology, Xiaojie Gao from California Institute of Technology, and Hang Su from Vanderbilt University for their proofreading. The work of Tao Meng was supported by NSFC Grant 60435020 and NSFC Grant

References

  • P.L. Krapivsky et al.

    A statistical physics perspective on web growth

    Computer Networks

    (2002)
  • Mark Levene et al.

    A stochastic model for the evolution of the web

    Computer Networks

    (2002)
  • Hongfei Yan et al.

    Architectural design and evaluation of an efficient web-crawling system

    Journal of Systems and Software

    (2002)
  • Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, Janet...
  • Arvind Arasu, Jasmine Novak, Andrew S. Tomkins, John A. Tomlin, Pagerank computation and the structure of the web:...
  • A. Barabasi et al.

    Emergence of scaling in random networks

    Science

    (1999)
  • P. Boldi, S. Vigna, The WebGraph framework I: compression techniques, in: Proceedings of the Thirteenth International WWW...
  • Junghoo Cho, Hector Garcia-Molina, Parallel crawlers, in: The Proceedings of 11th World Wide Web Conference, Hawaii,...
  • The China Internet Network Information Center, China’s Internet Development and Usage Report. Available from:...
  • Cyveillance, Inc., White Papers. Sizing the Internet. Available from:...
  • N. Eiron, K.S. McCurley, Locality, hierarchy, and bidirectionality on the web, in: The Workshop on Web Algorithms and...
  • J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, The web as a graph: measurements, models and...

Tao Meng received his bachelor’s degree in computer science from Peking University in 2002. He is currently a Ph.D. candidate at Peking University, supervised by Prof. Xiaoming Li. His research interests mainly include search engines and web mining. Meng joined the Tianwang Search Engine group in 2001 and has worked primarily on web crawlers and link analysis since then. His design and implementation work in 2005 made the Tianwang system capable of downloading more than one billion web pages and performing link analysis, such as PageRank computation, on them in a short period.

Hong-Fei Yan received his Ph.D. in computer science from Peking University in 2002. He is currently an associate professor at Peking University. His research interests involve information retrieval and distributed systems. He was in charge of the Tianwang Search Engine’s parallel upgrade, scaling it from a one-million-page search engine to one covering tens of millions of pages. He also pioneered the deployment of the first large-scale Chinese web test collection, with 100 GB of web pages (CWT100g), and has been organizing the annual Workshop on Chinese Web Information Retrieval Evaluation since 2004.
