The Web as a graph: How far we are

Published: 01 February 2007

Abstract

In this article we present an experimental study of the properties of webgraphs. We study a large crawl from 2001 of 200M pages and about 1.4 billion edges, made available by the WebBase project at Stanford, as well as several synthetic graphs generated according to various recently proposed models. We investigate several topological properties of such graphs, including the number of bipartite cores and strongly connected components, the distribution of degrees and PageRank values, and some correlations; we also compare the models against these measures. Our findings are that (i) the WebBase sample differs slightly from the (older) samples studied in the literature, and (ii) although these models do not capture all of its properties, they do exhibit some peculiar behaviors not found, for example, in the models from classical random graph theory. Moreover, we developed a software library able to generate and measure massive graphs in secondary memory; this library is publicly available under the GPL licence. We discuss its implementation and some computational issues related to secondary memory graph algorithms.
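As a rough illustration of three of the measures named in the abstract (degree distribution, PageRank values, and strongly connected components), the following Python sketch computes them on a toy edge list held entirely in main memory. It is not the authors' library and makes no attempt at secondary-memory processing. The function names, the damping factor of 0.85, the fixed iteration count, the uniform redistribution of rank from pages without out-links, and the two-pass (Kosaraju-style) component search are all assumptions chosen for brevity.

```python
from collections import Counter

def indegree_distribution(edges, n):
    """Map each in-degree value to the number of nodes that have it."""
    indeg = [0] * n
    for _, v in edges:
        indeg[v] += 1
    return Counter(indeg)

def pagerank(edges, n, damping=0.85, iters=50):
    """Power iteration; rank held by dangling nodes is spread uniformly."""
    out = [[] for _ in range(n)]
    for u, v in edges:
        out[u].append(v)
    rank = [1.0 / n] * n
    for _ in range(iters):
        dangling = sum(rank[u] for u in range(n) if not out[u])
        nxt = [(1.0 - damping) / n + damping * dangling / n] * n
        for u in range(n):
            if out[u]:
                share = damping * rank[u] / len(out[u])
                for v in out[u]:
                    nxt[v] += share
        rank = nxt
    return rank

def strongly_connected_components(edges, n):
    """Two-pass (Kosaraju-style) search with explicit stacks, no recursion."""
    fwd = [[] for _ in range(n)]
    rev = [[] for _ in range(n)]
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
    # Pass 1: record nodes in order of DFS finishing time.
    order, seen = [], [False] * n
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, iter(fwd[s]))]
        while stack:
            node, it = stack[-1]
            for w in it:
                if not seen[w]:
                    seen[w] = True
                    stack.append((w, iter(fwd[w])))
                    break
            else:
                order.append(node)
                stack.pop()
    # Pass 2: search the reversed graph in decreasing finishing time.
    comp, count = [-1] * n, 0
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s] = count
        stack = [s]
        while stack:
            node = stack.pop()
            for w in rev[node]:
                if comp[w] == -1:
                    comp[w] = count
                    stack.append(w)
        count += 1
    return comp, count

# Hypothetical toy graph: a 3-cycle (0, 1, 2) feeding a short tail (3, 4).
toy_edges = [(0, 1), (1, 2), (2, 0), (2, 3), (0, 3), (3, 4)]
toy_n = 5
```

On the toy graph, calling `strongly_connected_components(toy_edges, toy_n)` groups nodes 0, 1, and 2 into one component and leaves 3 and 4 as singletons. For graphs at the scale of the crawl described above, adjacency lists of this kind would not fit in main memory, which is the issue the article's secondary-memory library addresses.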

Published In

ACM Transactions on Internet Technology, Volume 7, Issue 1
February 2007
184 pages
ISSN: 1533-5399
EISSN: 1557-6051
DOI: 10.1145/1189740

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Graph structure
  2. World-Wide-Web
  3. models

Cited By

  • (2024) Small World in Social Networks. Social Network Computing, 219-256. DOI: 10.1007/978-981-97-4084-0_7. Online publication date: 2-Nov-2024
  • (2023) Web Science: An Interdisciplinary Approach to Understanding the Web. Linking the World’s Information, 67-84. DOI: 10.1145/3591366.3591374. Online publication date: 5-Sep-2023
  • (2023) Predictive Behavior Modeling Through Web Graphs: Enhancing Next Page Prediction Using Dynamic Link Repository. 2023 IEEE International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), 415-420. DOI: 10.1109/WI-IAT59888.2023.00068. Online publication date: 26-Oct-2023
  • (2022) Maximal paths recipe for constructing Web user sessions. World Wide Web 25:6, 2455-2485. DOI: 10.1007/s11280-022-01024-3. Online publication date: 1-Nov-2022
  • (2021) Exploring the Topological Properties of the Tor Dark Web. IEEE Access 9, 21746-21758. DOI: 10.1109/ACCESS.2021.3055532. Online publication date: 2021
  • (2019) Detecting Web Spam in Webgraphs with Predictive Model Analysis. 2019 IEEE International Conference on Big Data (Big Data), 4299-4308. DOI: 10.1109/BigData47090.2019.9006282. Online publication date: Dec-2019
  • (2018) On Web’s contact structure. Journal of Ambient Intelligence and Humanized Computing. DOI: 10.1007/s12652-018-1002-1. Online publication date: 4-Sep-2018
  • (2017) What's Inside a Bow-Tie. Proceedings of the 2017 International Conference on Information System and Data Mining, 39-43. DOI: 10.1145/3077584.3077589. Online publication date: 1-Apr-2017
  • (2015) Analyzing topological characteristics of the Korean blogosphere. Journal of Web Engineering 14:1-2, 151-178. DOI: 10.5555/2871254.2871263. Online publication date: 1-Mar-2015
  • (2015) Personalized Information Access Using Semantic Knowledge. Smart Information Systems, 181-211. DOI: 10.1007/978-3-319-14178-7_7. Online publication date: 15-Jan-2015
