ABSTRACT
We analyze the traffic-weighted Web host graph obtained from a large sample of real Web users over about seven months. This complex dynamic network reveals a number of interesting structural properties, some in line with the well-studied boolean link host graph and others pointing to important differences. We find that while search is directly involved in a surprisingly small fraction of user clicks, it leads to a much larger fraction of all sites visited. The temporal traffic patterns display strong regularities, with a large portion of future requests being statistically predictable from past ones. Given the importance of topological measures such as PageRank in modeling user navigation, as well as their role in ranking sites for Web search, we use the traffic data to validate the PageRank random surfing model. The ranking obtained from the actual frequency with which users visit a site differs significantly from the one approximated by the uniform surfing/teleportation behavior modeled by PageRank, especially for the most important sites. To interpret this finding, we consider each of the fundamental assumptions underlying PageRank and show how each is violated by actual user behavior.
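The comparison described above, between a PageRank ranking and a ranking by observed visits, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the four-host graph, the visit counts, the damping factor of 0.85, and the use of Kendall's tau as the agreement measure. It is not the authors' data or procedure.

```python
from itertools import combinations


def pagerank(links, damping=0.85, iters=100):
    """Power-iteration PageRank on an unweighted directed graph.

    links maps each host to the list of hosts it links to.
    Uniform link following and uniform teleportation are assumed.
    """
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Rank held by hosts with no out-links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            for v in targets:
                new[v] += damping * rank[u] / len(targets)
        rank = new
    return rank


def kendall_tau(x, y):
    """Kendall's tau-a between two score dicts over the same keys (ties count as zero)."""
    concordant = discordant = 0
    for a, b in combinations(sorted(x), 2):
        s = (x[a] - x[b]) * (y[a] - y[b])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical toy host graph and hypothetical visit counts.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
traffic = {"a": 120, "b": 900, "c": 300, "d": 40}

pr = pagerank(links)
tau = kendall_tau(pr, traffic)
print(sorted(pr, key=pr.get, reverse=True))            # hosts ordered by PageRank
print(sorted(traffic, key=traffic.get, reverse=True))  # hosts ordered by visits
print(round(tau, 3))                                   # agreement between the two rankings
```

A tau near 1 would mean the link-based and traffic-based rankings largely agree; values near 0 indicate little agreement. In this toy example host "b" attracts far more visits than its link position suggests, which lowers the correlation.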
Index Terms
- Ranking web sites with real user traffic