Abstract
Using data sampled from top-level Web pages across five high-level domains and from sample pages within individual websites, the authors investigate the frequency distribution of outlinks in Web pages. The observed distributions were fitted to different theoretical distributions to determine the best-fitting model for representing outlink frequency across Web pages. Theoretical models tested include the modified power law (MPL), Mandelbrot (MDB), generalized Waring (GW), generalized inverse Gaussian-Poisson (GIGP), and generalized negative binomial (GNB) distributions. The GIGP and GNB provided good fits for the data sets of top-level pages across the high-level domains tested, with the GIGP performing slightly better. The lumpiness and bimodal nature of two of the observed outlink distributions from Web pages within a given website resulted in poor fits of the theoretical models. The GIGP was able to provide better fits to these data sets after the top components were truncated. The ability to effectively model Web page attributes, such as the distribution of the number of outlinks per page, paves the way for simulation models of Web page structural content, and makes it possible to estimate the number of outlinks that may be encountered within Web pages of a specific domain or within individual websites.
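The fitting procedure described above can be illustrated with a minimal sketch. It does not reproduce the authors' method or any of the five models tested. As a stand-in it fits a simple truncated discrete power law, p(k) = k^(-alpha) / Z(alpha) for k = 1..k_max, by maximum likelihood, then computes a Pearson chi-square statistic as a rough measure of fit. The function names, the grid range for alpha, and the sample data are all assumptions made for illustration.

```python
import math
from collections import Counter


def fit_truncated_power_law(outlinks, k_max=None):
    """Grid-search MLE of alpha for p(k) = k**-alpha / Z, k = 1..k_max.

    Illustrative stand-in model only; pages with zero outlinks are ignored
    because the power law is undefined at k = 0.
    """
    data = [k for k in outlinks if k >= 1]
    if k_max is None:
        k_max = max(data)
    n = len(data)
    sum_log = sum(math.log(k) for k in data)
    log_support = [math.log(k) for k in range(1, k_max + 1)]

    def neg_log_likelihood(alpha):
        # -LL = alpha * sum(log k_i) + n * log Z(alpha)
        z = sum(math.exp(-alpha * lk) for lk in log_support)
        return alpha * sum_log + n * math.log(z)

    # Assumed search range: alpha in [0.01, 5.00] in steps of 0.01.
    alpha_hat = min((a / 100 for a in range(1, 501)), key=neg_log_likelihood)
    return alpha_hat, k_max


def chi_square_statistic(outlinks, alpha, k_max):
    """Pearson chi-square between observed and expected frequencies.

    Cells with small expected counts are not pooled here; a careful
    analysis would pool them before interpreting the statistic.
    """
    observed = Counter(k for k in outlinks if 1 <= k <= k_max)
    n = sum(observed.values())
    z = sum(k ** -alpha for k in range(1, k_max + 1))
    stat = 0.0
    for k in range(1, k_max + 1):
        expected = n * k ** -alpha / z
        stat += (observed.get(k, 0) - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    # Hypothetical outlink counts per page, not data from the study.
    sample = [1] * 60 + [2] * 16 + [3] * 7 + [4] * 4 + [5] * 2 + [8] * 1
    alpha_hat, k_max = fit_truncated_power_law(sample)
    print("alpha:", alpha_hat)
    print("chi-square:", chi_square_statistic(sample, alpha_hat, k_max))
```

A grid search is used instead of a numerical optimizer so that the sketch needs only the standard library. Comparing several candidate models, as the study does, would mean repeating the fit and the goodness-of-fit calculation for each model and taking the number of estimated parameters into account.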
Cite this article
Ajiferuke, I., Wolfram, D. Modelling the characteristics of Web page outlinks. Scientometrics 59, 43–62 (2004). https://doi.org/10.1023/B:SCIE.0000013298.22207.2b