Modelling the characteristics of Web page outlinks

Scientometrics

Abstract

Using data sampled from top-level Web pages across five high-level domains and from sample pages within individual websites, the authors investigate the frequency distribution of outlinks in Web pages. Several theoretical distributions were fitted to the observed distributions to determine which model best represents outlink frequency across Web pages. The models tested were the modified power law (MPL), Mandelbrot (MDB), generalized Waring (GW), generalized inverse Gaussian-Poisson (GIGP), and generalized negative binomial (GNB) distributions. The GIGP and GNB provided good fits to the data sets for top-level pages across the high-level domains tested, with the GIGP performing slightly better. Two of the observed outlink distributions from Web pages within a given website were lumpy and bimodal, and the theoretical models fitted them poorly. The GIGP provided better fits to these data sets after the top components were truncated. The ability to model Web page attributes effectively, such as the distribution of the number of outlinks per page, paves the way for simulation models of Web page structural content, and makes it possible to estimate the number of outlinks that may be encountered within Web pages of a specific domain or within individual websites.
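The general workflow described in the abstract (fit a candidate discrete distribution to observed outlink counts, then judge the fit) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the ordinary negative binomial with method-of-moments estimates as a simple stand-in, not the GNB or GIGP models from the article; the `outlinks` list is made-up data, not the authors' sample; and the function names are hypothetical.

```python
import math
from collections import Counter


def nb_pmf(k, r, p):
    """Ordinary negative binomial pmf P(X = k), k = 0, 1, 2, ..., real r > 0."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)


def fit_nb_moments(counts):
    """Method-of-moments estimates (r, p); needs sample variance > sample mean."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    if var <= mean:
        raise ValueError("data are not overdispersed; moments fit is undefined")
    p = mean / var               # from mean = r(1-p)/p and var = r(1-p)/p^2
    r = mean * p / (1 - p)
    return r, p


def chi_square(counts, r, p, max_k):
    """Pearson chi-square over classes 0..max_k-1 plus one pooled tail class."""
    n = len(counts)
    observed = Counter(min(c, max_k) for c in counts)
    probs = [nb_pmf(k, r, p) for k in range(max_k)]
    probs.append(1 - sum(probs))  # probability of the pooled tail (>= max_k)
    return sum((observed.get(k, 0) - n * q) ** 2 / (n * q)
               for k, q in enumerate(probs))


# Made-up outlink counts per page, for illustration only (not the article's data).
outlinks = ([0] * 12 + [1] * 15 + [2] * 14 + [3] * 11 + [5] * 9
            + [8] * 7 + [13] * 5 + [21] * 3 + [40] * 2 + [75] * 1)

r, p = fit_nb_moments(outlinks)
stat = chi_square(outlinks, r, p, max_k=10)
print(f"r = {r:.3f}, p = {p:.3f}, chi-square = {stat:.2f}")
```

A smaller chi-square statistic indicates closer agreement between observed and expected frequencies. Comparing such statistics across candidate models is one common way to rank fits, although the article's own estimation and testing procedure may differ from this sketch.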

Cite this article

Ajiferuke, I., Wolfram, D. Modelling the characteristics of Web page outlinks. Scientometrics 59, 43–62 (2004). https://doi.org/10.1023/B:SCIE.0000013298.22207.2b
