Elsevier

Computer Networks

Volume 50, Issue 10, 14 July 2006, Pages 1464-1473
Dynamics of the Chilean Web structure

https://doi.org/10.1016/j.comnet.2005.10.017

Abstract

In this paper we present a large-scale study of the evolution of the Web structure of the Chilean domain (.cl) from 2000 to 2004, focusing on the transitions of Web sites within the structure. It covers the longest time span and is the most detailed study of its kind. Our results show that there are many stable Web sites, but also a majority of chaotic changes. We also present the first known results on the death behavior of Web sites.

Introduction

The Web is highly dynamic and little is known about its evolution. There has been some work on page evolution, yielding models that predict when a page will change, but this differs considerably from site to site. There are also generative models for Web growth, but they usually do not include Web death (an exception is [5]).

In this study we focus on the Web site graph or host-graph. Web sites are better study subjects than Web pages for several reasons. First, a Web site is usually a logical information unit, which is less true of pages. Second, the main events in the evolution of the Web are related to sites. In fact, new Web sites appear and others disappear, but little is known about how this happens. Third, most external links in a site point to home pages, so the Web structure of sites is the glue of Web connectivity. Fourth, most sites are strongly connected (it is enough to have a link to the home page in every page). Otherwise, a Web site would have pages in more than one component of the structure, which does not make sense, as a Web site should be atomic with respect to the overall structure (see similar and additional arguments in [6]).
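As an illustration of the host-graph idea, the following minimal Python sketch collapses page-level links into site-level links. It is not the crawler or tooling used in the study; the function name host_graph and the example.cl URLs are hypothetical, and identifying a site with the URL hostname is an assumption made here for simplicity.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_graph(page_links):
    """Collapse (source_url, target_url) page links into a site-level graph.

    Assumption: a "site" is identified by the hostname of the URL.
    Links between pages of the same site are dropped.
    """
    graph = defaultdict(set)
    for src, dst in page_links:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        if src_host is None or dst_host is None:
            continue  # skip malformed or relative URLs
        graph[src_host]  # make sure both sites exist as nodes
        graph[dst_host]
        if src_host != dst_host:
            graph[src_host].add(dst_host)
    return {host: sorted(targets) for host, targets in graph.items()}


# Hypothetical toy data, not taken from the crawls.
links = [
    ("http://a.example.cl/index.html", "http://a.example.cl/about.html"),
    ("http://a.example.cl/about.html", "http://b.example.cl/"),
    ("http://b.example.cl/news.html", "http://a.example.cl/"),
    ("http://b.example.cl/", "http://c.example.cl/"),
]
site_graph = host_graph(links)
```

In this toy input the internal link within a.example.cl disappears, while the links between distinct sites survive as edges of the host-graph.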

The only paper that focuses on the dynamics of the host-graph is [6], but it does not study the structure of the host-graph. In [3] we presented the evolution of the structure composition of the Chilean Web at the site and domain level, based on data gathered from a search engine targeted to this country’s Internet domain, TodoCL.cl, between the years 2000 and 2002. We extended our results and their analysis to 2003 in [4]. In this paper we include data from 2004, extending our previous results and visualizations. We focus not only on macro statistics, but also on the transitions of Web sites among different structural components. That is, we try to answer the following question: are the size changes in the Web structural components due to a small number of sites going from one component to another in one direction, or to a larger number of sites that go in both directions? Our results show that for some Web components the first is true, while for others the second is true.

We define the Chilean Web as all .cl sites, which in practice represent more than 98% of the sites (non-.cl sites hosted in Chile are estimated to number less than 1000). In the first year the crawl started from an initial sample of sites, but in subsequent years it started with all .cl domains thanks to NIC Chile (www.nic.cl). Hence, the number of unconnected sites was low in the first year. Also, the last three crawls contain more dynamic pages, which in general do not change the Web structure. In addition, the last two crawls, although larger in pages compared to 2002, may not reflect actual growth in the Chilean Web, as the number of sites did not increase that much. Table 1 shows the data gathered for our study. Although our results depend on our crawling policies, we have always used the same crawler, changing only the seed URLs. Naturally, our seed set is larger each year.

Our results show how the structure evolves, how sites migrate from one component to another, and where sites appear and disappear in the structure. The changes are dramatic, showing more chaos than order, and we elaborate on this in the conclusions. This is a first step to measure and follow the evolution of the structure of a part of the Web, as well as to try to understand the process behind the changes. To the best of our knowledge there are no other studies on Web structure composition as detailed as ours, both in results and time span. Most statistical studies deal with global attributes such as language or size.

In Section 2 we review the results on the structure of the Web and the problems faced to obtain it. Section 3 shows the evolution of this structure, and Section 4 analyzes the migrations of Web sites in the structure in relation to the expected typical life cycle of a Web site. In Section 5 we analyze the dynamics of the size of Web sites. The last section contains our concluding remarks.

Web structure

The most complete (and only) study of the Web structure [7] focuses on page connectivity. One problem with this is that a page is not a logical unit (for example, a page can describe several documents and one document can be stored in several pages). Hence, we started by studying the structure of how Web sites were connected, as Web sites are closer to being real logical units. Not surprisingly, we found in [1] that the structure at the Web site level was similar to that of the global Web, and

Evolution of the structure composition

Table 2 shows the number of sites that have appeared and disappeared from year to year, from a total of 78,477 different sites belonging to 69,073 domains, crawled at some point. As of April 6, 2005, there were 119,408 registered domains in .cl, with 94,348 having a DNS server. Hence, in the worst case our data covers 73% of all domains in .cl. However, we estimate that the coverage is over 80%. The last three rows represent the new sites (NEW), the sites that were not crawled but exist

Analysis of Website migration

In this section, we analyze how sites migrate in the structure. If in one year a site S is in component A and the next year it is found in component B (B ≠ A), we say that S migrated from A to B (a state transition in the structure). In Table 4 we show the sorted percentage of aggregated transitions for all the years.
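The definition of a migration can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: each yearly crawl is assumed to be available as a mapping from site to component label, the function names transitions and net_and_gross are hypothetical, and the toy snapshots are invented rather than taken from Table 4 or Appendix A. Comparing net and gross flow between two components is one way to tell one-directional movement from movement in both directions.

```python
from collections import Counter


def transitions(prev, curr):
    """Count migrations A -> B (B != A) between two consecutive snapshots.

    prev, curr: dict mapping site -> component label.
    Sites missing from curr are not counted as migrations here.
    """
    counts = Counter()
    for site, a in prev.items():
        b = curr.get(site)
        if b is not None and b != a:
            counts[(a, b)] += 1
    return counts


def net_and_gross(counts, a, b):
    """Return (net flow from a to b, total sites moving between a and b)."""
    forward = counts[(a, b)]
    backward = counts[(b, a)]
    return forward - backward, forward + backward


# Hypothetical toy snapshots for two consecutive years.
year1 = {"s1": "ISLANDS", "s2": "ISLANDS", "s3": "OUT", "s4": "OUT", "s5": "OUT"}
year2 = {"s1": "OUT", "s2": "ISLANDS", "s3": "ISLANDS", "s4": "OUT", "s6": "ISLANDS"}

counts = transitions(year1, year2)
net, gross = net_and_gross(counts, "ISLANDS", "OUT")
```

In the toy data one site moves in each direction, so the net flow is zero even though two sites changed component; a size comparison alone would hide both moves.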

In Appendix A we give the absolute numbers for the migration of sites per year among all the components. In most cases the UNKNOWN component sites will belong to ISLANDS or OUT, although

Web size dynamics

Another issue is the dynamics of the sites’ contents, which is far more difficult and complex. A first estimate is to look at the changes in the number of pages. For example, the 100 largest sites (in pages) per year involve 408 sites over all years (so there are many changes in page size), and only 10 and 72 sites were in the top for 3 and 2 years, respectively. Fig. 7 shows the number of pages of the 10 largest Web sites per year from 2000 to 2003 (in total 39 different Web sites).
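The kind of count described above (how many distinct sites enter the yearly top list, and for how many years each one stays) can be sketched as follows. This is a hedged illustration only: the function name years_in_top, the data layout, and the toy page counts are assumptions, not the study's data or code.

```python
from collections import Counter


def years_in_top(pages_by_year, k):
    """Count, for each site, in how many years it is among the k largest.

    pages_by_year: dict mapping year -> dict mapping site -> page count.
    Ties are broken by site name so the result is deterministic.
    """
    appearances = Counter()
    for sizes in pages_by_year.values():
        top = sorted(sizes, key=lambda s: (-sizes[s], s))[:k]
        appearances.update(top)
    return appearances


# Hypothetical toy page counts, not from the crawls.
pages_by_year = {
    2000: {"a": 50, "b": 40, "c": 10},
    2001: {"a": 5, "b": 60, "d": 70},
    2002: {"b": 30, "d": 20, "e": 90},
}

appearances = years_in_top(pages_by_year, k=2)
distinct_sites = len(appearances)              # sites ever in a top list
persistence = Counter(appearances.values())    # years in top -> number of sites
```

With k=2 and three years, a fully stable ranking would involve only two distinct sites; the more distinct sites appear, the more turnover there is among the largest sites.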

Concluding remarks

The Web is very young and in Chile the first Web site appeared at the end of 1993 in our CS department. As we have data for five years, our study covers more than 40% of the main part of the lifetime of the Chilean Web.

The overall number of sites of the Chilean Web almost doubles each year, as we believe that the last year did not reflect the actual growth, mainly due to the prevalence of dynamic pages. This growth is the result of about a 100% increase plus a 20% death. So, one might use a

Acknowledgments

We thank Edgardo Krell and Sebastian Castro from NIC Chile for their help in providing the .CL domain data, and acknowledge the support of Millennium Nucleus Grant P04-067-F from Mideplan, Chile.

References (7)

  • R. Baeza-Yates et al.

    Relating Web characteristics with link analysis

  • R. Baeza-Yates et al.

    Web dynamics, structure, and link ranking

  • R. Baeza-Yates et al.

    Evolution of the Chilean Web structure composition

There are more references available in the full text version of this article.

Ricardo Baeza-Yates received his Ph.D. in CS from the University of Waterloo, Canada, in 1989. In 1992, he was elected president of the Chilean Computer Science Society (SCCC), serving until 1995, and was elected again in 1997. In 1993, he received the Organization of American States award for young researchers in exact sciences. In 1997, with two Brazilian colleagues, he obtained the COMPAQ prize for the best Brazilian research article in CS. He was international coordinator of CYTED (Iberoamerican cooperation in S&T) on applied electronics and informatics from 2000 to 2004. During 2002–2004, he was a member of the Board of Governors of the IEEE Computer Society. In 2003, he was inducted into the Chilean Science Academy, being the first computer scientist to achieve that status. Currently he is professor and director of the Center for Web Research at the CS department of the University of Chile, where he was the chairperson in the periods 1993–1995 and 2003–2004. He is also ICREA Professor at the Department of Technology of the Pompeu Fabra University in Barcelona, Spain. His research interests include information retrieval, algorithms, and information visualization. He is co-author of the book Modern Information Retrieval (Addison-Wesley, 1999), as well as co-author of the second edition of the Handbook of Algorithms and Data Structures (Addison-Wesley, 1991), and co-editor of Information Retrieval: Algorithms and Data Structures (Prentice-Hall, 1992), among other publications in journals published by ACM, IEEE or SIAM. He has been visiting professor or invited speaker at several conferences and universities all around the world, as well as referee for several journals, conferences, NSF, etc. He is a member of the ACM, EATCS, IEEE (senior), SCCC (distinguished) and SIAM.

Barbara Poblete is currently a second-year Ph.D. student at the University Pompeu Fabra (UPF) in Barcelona, Spain. She obtained a B.Sc. and M.Sc. in Computer Science and a Computing Engineering professional degree from the University of Chile in Santiago, Chile. She is a member of the Web Research Group at the UPF, and administrator of the Chilean vertical search engine TodoCL (http://www.todocl.cl). She obtained second place in the XII Latin American Master’s Thesis Contest in 2005. Her current research interests are Web mining, Information Retrieval and Web dynamics.
