Characterization of national Web domains

Published: 01 May 2007

Abstract

During the last few years, several studies characterizing the public Web space of various national domains have been published. The pages of a country are an interesting set for studying the characteristics of the Web because they are at once diverse (as they are written by several authors) and yet rather similar (as they share a common geographical, historical, and cultural context).
This article discusses the methodologies used for presenting the results of Web characterization studies, including the granularity at which different aspects are presented, and a separation of concerns between contents, links, and technologies. Based on this, we present a side-by-side comparison of the results of 12 Web characterization studies, comprising over 120 million pages from 24 countries. The comparison unveils similarities and differences between the collections and sheds light on how certain results of a single Web characterization study on a sample may be valid in the context of the full Web.


Published In

ACM Transactions on Internet Technology  Volume 7, Issue 2
May 2007
152 pages
ISSN:1533-5399
EISSN:1557-6051
DOI:10.1145/1239971

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Web characterization
  2. Web measurement
