A short walk in the Blogistan
Introduction
The word blog is short for the neologism “weblog”, which is often a personal journal maintained on the Web. Blogs have grown rapidly in the last 2 years as a new communication mechanism among aficionados who avidly follow the opinions, stories, and observations posted on them. Blogs are similar in spirit to Usenet newsgroups, except that each blog typically presents a single person’s view; some blogs allow comments and a few are shared among multiple authors. Often, a blog is one long Web page, partitioned into archives, with links to other URLs on the Web. In this sense, it is no different from a user’s “home page”. In practice, however, blogs have turned out to be writings about a variety of topics, typically updated far more regularly than home pages. Unlike home pages, which are often maintained on individual sites owned by users, many popular blogs reside on content hosting sites that provide space and software to maintain blogs and that generate indices, reverse pointer collections, etc. Blogs are basically large queues (the term blogroll is used within the community), with additions appearing at the top of the page and older material scrolling down. Unlike on a moderated site, additions to a blog are immediately available to anyone accessing the URL of the blog.
A typical blog consists of some text paragraphs, often with embedded links (either internal links to another section of the same blog or external links), occasionally a few images, pointers to older sections of the same blog, and (in some cases) a set of reverse pointers to the blog itself made in other blogs. Many of the paragraphs (or blog sections) include a link: a paragraph-specific URL that others can use to refer to that paragraph in their own blogs. While typical Web pages have a single point of entry (the URL), blogs have multiple locations of interest (the various paragraphs), and thus the link to a specific paragraph has value.
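As an illustration of paragraph-specific links, the sketch below separates a link into the page it points to and the fragment that names one entry on that page. It assumes that entries are addressed with URL fragments, which is only one common convention; the blog address and entry name are hypothetical and do not come from the study.

```python
from urllib.parse import urldefrag


def split_entry_link(url):
    """Return (page_url, entry_id) for a link to one entry of a blog.

    Assumption: the entry is identified by the URL fragment. A link
    without a fragment yields an empty entry_id.
    """
    page_url, entry_id = urldefrag(url)
    return page_url, entry_id


# Hypothetical link to a single entry in a hypothetical blog archive.
page, entry = split_entry_link(
    "http://blog.example.org/archive/2003_10.html#entry42"
)

# A link to the top-level page carries no entry identifier.
top_page, no_entry = split_entry_link("http://blog.example.org/")
```

Grouping incoming links by `page` while keeping `entry` lets an analysis count references to a blog as a whole and to its individual sections separately.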
Blogs are the fastest growing section of the World Wide Web in the last 2 years [1] and are emerging as an important communication mechanism that is used by an increasing number of people. Although blogs began appearing several years ago, they did not cross over to widespread popularity until 2000. By several estimates there are hundreds of thousands of blogs and, as one might expect, their popularity and update frequency follow a Zipfian distribution. Much like popular Web sites, blogs that are updated more regularly tend to be more popular.
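To make the Zipfian claim concrete, the following sketch computes the share of attention each blog would receive if popularity were exactly proportional to 1/rank^s. The exponent `s = 1.0` and the population of 1000 blogs are illustrative assumptions, not values measured in the study.

```python
def zipf_shares(n, s=1.0):
    """Return the popularity share of ranks 1..n under a Zipf law.

    The blog at rank r receives weight 1 / r**s; the weights are
    normalized so that the shares sum to 1.
    """
    weights = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative population size, not a figure from the study.
shares = zipf_shares(1000)

# Combined share of the ten most popular blogs.
top_ten_share = sum(shares[:10])
```

Under these assumptions a small head of the ranking holds a disproportionate share of the total, which is the pattern the text refers to.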
Blogs are a distinct component of the Web from the viewpoint of content. Several blogs represent a small community of authors, i.e., content creators, and some communities (such as political blogs) have a wide readership. The political blog talkingpointsmemo.com claims to have more than three hundred thousand unique readers in a month. The content creators routinely monitor related blogs and add links to related items on those blogs. This is done when an item is in concert with the views expressed, when contrary views are discussed, or simply because the item is relevant. This is one of the key differences between ordinary Web pages and a blog: the constant updating of content as well as of links to other sites that are themselves changing.
There are several applications that can benefit from a characterization of blogs, and we will discuss a few in this paper. Blogs offer a window into what many individual readers find interesting, especially when new issues emerge. They have the potential to provide an early warning of hotspots and flash crowds. The Web site slashdot.org is an early example of a blog with significant impact on the future (and often short-lived) popularity of a particular Web site or Web page. Prior to popular blogs, the source for hot news items was often the news Web sites. Unlike news sites, most popular blogs (with a few exceptions) are edited by a single individual. Many blogs allow anyone to comment on the contents. In fact, the multiplicity of comments and additions of links to a new issue can be an early indicator of its rising popularity. A key distinction of blogs is that they achieve the original goal of making the Web a two- or multi-way medium rather than following the widely prevalent “write once read many” model. Unlike news sites, popular blogs have a large in-degree, especially when one considers that links are not to the top-level URL but to a specific section of the blog. Blogs also offer interesting new collaborative filtering applications. Authors of blogs may be interested in finding new stories that are related to stories they were previously interested in, blogs that are related to their blog, or ones that have commented on or linked to their blog. The last item is partly expressed through the publication of referrer links, a common blog phenomenon.
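The idea that a growing number of links to a new item signals rising popularity can be sketched as follows. This is a minimal illustration, not the method used in the paper: it assumes two snapshots of a blog collection, each a mapping from a blog to the set of URLs it links to, and flags targets that gained links from several additional blogs. The function names, the threshold `min_new_blogs`, and the sample data are all hypothetical.

```python
from collections import defaultdict


def linking_blogs(snapshot):
    """Map each target URL to the set of blogs linking to it."""
    sources = defaultdict(set)
    for blog, links in snapshot.items():
        for target in links:
            sources[target].add(blog)
    return sources


def rising_targets(earlier, later, min_new_blogs=2):
    """Return targets that gained links from at least min_new_blogs blogs.

    The result maps each such target to the number of blogs that link
    to it in the later snapshot but did not in the earlier one.
    """
    before = linking_blogs(earlier)
    after = linking_blogs(later)
    rising = {}
    for target, blogs in after.items():
        gained = blogs - before.get(target, set())
        if len(gained) >= min_new_blogs:
            rising[target] = len(gained)
    return rising


# Hypothetical snapshots of three blogs taken at two points in time.
earlier = {
    "blog-a": {"http://news.example.com/story1"},
    "blog-b": set(),
    "blog-c": set(),
}
later = {
    "blog-a": {"http://news.example.com/story1", "http://news.example.com/story2"},
    "blog-b": {"http://news.example.com/story2"},
    "blog-c": {"http://news.example.com/story2"},
}

emerging = rising_targets(earlier, later)
```

Counting distinct linking blogs rather than raw links keeps one prolific author from dominating the signal; whether that is the right choice depends on the data set.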
A blog can be a Web page or a site depending on how popular it is, where it is hosted, etc. Although blogs change slowly, they are dynamic sites and represent the middle portion of the continuum between largely static sites (which are the vast majority on the Web) and the truly dynamic sites (sites that change regularly such as news sites). Search engines have distinguished between mostly static sites (home pages), dynamic sites (news, etc.), and truly dynamic sites (pages generated anew upon each visit, often ignored by search engines). The crawling, indexing, and search return phases of a search engine have been adapted accordingly. Blogs create interesting new problems and opportunities in this regard.
We refer to the blog space, the collection of blogs, as the Blogistan. Our contribution is threefold. First, we explore how emerging interests and patterns can be extracted by tracking a seed collection of blogs that have been modified fairly recently. By doing so we develop a methodology to identify emerging patterns on general data sets that comprise evolving communication networks. Second, we examine the size and nature of the blogistan based on a recent collection of blogs. Finally, we present a collection of inferences and observations based on our study on identifying blogs, the growing spam problem in blogs, and how blog sites are accessed.
The rest of this paper is organized as follows: Section 2 characterizes blogs and discusses how they qualitatively differ from “traditional” Web sites. Section 3 describes the mechanics of our study and some key statistics related to it. Section 4 presents the analysis of the seed blog URL collection fetched repeatedly over a five-week period in the autumn of 2003. We mine this data to identify emerging interests and patterns. Section 5 presents a walk through a large connected portion of the blogistan reached from our seed set. We examine the domain distribution of blog hosting sites and issues involving the HTTP protocol and blogs. Section 6 discusses inferences gleaned from our study with a preliminary analysis of Web server logs of a couple of very popular blog sites. We conclude with a look at work in progress on continued data gathering and analysis.
Section snippets
Differences between Web sites and blogs
There are several key differences between regular Web sites and blogs. Chief among them is that a blog is often a single page site; i.e., there are several related pages to the blog but found in archives and accessible from the main entry point page. The nature, number, and quality of links from a blog are quite different from ordinary Web pages. The primary reason for this is that blogs are often written to be read by many people, some of whom correspond with the blog authors to point out
Data gathering
We wanted to get a reasonable collection of Web logs to perform a characterization and measurement study. Since the Web has been around for over a decade, there are several sites that rate the popularity of Web sites. One reason for this is economic: Web sites used their ratings for computing rates for advertisement. Popular blogs have advertisement charges directly pegged to the number of unique visitors and the number of page views in a month [5]. The duration of popularity metric has allowed for
Seed collection analysis
In this section, we show how emerging interests and patterns can be identified by tracking our seed collection of the 8679 recently-changed Weblogs. We detect newly referenced URLs and study their emergence patterns. We also investigate to what extent standard tools, in particular hyperlink-based methods [12], [13], can be used to mine emerging new references from blogs. The first issue is the rate of change [14], [15] of blogs with respect to regular pages; most blogs do not change frequently. To
The blogistan
Moving beyond our seed collection, we wanted to examine a larger fraction of the available set of blogs, the blogistan. A few studies have reported the size of the blogistan to range from 1 million to over 4 million blogs. A study [19] (based on a collection of about 3500 blogs) reported that a significant fraction of blogs, about two thirds, had not been updated in two months. However, this study only used blogs maintained on certain blog-hosting services. Of the 375 popular blogs in our seed set
Inferences from our analysis
Thus far, we have presented the results of our study, which comprised analyzing a seed collection of blogs, crawling to get a large fraction of the blogistan, and the subsequent protocol-level analysis. Below we present observations gleaned from our study. The study can help provide a clear idea of what a blog looks like. Automatic identification of a Web page as a blog can help in modifying crawler algorithms, smarter indexing, and distilling search results. Identifying sub-communities of
Conclusion and ongoing work
Given the popularity of blogs in popular culture [32] and their rapid growth as a distinctive part of the Web, it is natural to examine this phenomenon. Blogs provide a multi-way communication paradigm on the Web that typical Web pages do not. The rate of change of blogs is quite different from traditional Web pages and the nature and count of links between blogs and other Web pages are quite distinct. The rich content reflecting the human authoring and the steady updating indicates continuing
Edith Cohen is a researcher at AT&T Labs-Research. She did her undergraduate and M.Sc. studies at Tel-Aviv University, and received a Ph.D in Computer Science from Stanford University in 1991. She joined Bell Laboratories in 1991 (now AT&T Labs). During 1997, she was in UC Berkeley as a visiting professor. Her research interests include design and analysis of algorithms, combinatorial optimization, Web performance, networking, and data mining.
References (32)
- Size-estimation framework with applications to transitive closure and reachability, Journal of Computer and System Sciences (1997)
- R. Blood, weblogs: a history and perspective. Available from:...
- D. Barry, The (Un)official Dave Barry Blog. Available from:...
- J.C. Mogul, F. Douglis, A. Feldmann, B. Krishnamurthy, Potential benefits of delta encoding and data compression for...
- J. Mogul, B. Krishnamurthy, F. Douglis, A. Feldmann, Y. Goland, A. van Hoff, D. Hellerstein, Delta encoding in HTTP,...
- Blogads for opinion makers. Available from:...
- Salon radio community server. Available from:...
- Top 100 Technorati. Available from:...
- Most watched blogs. Available from:...
- The blogosphere power rankings—the most popular political blogs on the net. Available from:...
- Manasse Wiener, a large-scale study of the evolution of web pages, Software Practice and Experience
- Authoritative sources in a hyperlinked environment, Journal of the ACM
Balachander Krishnamurthy is a member of technical staff at AT&T Labs–Research.