Search engines and Web dynamics

doi:10.1016/S1389-1286(02)00213-X

Computer Networks

Volume 39, Issue 3, 21 June 2002, Pages 289-302

Abstract

In this paper we study several dimensions of Web dynamics in the context of large-scale Internet search engines. Both growth and update dynamics clearly represent big challenges for search engines. We show how the problems arise in all components of a reference search engine model.

Furthermore, we use the FAST Search Engine architecture as a case study to show some possible solutions to the problems that Web dynamics pose for search engines. The focus is to demonstrate solutions that work in practice for real systems. The service is running live at www.alltheweb.com and major portals worldwide with more than 30 million queries a day, about 700 million full-text documents, a crawl base of 1.8 billion documents, updated every 11 days, at a rate of 400 documents/second.

We discuss the future evolution of the Web and argue that important issues for search engines will be scheduling and query execution, as well as increasingly heterogeneous architectures to handle the dynamic Web.

Introduction

Search engines have become by far the most popular way of navigating the Web. The evolution of search engines started with the static Web and relatively simple tools such as WWWW [16]. In 1995 AltaVista was launched and brought greater attention to search engines [17]. The marketplace for search engines is still dynamic, and actors like FAST (www.alltheweb.com), Google, Inktomi and AltaVista are still working on different technical solutions and business models to build a viable business, including paid inclusion, paid positioning, advertisements, OEM searching, etc.

A large number of analyses have been made of the structure and dynamics of the Web itself. They conclude that the Web is still growing at a high pace and that its dynamics are shifting: more and more dynamic and real-time information is made available on the Web. These dynamics create a set of tough challenges for all search engines.

In Section 2 we define a reference model for Internet search engines. In Section 3 we survey some of the existing studies on the dynamics of the Web. Our focus is on the growth of the Web and the update dynamics of individual documents on the Web. In Section 4 we provide an overview of the FAST Crawler and describe how its design meets the challenges of Web growth and update dynamics. We continue in Section 5 with a similar description of the indexing and search engines. Finally, we outline some future challenges and provide some benchmarking figures in Sections 6 and 7, respectively.

The FAST Search Engine technology is used as a case study throughout the paper. The focus of the paper is on how Web dynamics pose key challenges to large-scale Internet search engines and how these challenges can be addressed in a practical, working system. The main contribution of this paper is to offer some insight into how a large-scale, commercially operated Internet search engine is actually designed and implemented.

A search engine reference model

Most practical and commercially operated Internet search engines are based on a centralized architecture that relies on a set of key components, namely Crawler, Indexer and Searcher. This architecture can be seen in systems including WWWW [16], Google [3], and our own FAST Search Engine.

• Definition: Crawler. A crawler is a module aggregating data from the World Wide Web in order to make them searchable. Several heuristics and algorithms exist for crawling; most of them are based upon following
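
The three components named above can be illustrated with a toy pipeline. This is a minimal sketch under stated assumptions, not the FAST implementation: the in-memory `TOY_WEB` dictionary, the `.example` URLs, the function names, and the choice of breadth-first link following are all hypothetical and chosen only to show how a Crawler, an Indexer and a Searcher hand data to each other.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "Web": URL -> (text, outgoing links). Not real data.
TOY_WEB = {
    "a.example/": ("search engines index the web", ["a.example/x", "b.example/"]),
    "a.example/x": ("the web keeps growing", ["a.example/"]),
    "b.example/": ("crawlers follow links", []),
}

def crawl(seeds, fetch):
    """Crawler: breadth-first traversal from seed URLs, returning {url: text}."""
    store, queue, seen = {}, deque(seeds), set(seeds)
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:          # unknown or unreachable URL
            continue
        text, links = page
        store[url] = text
        for link in links:
            if link not in seen:  # enqueue each URL at most once
                seen.add(link)
                queue.append(link)
    return store

def build_index(store):
    """Indexer: inverted index mapping each term to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in store.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Searcher: sorted URLs that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)

store = crawl(["a.example/"], TOY_WEB.get)
index = build_index(store)
print(search(index, "the web"))
```

A production system would replace each function with a distributed service; the sketch only fixes the data flow from local store to index to query result.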

The dynamics of the Web

In this section we outline the nature of Web dynamics. We define the different aspects of Web dynamics, and we review the literature on the topic. We do not attempt to provide a complete review of the published studies but rather focus on a number of representative and significant works.

Aggregation of dynamic content

In this section, we use the FAST Crawler as a case study to illustrate how we have addressed the challenges of scaling with the size of the Web and ensuring the freshness of our local store.
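
To make the freshness problem concrete, the sketch below revisits pages in order of when they are next due and adapts each page's revisit interval to whether a change was observed. It is an illustrative heuristic only and is not claimed to be the FAST Crawler's scheduling algorithm; the class name `RefreshScheduler`, the halve/double rule, the interval bounds and the `.example` URLs are all assumptions.

```python
import heapq

class RefreshScheduler:
    """Revisit pages in due-time order; adapt each interval to observed change."""

    def __init__(self, min_interval=1.0, max_interval=64.0):
        self.min_interval = min_interval
        self.max_interval = max_interval
        self.interval = {}  # url -> current revisit interval (arbitrary time units)
        self.heap = []      # min-heap of (due_time, url)

    def add(self, url, now=0.0, interval=8.0):
        """Register a page with an initial revisit interval."""
        self.interval[url] = interval
        heapq.heappush(self.heap, (now + interval, url))

    def next_due(self):
        """Pop the page whose revisit is due soonest, as (due_time, url)."""
        return heapq.heappop(self.heap)

    def report(self, url, now, changed):
        """Halve the interval if the page changed, double it otherwise, then requeue."""
        factor = 0.5 if changed else 2.0
        new = self.interval[url] * factor
        new = min(self.max_interval, max(self.min_interval, new))
        self.interval[url] = new
        heapq.heappush(self.heap, (now + new, url))

sched = RefreshScheduler()
sched.add("news.example/", interval=4.0)
sched.add("archive.example/", interval=4.0)
for _ in range(6):
    due, url = sched.next_due()
    # Assumption for the demo: the news page changes on every visit, the archive never.
    sched.report(url, due, changed=url.startswith("news"))
print(sched.interval)
```

In this toy run the frequently changing page is driven down to the minimum interval while the static page is visited less often, which is the general behaviour a freshness-aware crawler aims for.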

Searching dynamic content

The second and third components in the reference search engine model, the Indexer and the Searcher, also need to handle the different dimensions of Web dynamics. Traditionally, search engines have been based upon batch-oriented processes to update and build indices. Most traditional designs fall short in handling both the growth in size of the Web and its update dynamics. In this section, we will study several aspects and solutions for an indexer and a searcher to handle a dynamic Web.
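
The contrast between batch-oriented index builds and per-document updates can be sketched as follows. This is a minimal in-memory illustration, assuming whitespace tokenization and hypothetical names (`IncrementalIndex`, `upsert`, `batch_build`); it does not describe how the FAST indexer is built.

```python
from collections import defaultdict

class IncrementalIndex:
    """Inverted index that supports per-document add, replace and delete."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.doc_terms = {}               # document id -> terms currently indexed

    def upsert(self, doc_id, text):
        """Index a new document or replace the indexed version of an existing one."""
        new_terms = set(text.lower().split())
        old_terms = self.doc_terms.get(doc_id, set())
        for term in old_terms - new_terms:   # terms that disappeared from the document
            self._discard(term, doc_id)
        for term in new_terms - old_terms:   # terms that are new in the document
            self.postings[term].add(doc_id)
        self.doc_terms[doc_id] = new_terms

    def delete(self, doc_id):
        """Remove a document and all of its postings."""
        for term in self.doc_terms.pop(doc_id, set()):
            self._discard(term, doc_id)

    def _discard(self, term, doc_id):
        self.postings[term].discard(doc_id)
        if not self.postings[term]:          # drop empty posting lists
            del self.postings[term]

    def lookup(self, term):
        return sorted(self.postings.get(term.lower(), set()))

def batch_build(docs):
    """Batch-style rebuild: construct a fresh index from the full document set."""
    index = IncrementalIndex()
    for doc_id, text in docs.items():
        index.upsert(doc_id, text)
    return index

index = batch_build({"d1": "web search", "d2": "web crawler"})
index.upsert("d2", "fresh crawler")  # the document changed on the Web
index.delete("d1")                   # the document disappeared from the Web
print(index.lookup("crawler"), index.lookup("web"))
```

The incremental path touches only the postings of the changed document, whereas a batch rebuild reprocesses every document; both should arrive at the same index for the same final document set.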

Future challenges

The evolution of the dynamic Web raises several significant challenges for search engines. First, the increasing dynamics and size of the Web make intelligent scheduling increasingly important. Being able to update the most important parts of an index in a timely manner will be crucial in order to bring relevant search results to the users. Intelligent scheduling, heterogeneous crawling and push technology will be crucial to building aggregation and search systems capable of scaling with the Web at a

Conclusion

We have discussed several dimensions of Web dynamics. Both growth and update dynamics clearly represent big challenges for search engines. We have shown how the problems arise in all components of a reference search engine model.

The FAST Search Engine architecture copes with several of these problems through its key properties. The overall architecture that we have described in this paper is quite simple and does not represent very novel ideas. The system architecture is relatively simple, and this

Knut Magne Risvik graduated from the Norwegian University of Science and Technology in 1997. He joined FAST in April 1997 and serves as the Director of Search Technology. He directs research and development of search technology and has been a key architect behind the FAST Search technology. Mr. Risvik holds two patents and has applied for three other patents. His main fields of interest are search technology, parallel architectures and scalable computing. Mr. Risvik is pursuing a Ph.D.

References (17)

T. Bray, Measuring the Web, in: Proceedings of the Fifth International World Wide Web Conference (WWW5),...
B.E. Brewington, G. Cybenko, How dynamic is the Web? in: Proceedings of the Ninth International World Wide Web...
S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, in: Proceedings of the Seventh...
The Deep Web: Surfacing Hidden Value. White Paper, Bright Planet,...
A. Broder et al. Graph structure in the Web, in: Proceedings of the Ninth International World Wide Web Conference...
J. Cho, H. Garcia-Molina, Synchronizing a database to improve freshness, in: Proceedings of 2000 ACM International...
J. Cho, H. Garcia-Molina, The evolution of the Web and implications for an incremental crawler, in: Proceedings of the...
J. Cho, H. Garcia-Molina, Estimating frequency of change,...

There are more references available in the full text version of this article.

Rolf Michelsen received his Siv.ing. degree from the Norwegian Institute of Technology (NTH) in 1992. He worked for five years as a researcher at SINTEF, specializing in information security in open, distributed systems, before he joined Fast Search & Transfer in 1999. At FAST he has been responsible for the data aggregation platform and worked on the overall Internet search architecture. He is now responsible for the overall engineering of the Internet search platform.
