Search engines and Web dynamics
Introduction
Search engines have become by far the most popular way of navigating the Web. The evolution of search engines started with the static Web and relatively simple tools such as WWWW [16]. In 1995 AltaVista was launched and brought greater attention to search engines [17]. The marketplace for search engines is still dynamic, and actors like FAST (www.alltheweb.com), Google, Inktomi and AltaVista are still working on different technical solutions and business models to build a viable business, including paid inclusion, paid positioning, advertisements, OEM searching, etc.
Many studies have analyzed the structure and dynamics of the Web itself. They conclude that the Web is still growing at a high pace and that its dynamics are shifting: more and more dynamic and real-time information is made available on the Web. These dynamics create a set of tough challenges for all search engines.
In Section 2 we define a reference model for Internet search engines. In Section 3 we survey some of the existing studies on the dynamics of the Web. Our focus is on the growth of the Web and the update dynamics of individual documents on the Web. In Section 4 we provide an overview of the FAST Crawler and describe how its design meets the challenges of Web growth and update dynamics. We continue in Section 5 with a similar description of the indexing and search engines. Finally, we outline some future challenges and provide some benchmarking figures in Sections 6 and 7, respectively.
The FAST Search Engine technology is used as a case study throughout the paper. The focus of the paper is on how Web dynamics pose key challenges to large-scale Internet search engines and how these challenges can be addressed in a practical, working system. The main contribution of this paper is to offer some insight into how a large-scale, commercially operated Internet search engine is actually designed and implemented.
Section snippets
A search engine reference model
Most practical and commercially operated Internet search engines are based on a centralized architecture that relies on a set of key components, namely Crawler, Indexer and Searcher. This architecture can be seen in systems including WWWW [16], Google [3], and our own FAST Search Engine.
- Definition: Crawler. A crawler is a module aggregating data from the World Wide Web in order to make them searchable. Several heuristics and algorithms exist for crawling; most of them are based upon following
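The three components of the reference model can be illustrated with a small sketch. This is not the FAST implementation: the in-memory `web` dictionary, the `fetch` callable, the breadth-first link-following strategy and the whitespace tokenizer are all assumptions made for illustration.

```python
from collections import defaultdict, deque


def crawl(seed, fetch):
    """Crawler: follow links breadth-first from `seed`.

    `fetch` is a hypothetical callable returning (text, links) for a URL,
    or None if the URL cannot be retrieved. Returns a local store
    mapping URL -> text.
    """
    store = {}
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        if url in store:
            continue  # already aggregated
        page = fetch(url)
        if page is None:
            continue  # unreachable document
        text, links = page
        store[url] = text
        frontier.extend(links)
    return store


def build_index(store):
    """Indexer: build an inverted index mapping term -> set of URLs."""
    index = defaultdict(set)
    for url, text in store.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Searcher: return the URLs containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Usage with a toy two-page Web (page "c" is linked but does not exist).
web = {
    "a": ("search engines crawl", ["b"]),
    "b": ("engines index pages", ["a", "c"]),
}
store = crawl("a", web.get)
index = build_index(store)
hits = search(index, "engines index")
```

In a centralized architecture each stage consumes the output of the previous one, which is why growth and update dynamics on the Web affect all three components.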
The dynamics of the Web
In this section we outline the nature of Web dynamics. We define the different aspects of Web dynamics, and we review the literature on the topic. We do not attempt to provide a complete review of the published studies but rather focus on a number of representative and significant works.
Aggregation of dynamic content
In this section, we use the FAST Crawler as a case study to illustrate how we have addressed the challenges of scaling with the size of the Web and ensuring the freshness of our local store.
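One common way to keep a local store fresh is to revisit documents that change often more frequently than documents that rarely change. The sketch below shows such an adaptive schedule; it is a generic heuristic and not a description of the FAST Crawler. The class name `RefreshScheduler`, the halving and doubling rule, and the interval bounds are all hypothetical.

```python
import heapq


class RefreshScheduler:
    """Revisit documents at intervals that adapt to observed changes."""

    def __init__(self, min_interval=1.0, max_interval=64.0):
        self.min_interval = min_interval
        self.max_interval = max_interval
        self._interval = {}  # URL -> current revisit interval
        self._heap = []      # (due time, URL), earliest due time first

    def add(self, url, now, interval=8.0):
        """Register a document and schedule its first revisit."""
        self._interval[url] = interval
        heapq.heappush(self._heap, (now + interval, url))

    def next_due(self, now):
        """Pop the URL with the earliest due time if it is due, else None."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[1]
        return None

    def report(self, url, now, changed):
        """Record the outcome of a revisit and reschedule the document.

        The interval is halved if the document changed and doubled if it
        did not, clamped to [min_interval, max_interval].
        """
        interval = self._interval[url]
        interval = interval / 2 if changed else interval * 2
        interval = max(self.min_interval, min(self.max_interval, interval))
        self._interval[url] = interval
        heapq.heappush(self._heap, (now + interval, url))


# Usage: a frequently changing page is rescheduled sooner than a static one.
scheduler = RefreshScheduler()
scheduler.add("news", now=0.0)
scheduler.add("archive", now=0.0)
first = scheduler.next_due(8.0)
scheduler.report(first, now=8.0, changed=False)
second = scheduler.next_due(8.0)
scheduler.report(second, now=8.0, changed=True)
```

The design trades crawl capacity against staleness: capacity saved on stable documents is spent on volatile ones.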
Searching dynamic content
The second and third components in the reference search engine model, the Indexer and the Searcher, also need to handle the different dimensions of Web dynamics. Traditionally, search engines have been based upon batch-oriented processes to update and build indices. Most traditional designs fall short in handling the growth in the size of the Web and its update dynamics. In this section, we will study several aspects and solutions for an indexer and a searcher to handle a dynamic Web.
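The contrast with batch-oriented index building can be made concrete with an index that accepts updates for individual documents. The sketch below is a minimal in-memory illustration under assumed names (`IncrementalIndex`, `upsert`, `delete`, `lookup`); it does not reflect how the FAST indexer is implemented.

```python
from collections import defaultdict


class IncrementalIndex:
    """Inverted index that is updated per document, not rebuilt in batch."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of URLs
        self._terms = {}                   # URL -> terms currently indexed

    def _remove(self, url, terms):
        """Drop `url` from the postings of each term, pruning empty lists."""
        for term in terms:
            self._postings[term].discard(url)
            if not self._postings[term]:
                del self._postings[term]

    def upsert(self, url, text):
        """Insert a new document or apply the changes of an updated one."""
        new_terms = set(text.lower().split())
        old_terms = self._terms.get(url, set())
        self._remove(url, old_terms - new_terms)
        for term in new_terms - old_terms:
            self._postings[term].add(url)
        self._terms[url] = new_terms

    def delete(self, url):
        """Remove a document that has disappeared from the Web."""
        self._remove(url, self._terms.pop(url, set()))

    def lookup(self, term):
        """Return the URLs whose current version contains `term`."""
        return set(self._postings.get(term.lower(), set()))


# Usage: only the postings touched by a changed document are modified.
idx = IncrementalIndex()
idx.upsert("a", "web search")
idx.upsert("b", "web crawl")
idx.upsert("a", "fresh search")  # document "a" was updated
```

Only the terms that differ between the old and new versions of a document are touched, so the cost of an update depends on the size of the change rather than the size of the index.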
Future challenges
The evolution of the dynamic Web raises several significant challenges for search engines. First, the increasing dynamics and size of the Web make intelligent scheduling increasingly important. Being able to update the most important parts of an index in a timely manner will be crucial in order to bring relevant search results to the users. Intelligent scheduling, heterogeneous crawling and push technology will be crucial to building aggregation and search systems capable of scaling with the Web at a
Conclusion
We have discussed several dimensions of Web dynamics. Both growth and update dynamics clearly represent big challenges for search engines. We have shown how the problems arise in all components of a reference search engine model.
The FAST Search Engine architecture copes with several of these problems through its key properties. The overall architecture that we have described in this paper is quite simple and does not represent very novel ideas. The system architecture is relatively simple, and this
References (17)
- T. Bray, Measuring the Web, in: Proceedings of the Fifth International World Wide Web Conference (WWW5),...
- B.E. Brewington, G. Cybenko, How dynamic is the Web? in: Proceedings of the Ninth International World Wide Web...
- S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, in: Proceedings of the Seventh...
- The Deep Web: Surfacing Hidden Value. White Paper, Bright Planet,...
- A. Broder et al. Graph structure in the Web, in: Proceedings of the Ninth International World Wide Web Conference...
- J. Cho, H. Garcia-Molina, Synchronizing a database to improve freshness, in: Proceedings of 2000 ACM International...
- J. Cho, H. Garcia-Molina, The evolution of the Web and implications for an incremental crawler, in: Proceedings of the...
- J. Cho, H. Garcia-Molina, Estimating frequency of change,...
Cited by (72)
Syntactic complexity of Web search queries through the lenses of language models, networks and users
2016, Information Processing and Management
Citation Excerpt: Searching information on the World Wide Web by issuing queries to commercial search engines is one of the most common activities engaged in by almost every Web user Jansen and Spink (2006). The Web has grown extensively over the past two decades, and search engines have kept pace by incorporating progressively smarter algorithms to keep all the information at our fingertips (Ntoulas, Cho, & Olston, 2004; Risvik & Michelsen, 2002; Schwartz, 1998). This co-evolution of the Web and search engines have driven users to formulate progressively longer and more complex queries, as seen by a rise in mean lengths from 2.4 through 3.5 to about four words per unique query over the last twelve years (Pass, Chowdhury, & Torgeson, 2006; Saha Roy, Choudhury, & Bali, 2012a; Spink, Wolfram, Jansen, & Saracevic, 2001).
Spec-Crawl: Domain Specific Crawler and Specifications Management for Search Engines
2023, AIP Conference Proceedings
Quality of Web-Based Sickle Cell Disease Resources for Health Care Transition: Website Content Analysis
2023, JMIR Pediatrics and Parenting
Understanding Search Engines
2023, Understanding Search Engines
The Influence of Digital Assistants on Search Engine Strategies: Recommendations for Voice Search Optimization
2022, Smart Innovation, Systems and Technologies
Knut Magne Risvik graduated from the Norwegian University of Science and Technology in 1997. He joined FAST in April 1997 and serves as the Director of search technology. He directs research and development of search technology and has been a key architect behind the FAST Search technology. Mr. Risvik holds two patents and has applied for three other patents. His main fields of interest are search technology, parallel architectures and scalable computing. Mr. Risvik is pursuing a Ph.D. related to search technology while holding his position with FAST.
Rolf Michelsen received his Siv.ing. degree from the Norwegian Institute of Technology (NTH) in 1992. He worked as a researcher at SINTEF specializing in information security in open, distributed systems for five years before he joined Fast Search & Transfer in 1999. At FAST he has been responsible for the data aggregation platform and worked on the overall Internet search architecture. Now he is responsible for overall engineering of the Internet search platform.