Journal of Systems and Software

Volume 60, Issue 3, 15 February 2002, Pages 185-193

Architectural design and evaluation of an efficient Web-crawling system

https://doi.org/10.1016/S0164-1212(01)00091-7

Abstract

This paper presents the architectural design and evaluation results of an efficient Web-crawling system. The design involves a fully distributed architecture, a URL allocating algorithm, and a method for ensuring system scalability and dynamic reconfigurability. Simulation experiments show that load balance, scalability, and efficiency can be achieved in the system. This distributed Web-crawling subsystem has been successfully integrated with WebGather, a well-known Chinese and English Web search engine, with the aim of collecting all the Web pages in China and keeping pace with the rapid growth of Chinese Web information. In addition, we believe the design can also be useful in other contexts, such as digital libraries.

Introduction

During the short history of the World Wide Web (the Web), Internet resources have grown day by day and the number of home pages has increased rapidly. How can users quickly and accurately find what they need on the Web? The search engine is a useful tool for this purpose, and it is becoming more and more important. The number of indexed pages is a primary indicator of a search engine's capability: by indexing a larger number of pages, a search engine can better satisfy users' requests. Another important factor for a Web search system is the freshness of the pages it indexes; a search engine should reflect changes on the Web in a timely way. More specifically, a search engine should be able to collect enough pages within a limited time frame, say 30 days. Thus, efficiency in collecting pages is essential for a quality search engine.

It is natural to think of parallel processing, or a parallel distributed system, when it comes to efficiently executing tasks over large data sets. Previously, WebGather 1.0 (Liu et al., 2000), which answers more than 30,000 queries every day, adopted a centralized architecture to collect Web pages (i.e., a single main process coordinates many crawlers working in parallel), and one million page indices are maintained after the pages are crawled and parsed. With a capability of crawling 100,000 pages a day, WebGather 1.0 takes about ten days to refresh all the pages it hosts. We note that Google (Google, 2000), which originated at Stanford University, indexed 560 million pages as of July 2000 (Sullivan, 2000). The centralized version, WebGather 1.0, is unable to update a database of that scale in a reasonable period of time: at WebGather 1.0's crawling rate, it would take 100 days to collect 10 million pages, and because pages are refreshed frequently, some of the collected pages would lose their value by then. Of course, it is possible to improve the performance of the system by refining the crawling algorithm, adopting more powerful machines, and using higher network bandwidth. However, given the exponential increase of Web pages, these measures alone cannot cope with the ever-growing number of pages. Adopting parallel processing technology to collect more pages within a limited time frame is therefore essential in developing a large-scale search engine.
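As a rough illustration of this arithmetic, the sketch below estimates how many crawling nodes would be needed to refresh a 10-million-page collection within a 30-day window. The crawl rate and corpus figures are those quoted above; the linear-speedup assumption and the script itself are illustrative only, not a claim about WebGather's measured behaviour.

```python
import math

# Back-of-the-envelope estimate based on the crawl rate and corpus size
# quoted in the text; the linear-speedup assumption is an idealization,
# not a measured property of WebGather.
PAGES_PER_DAY_PER_NODE = 100_000   # WebGather 1.0's reported crawl rate
TARGET_PAGES = 10_000_000          # desired collection size
REFRESH_WINDOW_DAYS = 30           # acceptable refresh period

# Days a single centralized crawler would need (100 days).
days_single_node = TARGET_PAGES / PAGES_PER_DAY_PER_NODE

# Nodes needed to finish within the window, assuming linear speedup (4).
nodes_needed = math.ceil(
    TARGET_PAGES / (PAGES_PER_DAY_PER_NODE * REFRESH_WINDOW_DAYS))

print(f"single node: {days_single_node:.0f} days")
print(f"nodes for a {REFRESH_WINDOW_DAYS}-day refresh: {nodes_needed}")
```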

This paper primarily concerns the design of a parallel and distributed crawling scheme that achieves this goal. We present an architecture and propose methods for collecting pages on the Web for a distributed search engine system. Based on WebGather 1.0 and its log data, we have designed and implemented an experimental model to validate the architecture, design ideas, and methods. The result is being used in the construction of WebGather 2.0.

Harvest: a typical distributed architecture

Harvest (Bowman et al., 1995) is a typical system that uses distributed methods to collect and index Web pages. Harvest consists of several subsystems. The Gatherer subsystem collects indexing information (such as keywords, author names, and titles) from the resources available at Provider sites (such as FTP and HTTP servers). The Broker subsystem retrieves indexing information from one or more Gatherers, eliminates duplicate information, incrementally indexes the collected information, and provides a query interface to it.
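The following minimal sketch illustrates the Gatherer/Broker division of labour described above. The record fields, function names, and index structure are invented for illustration and are not Harvest's actual data formats or interfaces.

```python
# Illustrative sketch of the Gatherer/Broker pipeline. All names and
# fields here are assumptions made for clarity, not Harvest's real API.
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexRecord:
    url: str
    title: str
    keywords: tuple
    author: str

def gatherer(provider_resources):
    """Extract indexing information (keywords, authors, titles) from the
    resources available at a provider site (e.g., an FTP or HTTP server)."""
    for url, meta in provider_resources.items():
        yield IndexRecord(url=url,
                          title=meta.get("title", ""),
                          keywords=tuple(meta.get("keywords", ())),
                          author=meta.get("author", ""))

def broker(gatherer_streams):
    """Merge records from one or more gatherers, eliminate duplicates by
    URL, and build a simple keyword -> URLs index incrementally."""
    seen, index = set(), {}
    for stream in gatherer_streams:
        for rec in stream:
            if rec.url in seen:          # duplicate elimination
                continue
            seen.add(rec.url)
            for kw in rec.keywords:      # incremental indexing
                index.setdefault(kw, []).append(rec.url)
    return index

if __name__ == "__main__":
    # Hypothetical provider site used only to exercise the sketch.
    site = {"http://example.edu/a.html":
            {"title": "A", "keywords": ["crawling"], "author": "X"}}
    print(broker([gatherer(site)]))
```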

Terminology

For clarity, we first define the special terms used in the following sections.

Main-controller is the process that manages multiple gatherers to fetch pages from the Web. Its main functions are to assign URLs to gatherers and to save the abstracts of Web pages returned by the gatherers. It is the core part of WebGather, and in our distributed system each workstation runs one main-controller.

Coordinator is the module that coordinates all main-controllers in the distributed system. It runs on one of the
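To make the main-controller's URL-assignment role concrete, the sketch below partitions URLs among main-controllers by hashing the host name, so that all pages from a given site are handled by the same node. The paper's actual URL allocating algorithm is not reproduced here; the hash rule, function names, and parameters are assumptions used only to show the data flow.

```python
# Illustrative URL-to-main-controller mapping. This is an assumed
# host-hash partition, not the URL allocating algorithm of WebGather.
import hashlib
from urllib.parse import urlparse

def assign_url(url: str, num_main_controllers: int) -> int:
    """Map a URL to a main-controller index by hashing its host name,
    so every page of one site is crawled by the same main-controller."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_main_controllers

if __name__ == "__main__":
    # Hypothetical URLs, distributed across four main-controllers.
    urls = ["http://www.pku.edu.cn/index.html",
            "http://www.pku.edu.cn/about.html",
            "http://www.tsinghua.edu.cn/"]
    for u in urls:
        print(u, "->", assign_url(u, 4))
```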

Simulation results

In June 2000, while WebGather was running, we used a program to extract 507 megabytes of simulation data consisting of Web page URLs and cross-reference URLs. After running the program, we obtained a simulated Web data set of 761,129 Web pages, which serves as the subject of our experiments. All of our measurements were made on a commodity Intel PC with two 550 MHz processors, 512 megabytes of memory, and a 36-gigabyte hard disk, running Solaris 8.0.

Based on the above experimental environment, we

Experimental results

The simulation results demonstrate that the system realizes our design goal, so we applied the architecture and methods to implement WebGather 2.0 and obtained the following results from the actual system. All results for the actual system were obtained from PCs with the same configuration as the experimental PC, with each main-controller running on an independent PC.

Conclusion

The parallel and distributed architecture described in this paper provides a method for efficiently crawling a massive number of Web pages. The simulation results demonstrate that the system realizes our design goal. The fully distributed crawling architecture improves on Google's centralized architecture (Brin and Page, 1998) and scales well as more crawling processors are added. The approach to achieving dynamic reconfigurability described in this paper is simple but effective, which introduces

Acknowledgements

The work of this paper was supported by the National Grand Fundamental Research Program (973) of China (Grant No. G1999032706), and we are grateful to Zhengmao Xie, Jianghua Zhao, and Songwei Shan for their helpful comments.

References (8)

  • Bowman, C. Mic, et al., 1995. The Harvest Information Discovery and Access System, Technical Report, University of...
  • Brin, S., Page, L., 1998. The anatomy of a large-scale hypertextual Web search engine. In: Seventh International World...
  • CNNIC, 2000. China Internet network development status statistical reports....
  • CERNIC, 2000. Information service....

Honfei Yan received his B.S. and M.S. from Harbin Engineering University, PR China, in 1996 and 1999, respectively. Currently he is a Ph.D. candidate in the Department of Computer Science and Technology at Peking University. His research interests include distributed systems and algorithms, Web information retrieval, and databases.

Jianyong Wang is an assistant professor of the Department of Computer Science and Technology at Peking University, PR China. He received his B.S. degree from LanZhou University in 1991 and his Ph.D. degree from Institute of Computing Technology, Chinese Academy of Sciences in 1999, both in computer science. His research interests include distributed systems and algorithms, Web information retrieval, data mining and business intelligence.

Xiaoming Li is a Professor in the Department of Computer Science and Technology at Peking University, PR China. He received his B.S. from Harbin Institute of Technology, PR China, in 1982, and his M.S. and Ph.D. from Stevens Institute of Technology, Hoboken, USA, in 1983 and 1986, respectively. His research interests include programming models for SMP clusters, high-performance and content-based Web information retrieval, distributed Web services, and intelligent broadband multimedia networks.

Lin Guo is an undergraduate student in the Department of Computer Science and Technology at Peking University, PR China. Her research interests include distributed systems and algorithms, Web information retrieval and networks.
