Tuning the capacity of search engines: Load-driven routing and incremental caching to reduce and balance the load

Authors:

Fabrizio Silvestri,

Raffaele Perego,

Ricardo Baeza-Yates

ACM Transactions on Information Systems (TOIS), Volume 28, Issue 2

Article No.: 5, Pages 1 - 36

https://doi.org/10.1145/1740592.1740593

Published: 10 June 2010

Abstract

This article introduces an architecture for a document-partitioned search engine based on load-driven routing, a novel approach that combines collection selection and load balancing. By exploiting the query-vector document model and the incremental caching technique, our architecture can compute very high-quality results for any query with only a fraction of the computational load used in a typical document-partitioned architecture. By trading off a small fraction of the results, our technique strongly reduces the computing pressure on a search engine back-end: we are able to retrieve more than 2/3 of the top-5 results for a given query with only 10% of the computing load needed by a configuration in which every index partition processes the query. Alternatively, we can increase the load slightly, up to 25%, to improve precision and retrieve more than 80% of the top-5 results. The flexibility of our system allows a wide range of configurations, so that it can easily respond to different needs in result quality or restrictions in computing power. More importantly, the system configuration can be adjusted dynamically to cope with unexpected query peaks or unpredictable failures. This article consolidates recent work by the authors and reports results from tests conducted on 6 million documents and 2,800,000 queries, with real query cost timing as measured on an actual index.
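The abstract does not spell out the algorithms, so the following Python fragment is only a minimal sketch of how load-driven routing and incremental caching could interact, not the authors' implementation. It assumes that a collection-selection step has already ranked the partitions for a query, that each partition exposes a current load value, and that a cached entry remembers which partitions were already polled. All names (`CacheEntry`, `route`, `answer`, `max_load`, `budget`) are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class CacheEntry:
    """Best results seen so far for a query, plus the partitions already polled."""
    results: list = field(default_factory=list)  # (score, doc_id) pairs
    polled: set = field(default_factory=set)     # ids of partitions already queried


def route(ranked_partitions, load, max_load, budget):
    """Pick up to `budget` partitions in rank order, skipping overloaded ones."""
    chosen = []
    for pid in ranked_partitions:
        if len(chosen) == budget:
            break
        if load[pid] < max_load:  # assumption: a single fixed load threshold
            chosen.append(pid)
    return chosen


def answer(query, ranked_partitions, search, cache, load,
           max_load=1.0, budget=1, k=5):
    """Return the current top-k for `query`, polling only a few partitions.

    `search(pid, query)` is a caller-supplied function returning
    (score, doc_id) pairs from partition `pid`.
    """
    entry = cache.setdefault(query, CacheEntry())
    # Incremental step: only partitions not yet polled for this query qualify.
    candidates = [p for p in ranked_partitions if p not in entry.polled]
    for pid in route(candidates, load, max_load, budget):
        entry.results.extend(search(pid, query))
        entry.polled.add(pid)
    # Keep the k best distinct results, highest score first.
    entry.results = heapq.nlargest(k, set(entry.results))
    return entry.results
```

In this sketch, each repetition of a query polls at most `budget` further partitions, so a cached answer can improve over time while the per-request load stays bounded. Lowering `max_load` or `budget` trades result quality for lower back-end load.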

Index Terms

1. Information systems
  1. Information retrieval
  2. Information storage systems
    1. Storage architectures
      1. Distributed storage

Published In

ACM Transactions on Information Systems Volume 28, Issue 2

May 2010

165 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/1740592

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 June 2010

Accepted: 01 March 2009

Revised: 01 December 2008

Received: 01 February 2008

Published in TOIS Volume 28, Issue 2

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total citations: 26
Total downloads: 606
Downloads (last 12 months): 2
Downloads (last 6 weeks): 0

Reflects downloads up to 15 Feb 2025

Cited By

Duan K, Li Y, Marbach T, Wang G, Liu X (2020). Improving Load Balance via Resource Exchange in Large-Scale Search Engines. Proceedings of the 49th International Conference on Parallel Processing, 1-11. Online publication date: 17-Aug-2020.
https://dl.acm.org/doi/10.1145/3404397.3404402

Altin S, Baeza-Yates R, Cambazoglu B (2020). Pre-indexing Pruning Strategies. String Processing and Information Retrieval, 177-193. Online publication date: 13-Oct-2020.
https://dl.acm.org/doi/10.1007/978-3-030-59212-7_13

Kucukyilmaz T, Cambazoglu B, Aykanat C, Baeza-Yates R (2017). A machine learning approach for result caching in web search engines. Information Processing and Management: an International Journal 53:4, 834-850. Online publication date: 1-Jul-2017.
https://dl.acm.org/doi/10.1016/j.ipm.2017.02.006

Saoud Z, Kechid S, Saoud M, Doucet A (2017). Exploiting Social Annotations to Generate Resource Descriptions in a Distributed Environment: Cooperative Multi-Agent Simulation on Query-Based Sampling. The Review of Socionetwork Strategies 11:1, 83-93. Online publication date: 1-Jun-2017.
https://doi.org/10.1007/s12626-017-0001-6

Hafizoglu F, Kucukoglu E, Altingovde I (2017). On the Efficiency of Selective Search. Advances in Information Retrieval, 705-712. Online publication date: 8-Apr-2017.
https://doi.org/10.1007/978-3-319-56608-5_69

Jiang K, Yang Y (2016). Efficient dynamic pruning on largest scores first (LSF) retrieval. Frontiers of Information Technology & Electronic Engineering 17:1, 1-14. Online publication date: 9-Jan-2016.
https://doi.org/10.1631/FITEE.1500190

Mendoza M, Marín M, Gil-Costa V, Ferrarotti F (2016). Reducing hardware hit by queries in web search engines. Information Processing and Management: an International Journal 52:6, 1031-1052. Online publication date: 1-Nov-2016.
https://dl.acm.org/doi/10.1016/j.ipm.2016.04.008

Zamora J, Mendoza M, Allende H (2016). Hashing-based clustering in high dimensional data. Expert Systems with Applications: An International Journal 62:C, 202-211. Online publication date: 15-Nov-2016.
https://dl.acm.org/doi/10.1016/j.eswa.2016.06.008

Cambazoglu B, Baeza-Yates R (2015). Scalability Challenges in Web Search Engines. Synthesis Lectures on Information Concepts, Retrieval, and Services 7:6, 1-138. Online publication date: 29-Dec-2015.
https://doi.org/10.2200/S00662ED1V01Y201508ICR045

Kulkarni A, Callan J (2015). Selective Search. ACM Transactions on Information Systems 33:4, 1-33. Online publication date: 23-Apr-2015.
https://dl.acm.org/doi/10.1145/2738035
