Tuning the capacity of search engines: Load-driven routing and incremental caching to reduce and balance the load

Published: 10 June 2010

Abstract

This article introduces an architecture for a document-partitioned search engine, based on a novel approach that combines collection selection and load balancing, called load-driven routing. By exploiting the query-vector document model and the incremental caching technique, our architecture can compute very high-quality results for any query with only a fraction of the computational load used in a typical document-partitioned architecture. By trading off a small fraction of the results, our technique strongly reduces the computing pressure on a search engine back-end: we are able to retrieve more than 2/3 of the top-5 results for a given query with only 10% of the computing load needed by a configuration in which the query is processed by every index partition. Alternatively, we can increase the load to 25% to improve precision and retrieve more than 80% of the top-5 results. The flexibility of our system allows a wide range of configurations, so it can easily respond to different needs in result quality or to restrictions in computing power. More importantly, the system configuration can be adjusted dynamically to cope with unexpected query peaks or unpredictable failures. This article consolidates recent work by the authors and reports results from tests conducted on 6 million documents and 2,800,000 queries, with real query cost timings measured on an actual index.
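
The abstract names the two mechanisms but does not give their details, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes that collection selection yields a ranked list of partitions, that routing skips partitions whose load has reached a cap, and that the cache remembers which partitions were already polled for a query so that a repeated query polls new ones and merges their results. All names (`CacheEntry`, `route`, `answer`, `max_load`, `budget`, `search`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CacheEntry:
    # doc id -> best score seen so far for this query
    results: dict[str, float]
    # partitions already polled for this query
    polled: set[int] = field(default_factory=set)


def route(ranked, load, max_load, budget):
    """Pick up to `budget` partitions, best-ranked first, skipping loaded ones."""
    chosen = []
    for p in ranked:
        if len(chosen) == budget:
            break
        if load[p] < max_load:  # assumed load cap, not taken from the article
            chosen.append(p)
    return chosen


def answer(query, ranked, load, max_load, budget, cache, search, k=5):
    """Return the top-k doc ids, polling only partitions not polled before."""
    entry = cache.setdefault(query, CacheEntry(results={}))
    fresh = [p for p in ranked if p not in entry.polled]
    for p in route(fresh, load, max_load, budget):
        load[p] += 1  # this sketch never decays load; a real system would
        entry.polled.add(p)
        for doc, score in search(p, query):
            best = entry.results.get(doc, float("-inf"))
            entry.results[doc] = max(score, best)
    top = sorted(entry.results.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return [doc for doc, _ in top]
```

Here `search(partition, query)` stands for whatever call returns `(doc_id, score)` pairs from one index partition. With `budget` set to a small fraction of the partitions, each query costs only that fraction of a full broadcast, and repeated queries gradually approach the full result set as more partitions are polled.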

Published In

ACM Transactions on Information Systems  Volume 28, Issue 2
May 2010
165 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/1740592
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 June 2010
Accepted: 01 March 2009
Revised: 01 December 2008
Received: 01 February 2008
Published in TOIS Volume 28, Issue 2

Author Tags

  1. Distributed IR
  2. Web search engines
  3. collection selection
  4. incremental caching

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (last 12 months): 2
  • Downloads (last 6 weeks): 0

Reflects downloads up to 15 Feb 2025

Cited By

  • (2020) Improving Load Balance via Resource Exchange in Large-Scale Search Engines. Proceedings of the 49th International Conference on Parallel Processing, 1-11. DOI: 10.1145/3404397.3404402. Online publication date: 17-Aug-2020
  • (2020) Pre-indexing Pruning Strategies. String Processing and Information Retrieval, 177-193. DOI: 10.1007/978-3-030-59212-7_13. Online publication date: 13-Oct-2020
  • (2017) A machine learning approach for result caching in web search engines. Information Processing and Management: an International Journal 53:4, 834-850. DOI: 10.1016/j.ipm.2017.02.006. Online publication date: 1-Jul-2017
  • (2017) Exploiting Social Annotations to Generate Resource Descriptions in a Distributed Environment: Cooperative Multi-Agent Simulation on Query-Based Sampling. The Review of Socionetwork Strategies 11:1, 83-93. DOI: 10.1007/s12626-017-0001-6. Online publication date: 1-Jun-2017
  • (2017) On the Efficiency of Selective Search. Advances in Information Retrieval, 705-712. DOI: 10.1007/978-3-319-56608-5_69. Online publication date: 8-Apr-2017
  • (2016) Efficient dynamic pruning on largest scores first (LSF) retrieval. Frontiers of Information Technology & Electronic Engineering 17:1, 1-14. DOI: 10.1631/FITEE.1500190. Online publication date: 9-Jan-2016
  • (2016) Reducing hardware hit by queries in web search engines. Information Processing and Management: an International Journal 52:6, 1031-1052. DOI: 10.1016/j.ipm.2016.04.008. Online publication date: 1-Nov-2016
  • (2016) Hashing-based clustering in high dimensional data. Expert Systems with Applications: An International Journal 62:C, 202-211. DOI: 10.1016/j.eswa.2016.06.008. Online publication date: 15-Nov-2016
  • (2015) Scalability Challenges in Web Search Engines. Synthesis Lectures on Information Concepts, Retrieval, and Services 7:6, 1-138. DOI: 10.2200/S00662ED1V01Y201508ICR045. Online publication date: 29-Dec-2015
  • (2015) Selective Search. ACM Transactions on Information Systems 33:4, 1-33. DOI: 10.1145/2738035. Online publication date: 23-Apr-2015
