
Algebraic query optimization for distributed top-k queries

  • Themenheft Datenbanksysteme
Informatik - Forschung und Entwicklung

Abstract

Distributed top-k query processing is becoming essential in many emerging application classes. This paper addresses the efficient algebraic optimization of top-k queries in wide-area distributed data repositories, where the index lists for the attribute values (or text terms) of a query are distributed across a number of data peers and the computational costs include network latency, bandwidth consumption, and local peer work. We use a dynamic programming approach to find the optimal execution plan; our cost model is based on selectivity estimation over compact data synopses. The optimized query is executed hierarchically, in a small and fixed number of communication phases. Experiments on real web data show the benefits of distributed top-k query optimization in both network resource consumption and query response time.

Summary (translated from the German "Zusammenfassung")

In this work we address the optimization of distributed top-k queries in which the data is spread across several machines. The costs to be minimized comprise the network load, the consumption of local computing power, and ultimately the query execution time. We use dynamic programming to find the optimal query plan. Cost estimation is based on compact representations of the actual score distributions. The optimized query is then executed hierarchically, using only a small and fixed number of communication steps. Extensive experiments with real-world data show considerable gains in both network load and query time.
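
As a rough illustration of the kind of dynamic-programming plan search the abstract mentions, the following Python sketch enumerates subsets of index lists and keeps the cheapest left-deep combination order for each subset. It is not the authors' algorithm or cost model. The function names, the independence assumption used for size estimation, the "ship the smaller side" rule, and the `latency` and `seconds_per_entry` parameters are all hypothetical choices made for this example.

```python
from itertools import combinations


def estimated_size(subset, list_sizes, universe):
    """Estimated number of items common to all lists in `subset`.

    Assumption for this sketch: lists are statistically independent,
    so each list i keeps a fraction list_sizes[i] / universe of the items.
    """
    size = float(universe)
    for i in subset:
        size *= list_sizes[i] / universe
    return size


def optimal_left_deep_plan(list_sizes, universe,
                           latency=0.05, seconds_per_entry=1e-5):
    """Return (cost, order) of the cheapest left-deep combination order.

    Hypothetical cost model: each combination step pays one network
    latency plus the transfer time of the smaller input.
    """
    n = len(list_sizes)
    # Base case: a single list needs no communication yet.
    best = {frozenset([i]): (0.0, (i,)) for i in range(n)}
    for k in range(2, n + 1):
        for combo in combinations(range(n), k):
            subset = frozenset(combo)
            candidates = []
            for last in combo:
                rest = subset - {last}
                prev_cost, prev_order = best[rest]
                shipped = min(estimated_size(rest, list_sizes, universe),
                              list_sizes[last])
                step = latency + shipped * seconds_per_entry
                candidates.append((prev_cost + step, prev_order + (last,)))
            # Keep only the cheapest plan for this subset.
            best[subset] = min(candidates)
    return best[frozenset(range(n))]


cost, order = optimal_left_deep_plan([1000, 10, 500], universe=10000)
print(order, round(cost, 6))
```

Under these assumptions the search defers the largest list to the end, after the two smaller lists have shrunk the intermediate result. A realistic optimizer would replace the independence assumption with estimates drawn from data synopses and would also account for local peer work.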

Author information

Corresponding author

Correspondence to Thomas Neumann.

About this article

Cite this article

Neumann, T., Michel, S. Algebraic query optimization for distributed top-k queries. Informatik Forsch. Entw. 21, 197–211 (2007). https://doi.org/10.1007/s00450-007-0024-2

  • DOI: https://doi.org/10.1007/s00450-007-0024-2
