Algebraic query optimization for distributed top-k queries

Neumann, Thomas; Michel, Sebastian

doi:10.1007/s00450-007-0024-2

Special issue on database systems (Themenheft Datenbanksysteme)
Published: 01 June 2007

Volume 21, pages 197–211 (2007)

Informatik - Forschung und Entwicklung

Thomas Neumann¹ &
Sebastian Michel¹

Abstract

Distributed top-k query processing is increasingly becoming an essential functionality in a large number of emerging application classes. This paper addresses the efficient algebraic optimization of top-k queries in wide-area distributed data repositories where the index lists for the attribute values (or text terms) of a query are distributed across a number of data peers and the computational costs include network latency, bandwidth consumption, and local peer work. We use a dynamic programming approach to find the optimal execution plan using compact data synopses for selectivity estimation that is the basis for our cost model. The optimized query is executed in a hierarchical way involving a small and fixed number of communication phases. We have performed experiments on real web data that show the benefits of distributed top-k query optimization both in network resource consumption and query response time.
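
The dynamic programming step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the preview gives neither their algebra nor their cost model, so the function name `optimal_plan`, the per-list size estimates, the single `selectivity` factor, and the "ship the smaller input" cost rule are all assumptions made purely for illustration. The sketch enumerates subsets of index lists bottom-up and keeps the cheapest way to combine each subset.

```python
from itertools import combinations


def optimal_plan(list_sizes, selectivity, transfer_cost=1.0):
    """Subset dynamic programming over index lists (illustrative only).

    list_sizes    -- hypothetical estimated entry count per index list,
                     standing in for what a synopsis would provide
    selectivity   -- hypothetical fraction of entries surviving a merge
    transfer_cost -- hypothetical network cost per shipped entry
    Returns (cost, estimated cardinality, nested plan tuple).
    """
    names = sorted(list_sizes)
    # best[subset] = (cost, estimated cardinality, plan)
    best = {frozenset([n]): (0.0, float(list_sizes[n]), n) for n in names}

    for size in range(2, len(names) + 1):
        for subset in combinations(names, size):
            whole = frozenset(subset)
            # The cost rule is symmetric, so only splits whose left side
            # has at most half of the elements need to be tried.
            for left_size in range(1, size // 2 + 1):
                for left in combinations(subset, left_size):
                    l_set = frozenset(left)
                    r_set = whole - l_set
                    l_cost, l_card, l_plan = best[l_set]
                    r_cost, r_card, r_plan = best[r_set]
                    # Assumed cost rule: ship the smaller input to the peer
                    # holding the larger one.
                    shipped = min(l_card, r_card)
                    cost = l_cost + r_cost + transfer_cost * shipped
                    card = selectivity * shipped
                    if whole not in best or cost < best[whole][0]:
                        best[whole] = (cost, card, (l_plan, r_plan))

    return best[frozenset(names)]


if __name__ == "__main__":
    # Three hypothetical index lists held by three peers.
    print(optimal_plan({"a": 100, "b": 1000, "c": 10}, selectivity=0.5))
```

With these made-up inputs the cheapest plan first combines the small list `c` with one of the larger lists and only then brings in the third, which is the kind of ordering decision a cost-based optimizer is meant to make. A real cost model would also account for latency and local peer work, which the abstract names as cost factors.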

Summary (translated from the German Zusammenfassung)

In this work we address the optimization of distributed top-k queries in which the data is spread across different machines. The costs to be minimized comprise network load, consumption of local computing power, and ultimately query execution time. We use dynamic programming to find the optimal query plan. Cost estimation is based on compact representations of the actual score distributions. The optimized query is then executed hierarchically, using only a small, fixed number of communication steps. Extensive experiments with real-world data show considerable gains both in reduced network load and in reduced query time.

Author information

Authors and Affiliations

Max-Planck-Institut für Informatik, Stuhlsatzenhausweg 85, 66123, Saarbrücken, Germany
Thomas Neumann & Sebastian Michel

Corresponding author

Correspondence to Thomas Neumann.

About this article

Cite this article

Neumann, T., Michel, S. Algebraic query optimization for distributed top-k queries. Informatik Forsch. Entw. 21, 197–211 (2007). https://doi.org/10.1007/s00450-007-0024-2

Received: 10 October 2006
Accepted: 19 March 2007
Published: 01 June 2007
Issue Date: June 2007
DOI: https://doi.org/10.1007/s00450-007-0024-2
