Abstract
Distributed top-k query processing is becoming essential functionality in a large number of emerging application classes. This paper addresses the efficient algebraic optimization of top-k queries in wide-area distributed data repositories, where the index lists for the attribute values (or text terms) of a query are distributed across a number of data peers and the computational costs include network latency, bandwidth consumption, and local peer work. We use a dynamic programming approach to find the optimal execution plan; our cost model is based on selectivity estimates derived from compact data synopses. The optimized query is executed hierarchically in a small, fixed number of communication phases. Experiments on real web data show the benefits of distributed top-k query optimization in both network resource consumption and query response time.
Summary
In this work we address the optimization of distributed top-k queries in which the data are spread across multiple machines. The costs to be minimized comprise network load, consumption of local computing resources, and ultimately query execution time. We use dynamic programming to find the optimal query plan. Cost estimation is based on compact representations of the actual score distributions. The optimized query is then executed hierarchically, using only a small, fixed number of communication steps. Extensive experiments with real-world data show considerable gains in both reduced network load and reduced query time.
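The dynamic-programming plan search mentioned in the abstract can be illustrated with a small toy sketch. It is not the paper's optimizer: the abstract does not specify the algebra, the operators, or the cost model, so everything below is an assumption made for illustration. The function `optimize`, the callback `estimate_output` (a stand-in for synopsis-based selectivity estimation), the peer names, the sizes, and the forwarding cap of 50 are all hypothetical. The sketch enumerates binary merge trees over the peers and charges each merge the estimated number of entries shipped by its two inputs.

```python
from functools import lru_cache
from itertools import combinations


def optimize(peers, estimate_output):
    """Return (cost, tree) of the cheapest binary merge tree over `peers`.

    `estimate_output(subset)` is a hypothetical estimator: it returns the
    estimated number of entries that the sub-plan covering `subset`
    (a tuple of peer names) would ship to its parent.
    """
    peers = tuple(peers)

    @lru_cache(maxsize=None)
    def best(subset):
        # A single peer needs no merge; it is a leaf of the plan tree.
        if len(subset) == 1:
            return 0.0, subset[0]
        first, rest = subset[0], subset[1:]
        best_cost, best_tree = float("inf"), None
        # Pin the first peer to the left side so that each split is
        # enumerated once; the right side is never empty.
        for r in range(len(rest)):
            for extra in combinations(rest, r):
                left = (first,) + extra
                right = tuple(p for p in rest if p not in extra)
                left_cost, left_tree = best(left)
                right_cost, right_tree = best(right)
                # Assumed cost model: both inputs are shipped to the merge.
                cost = (left_cost + right_cost
                        + estimate_output(left) + estimate_output(right))
                if cost < best_cost:
                    best_cost, best_tree = cost, (left_tree, right_tree)
        return best_cost, best_tree

    return best(peers)


# Hypothetical example: estimated index-list sizes per peer, with every
# plan node assumed to forward at most 50 candidate entries.
sizes = {"A": 100, "B": 10, "C": 20}


def capped_estimate(subset):
    return min(sum(sizes[p] for p in subset), 50)


cost, tree = optimize(["A", "B", "C"], capped_estimate)
print(cost, tree)
```

Under these assumed numbers the sketch prefers merging the two small lists first and then combining the result with the large one. Memoizing on the peer subset is what makes this dynamic programming: each subset is solved once and reused by every larger plan that contains it.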
Cite this article
Neumann, T., Michel, S. Algebraic query optimization for distributed top-k queries. Informatik Forsch. Entw. 21, 197–211 (2007). https://doi.org/10.1007/s00450-007-0024-2
DOI: https://doi.org/10.1007/s00450-007-0024-2