Abstract
In many applications, the top-k query is an important operation that returns a set of interesting points from a potentially huge data space. Existing algorithms cannot process top-k queries on massive data efficiently: they either maintain too many candidates, require auxiliary structures built on a specific attribute subset, or return results with only a probabilistic guarantee. This paper proposes TKAP, a sorted-list-based algorithm that uses data structures with low space overhead to compute top-k results on massive data efficiently. During round-robin retrieval on the sorted lists, TKAP performs an adaptive pruning operation and maintains the required candidates until the stop condition is satisfied. The pruning operation is adjusted using information obtained during round-robin retrieval to achieve a better pruning effect. The paper develops the adaptive pruning rule and analyzes it theoretically. Extensive experiments on synthetic and real-life data sets show a significant advantage of TKAP over the existing algorithms.
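To make the setting concrete, the following is a minimal sketch of round-robin retrieval over sorted lists with a threshold-style stop condition, in the spirit of the classic threshold algorithm of Fagin et al. It is not TKAP: the abstract does not specify the adaptive pruning rule, so none is implemented here. The function name `topk_round_robin`, the sum scoring function, and the in-memory row layout are all assumptions made for illustration.

```python
import heapq


def topk_round_robin(rows, k):
    """Return the k highest-scoring rows as (score, row_id), best first.

    Assumptions (illustrative only): each row is a tuple of numeric
    attributes, the score is the plain sum of the attributes, and
    larger scores are better.
    """
    if not rows or k <= 0:
        return []
    m = len(rows[0])
    # One sorted list per attribute: row ids in descending attribute order.
    lists = [
        sorted(range(len(rows)), key=lambda i, j=j: -rows[i][j])
        for j in range(m)
    ]
    seen = set()
    heap = []  # min-heap holding at most k (score, row_id) candidates
    for depth in range(len(rows)):
        last_values = []
        # Round-robin: read one entry from each sorted list at this depth.
        for j in range(m):
            rid = lists[j][depth]
            last_values.append(rows[rid][j])
            if rid in seen:
                continue
            seen.add(rid)
            score = sum(rows[rid])  # full score via direct lookup of the row
            if len(heap) < k:
                heapq.heappush(heap, (score, rid))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, rid))
        # No unseen row can score above the sum of the values last read.
        threshold = sum(last_values)
        if len(heap) == k and heap[0][0] >= threshold:
            break  # stop condition: the k-th candidate already meets the bound
    return sorted(heap, reverse=True)


rows = [(9, 1), (8, 8), (2, 9), (5, 5), (1, 2)]
print(topk_round_robin(rows, 2))
```

In this sketch every newly seen row is scored in full, so the only saving comes from stopping early. An approach such as the one the abstract describes would additionally discard candidates during retrieval, which this baseline does not attempt.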
Notes
For attributes in T, we only consider \(A_1, \ldots , A_m\) in Sect. 4.
Acknowledgments
This work was supported in part by the National Basic Research (973) Program of China under Grant No. 2012CB316200, the National Natural Science Foundation of China under Grant Nos. 61402130, 61272046, 61190115, 61173022, and 61033015, the Shandong Provincial Natural Science Foundation under Grant No. ZR2013FQ028, and the Natural Scientific Research Innovation Foundation in Harbin Institute of Technology under Grant Nos. HIT.NSRIF.2014136 and HIT(WH)201308.
Cite this article
Han, X., Liu, X., Li, J. et al. TKAP: Efficiently processing top-k query on massive data by adaptive pruning. Knowl Inf Syst 47, 301–328 (2016). https://doi.org/10.1007/s10115-015-0836-5
DOI: https://doi.org/10.1007/s10115-015-0836-5