Abstract
Given two object sets \(P\) and \(Q\), a k-closest pair \((k\hbox {CP})\) query finds \(k\) closest object pairs from \(P\times Q\). This operation is common in many real-life applications such as GIS, data mining, and recommender systems. Although it has received much attention in the Euclidean space, there is little prior work on the metric space. In this paper, we study the problem of kCP query processing in general metric spaces, namely Metric kCP \((\hbox {M}k\hbox {CP})\) search, and propose several efficient algorithms using dynamic disk-based metric indexes (e.g., M-tree), which can be applied to arbitrary type of data as long as a certain metric distance is defined and satisfies the triangle inequality. Our approaches follow depth-first and/or best-first traversal paradigm(s), employ effective pruning rules based on metric space properties and the counting information preserved in the metric index, take advantage of aggressive pruning and compensation to further boost query efficiency, and derive a node-based cost model for \(\hbox {M}k\hbox {CP}\) retrieval. In addition, we extend our techniques to tackle two interesting variants of \(\hbox {M}k\hbox {CP}\) queries. Extensive experiments with both real and synthetic data sets demonstrate the performance of our proposed algorithms, the effectiveness of our developed pruning rules, and the accuracy of our presented cost model.
Similar content being viewed by others
Notes
Available at http://www.census.gov/geo/www/tiger/.
Available at http://www.dbs.informatik.uni-muenchen.de/~seidl.
Available at http://www.sisap.org/metric_space_library.html.
References
Achtert, E., Kriegel, H.P., Kroger, P., Renz, M., Zufle, A.: Reverse \(k\)-nearest neighbor search in dynamic and general metric databases. In: EDBT, pp. 886–897 (2009)
Alvarez, M., Pan, A., Raposo, J., Bellas, F., Cacheda, F.: Using clustering and edit distance techniques for automatic web data extraction. In: WISE, pp. 212–224 (2007)
Angiulli, F., Pizzuti, C.: An approximate algorithm for top-\(k\) closest pairs join query in large high dimensional data. Data Knowl. Eng. 53(3), 263–281 (2005)
Arumugam, S., Jermaine, C.: Closest-point-of-approach join for moving object histories. In: ICDE, pp. 86–95 (2006)
Bohm, C.: A cost model for query processing in high dimensional data spaces. ACM Trans. Database Syst. 25(2), 129–178 (2000)
Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: SIGMOD, pp. 93–104 (2000)
Bustos, B., Navarro, G., Chavez, E.: Pivot selection techniques for proximity searching in metric spaces. Pattern Recognit. Lett. 24(14), 2357–2366 (2003)
Chavez, E., Navarro, G., Baeza-Yates, R., Marroquin, J.L.: Searching in metric spaces. ACM Comput. Surv. 33(3), 273–321 (2001)
Cheema, M.A., Lin, X., Wang, H., Wang, J., Zhang, W.: A unified approach for computing top-\(k\) pairs in multi-dimensional space. In: ICDE, pp. 1031–1042 (2011)
Chen, C., Sun, W., Zheng, B., Mao, D., Liu, W.: An incremental approach to closest pair queries in spatial networks using best-first search. In: DEXA, pp. 136–143 (2011)
Chen, L., Lian, X.: Efficient processing of metric skyline queries. IEEE Trans. Knowl. Data Eng. 21(3), 351–365 (2009)
Ciaccia, P., Nanni, A., Patella, M.: A query-sensitive cost model for similarity queries with M-tree. In: ADC, pp. 65–76 (1999)
Ciaccia, P., Patella, M.: PAC nearest neighbor queries: approximate and controlled search in high dimensional and metric spaces. In: ICDE, pp. 244–255 (2000)
Ciaccia, P., Patella, M., Zezula, P.: M-tree: an efficient access method for similarity search in metric spaces. In: VLDB, pp. 426–435 (1997)
Ciaccia, P., Patella, M., Zezula, P.: A cost model for similarity queries in metric spaces. In: PODS, pp. 59–68 (1998)
Corral, A., Almendros-Jimnez, J.: A performance comparison of distance-based query algorithms using R-trees in spatial databases. Inf. Sci. 177(11), 2207–2237 (2007)
Corral, A., Manolopoulos, Y., Theodoridis, Y., Vassilakopoulos, M.: Closest pair queries in spatial databases. In: SIGMOD, pp. 189–200 (2000)
Corral, A., Manolopoulos, Y., Theodoridis, Y., Vassilakopoulos, M.: Algorithms for processing \(k\)-closest-pair queries in spatial databases. Data Knowl. Eng. 49(1), 67–104 (2004)
Corral, A., Manolopoulos, Y., Theodoridis, Y., Vassilakopoulos, M.: Cost models for distance joins queries using R-trees. Data Knowl. Eng. 57(1), 1–36 (2006)
Corral, A., Vassilakopoulos, M.: On approximate algorithms for distance-based queries using R-trees. Comput. J. 48(2), 220–238 (2005)
Eppstein, D.: Fast hierarchical clustering and other applications of dynamic closest pairs. J. Exp. Algorithm. 5, article 1 (2000)
Fredriksson, K., Braithwaite, B.: Quicker similarity joins in metric spaces. In: SISAP, pp. 127–140 (2013)
Fuhry, D., Jin, R., D.Zhang: Efficient skyline computation in metric space. In: EDBT, pp. 1042–1051 (2009)
Gutierrez, G., Saez, P.: The \(k\) closest pairs in spatial databases. GeoInformatica 17(4), 543–565 (2013)
Hjaltason, G.R., Samet, H.: Incremental distance join algorithms for spatial databases. In: SIGMOD, pp. 237–248 (1998)
Hjaltason, G.R., Samet, H.: Index-driven similarity search in metric spaces. ACM Trans. Database Syst. 28(4), 517–580 (2003)
Jacox, E.H., Samet, H.: Metric space similarity joins. ACM Trans. Database Syst. 33(2), article 7 (2008)
Kim, Y.J., Patel, J.M.: Performance comparison of the R*-tree and the quadtree for \(k\)nn and distance join queries. IEEE Trans. Knowl. Data Eng. 22(7), 1014–1027 (2010)
Kurasawa, H., Takasu, A., Adachi, J.: Finding the \(k\)-closest pairs in metric spaces. In: NTSS, pp. 8–13 (2011)
Nanopoulos, A., Theodoridis, Y., Manolopoulos, Y.: C2P: Clustering based on closest pairs. In: VLDB, pp. 331–340 (2001)
Papadopoulos, A.N., Nanopoulos, A., Manolopoulos, Y.: Processing distance join queries with constraints. Comput. J. 49(3), 281–296 (2006)
Paredes, R., Reyes, N.: Solving similarity joins and range queries in metric spaces with the list of twin clusters. J. Discrete Algorithms 7(1), 18–35 (2009)
Pearson, S.S., Silva, Y.N.: Index-based R-S similarity joins. In: SISAP, pp. 106–112 (2014)
Ristad, E.S., Yianilos, P.N.: Learning string-edit distance. IEEE Trans. Pattern Anal. Mach. Intell. 20(5), 522–532 (1998)
Roumelis, G., Vassilakopoulos, M., Corral, A., Manolopoulos, Y.: A new plane-sweep algorithm for the \(k\)-closest-pairs query. In: SOFSEM, pp. 478–490 (2014)
Samet, H.: Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, San Francisco (2006)
Sarma, A.D., He, Y., Chaudhuri, S.: Clusterjoin: a similarity joins framework using map-reduce. PVLDB 7(12), 1059–1070 (2014)
Shan, J., Zhang, D., Salzberg, B.: On spatial-range closest-pair query. In: SSTD, pp. 252–269 (2003)
Shin, H., Moon, B., Lee, S.: Adaptive multi-stage distance join processing. In: SIGMOD, pp. 343–354 (2000)
Shin, H., Moon, B., Lee, S.: Adaptive and incremental processing for distance join queries. IEEE Trans. Knowl. Data Eng. 15(6), 1561–1578 (2003)
Silva, Y.N., Pearson, S.: Exploiting database similarity joins for metric spaces. In: VLDB, pp. 1922–1925 (2012)
Silva, Y.N., Pearson, S., Cheney, J.A.: Database similarity join for metric spaces. In: SISAP, pp. 266–279 (2013)
Skopal, T., Lokoc, J.: Answering metric skyline queries by PM-tree. In: DATESO, pp. 22–37 (2010)
Skopal, T., Pokorny, J., Snasel, V.: PM-tree: pivoting metric tree for similarity search in multimedia databases. In: ADBIS, pp. 803–815 (2004)
Tao, Y., Yi, K., Sheng, C., Kalnis, P.: Efficient and accurate nearest neighbor and closest pair search in high-dimensional space. ACM Trans. Database Syst. 35(3), article 20 (2010)
Tao, Y., Yiu, M.L., Mamoulis, N.: Reverse nearest neighbor search in metric spaces. IEEE Trans. Knowl. Data Eng. 18(9), 1239–1252 (2006)
Tao, Y., Zhang, J., Papadias, D., Mamoulis, N.: An efficient cost model for optimization of nearest neighbor search in low and medium dimensional spaces. TKDE 16(10), 1169–1184 (2004)
U, L.H., Mamoulis, N., Yiu, M.L.: Computation and monitoring of exclusive closest pairs. IEEE Trans. Knowl. Data Eng. 20(12), 1641–1654 (2008)
Vlachou, A., Doulkeridis, C., Kotidis, Y.: Metric-based similarity search in unstructured peer-to-peer systems. Trans. Large Scale Data Knowl. Cent. Syst. 7100, 28–48 (2012)
Wang, Y., Metwally, A., Parthasarathy, S.: Scalable all-pairs similarity search in metric spaces. In: KDD, pp. 829–837 (2013)
Xiao, C., Wang, W., Lin, X., Shang, H.: Top-\(k\) set similarity joins. In: ICDE, pp. 916–927 (2009)
Yang, C., Lin, K.I.: An index structure for improving closest pairs and related join queries in spatial databases. In: IDEAS, pp. 140–149 (2002)
Zezula, P., Savino, P., Amato, G., Rabitti, F.: Approximate similarity retrieval with M-trees. VLDB J. 7(4), 275–293 (1998)
Zhou, P., Zhang, D., Salzberg, B., Cooperman, G., Kollios, G.: Close pair queries in moving object databases. In: GIS, pp. 2–11 (2005)
Acknowledgments
Yunjun Gao was supported in part by the National Key Basic Research and Development Program (i.e., 973 Program) No. 2015CB352502, NSFC Grant No. 61379033, the Cyber Innovation Joint Research Center of Zhejiang University, and the Key Project of Zhejiang University Excellent Young Teacher Fund (Zijin Plan). We would like to thank Prof. A. Corral and Prof. T. Skopal for their useful feedback on the source codes of their proposed algorithms in [18, 44]. We also would like to express our gratitude to some anonymous reviewers for their giving valuable and helpful comments to improve the technical quality and presentation of this paper.
Author information
Authors and Affiliations
Corresponding author
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Gao, Y., Chen, L., Li, X. et al. Efficient \(k\)-closest pair queries in general metric spaces. The VLDB Journal 24, 415–439 (2015). https://doi.org/10.1007/s00778-015-0383-4
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-015-0383-4