Grid-R-tree: a data structure for efficient neighborhood and nearest neighbor queries in data mining

Goyal, Poonam; Challa, Jagat Sesh; Kumar, Dhruv; Bhat, Anuvind; Balasubramaniam, Sundar; Goyal, Navneet

doi:10.1007/s41060-020-00208-2

Regular Paper
Published: 03 April 2020

Volume 10, pages 25–47, (2020)

International Journal of Data Science and Analytics

Abstract

Multi-dimensional indexing structures have gained considerable attention in data mining. The most commonly used data structures for indexing data are the R-tree and its variants, the quad-tree and the k-d-tree. These data structures support region queries (point, window and neighborhood queries) and nearest neighbor queries, which are used extensively in data mining algorithms. Although these data structures facilitate execution of the above queries in logarithmic time, the constraints associated with them become a bottleneck in query execution when they are used for large and high-dimensional datasets. Moreover, these indexing structures do not cater to the specific data access patterns of data mining algorithms. In this paper, we propose a new data structure, Grid-R-tree, a grid-based R-tree that is specifically designed to address the querying requirements of multiple data mining algorithms. Grid-R-tree is a simple yet effective adaptation of the R-tree using the concept of a grid. We also introduce a new query over Grid-R-tree, called the cell-wise epsilon neighborhood query (CellWiseNBH), which captures the locality in the query execution pattern of density-based clustering algorithms and enables us to redesign them to improve their efficiency. Our theoretical and experimental analyses show that the proposed data structure outperforms the conventional R-tree on neighborhood and nearest neighbor queries. The experiments were conducted on datasets of size up to 100 million and dimensionality up to 74. The results also suggest that Grid-R-tree improves the efficiency of data mining algorithms such as the k-nearest neighbor classifier and DBSCAN clustering (including the redesigned version that uses CellWiseNBH). Additionally, an adaptive grid optimization is applied to dense cells, that is, cells whose number of indexed data points exceeds a threshold $\tau$, to keep the load evenly distributed across cells; this resulted in more efficient query performance on datasets with a skewed distribution of data points.
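
To make the notion of an epsilon neighborhood query over a grid concrete, the following is a minimal sketch of a plain uniform grid index in Python. It is not the authors' Grid-R-tree or their CellWiseNBH query, whose details are not given in the abstract; the names `GridIndex`, `eps_neighborhood`, `dense_cells` and `cell_size` are hypothetical and chosen only for illustration. The sketch assumes Euclidean distance and in-memory data, and it simply enumerates the cells around the query point, which is only practical in low dimensions.

```python
import math
from collections import defaultdict
from itertools import product


class GridIndex:
    """Uniform grid over d-dimensional points (illustrative only, not Grid-R-tree)."""

    def __init__(self, points, cell_size):
        self.points = [tuple(p) for p in points]
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # cell coordinates -> list of point ids
        for pid, p in enumerate(self.points):
            self.cells[self._cell_of(p)].append(pid)

    def _cell_of(self, p):
        # Integer cell coordinates of a point.
        return tuple(math.floor(x / self.cell_size) for x in p)

    def eps_neighborhood(self, q, eps):
        """Return the ids of all points within Euclidean distance eps of q."""
        # A point within eps of q lies at most `reach` cells away along each axis.
        reach = math.ceil(eps / self.cell_size)
        home = self._cell_of(q)
        result = []
        for offset in product(range(-reach, reach + 1), repeat=len(home)):
            cell = tuple(h + o for h, o in zip(home, offset))
            # .get avoids creating empty entries in the defaultdict.
            for pid in self.cells.get(cell, ()):
                if math.dist(q, self.points[pid]) <= eps:
                    result.append(pid)
        return sorted(result)

    def dense_cells(self, tau):
        """Cells holding more than tau points (assumed reading of the threshold)."""
        return sorted(c for c, ids in self.cells.items() if len(ids) > tau)


points = [(0.1, 0.1), (0.4, 0.2), (0.9, 0.9), (2.5, 2.5), (0.45, 0.25)]
index = GridIndex(points, cell_size=0.5)
neighbors = index.eps_neighborhood((0.3, 0.2), eps=0.25)
print(neighbors)                  # [0, 1, 4]
print(index.dense_cells(tau=2))   # [(0, 0)]
```

Only points in cells near the query are compared against it, which is the locality a grid provides. The `dense_cells` helper merely flags cells above a threshold; how such cells are subdivided by the adaptive grid optimization is not specified in the abstract and is left out of the sketch.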

Author information

Authors and Affiliations

ADAPT Lab, Department of Computer Science and Information Systems, Birla Institute of Technology and Science, Pilani Campus, Pilani, India
Poonam Goyal, Jagat Sesh Challa, Dhruv Kumar, Anuvind Bhat, Sundar Balasubramaniam & Navneet Goyal

Corresponding author

Correspondence to Poonam Goyal.

About this article

Cite this article

Goyal, P., Challa, J.S., Kumar, D. et al. Grid-R-tree: a data structure for efficient neighborhood and nearest neighbor queries in data mining. Int J Data Sci Anal 10, 25–47 (2020). https://doi.org/10.1007/s41060-020-00208-2

Received: 24 July 2018
Accepted: 08 March 2020
Published: 03 April 2020
Issue Date: June 2020
DOI: https://doi.org/10.1007/s41060-020-00208-2
