Abstract
Multi-dimensional indexing structures have attracted considerable attention in data mining. The most commonly used indexing structures are the R-tree and its variants, the quad-tree, and the k-d tree. These structures support region queries (point, window, and neighborhood queries) and nearest neighbor queries, both of which are used extensively in data mining algorithms. Although these structures execute such queries in logarithmic time, the constraints associated with them become a bottleneck in query execution on large, high-dimensional datasets. Moreover, they do not cater to the specific data access patterns of data mining algorithms. In this paper, we propose a new data structure, Grid-R-tree, a grid-based R-tree designed specifically to address the querying requirements of multiple data mining algorithms. Grid-R-tree is a simple yet effective adaptation of the R-tree using the concept of a grid. We also introduce a new query over Grid-R-tree, the cell-wise epsilon neighborhood query (CellWiseNBH), which captures the locality in the query execution pattern of density-based clustering algorithms and allows us to redesign them for improved efficiency. Our theoretical and experimental analysis shows that the proposed data structure outperforms the conventional R-tree on neighborhood and nearest neighbor queries. The experiments were conducted on datasets of size up to 100 million and dimensionality up to 74. The results also suggest that Grid-R-tree improves the efficiency of data mining algorithms such as the k-nearest neighbor classifier and DBSCAN clustering (including the redesigned version that uses CellWiseNBH).
Additionally, an adaptive grid optimization is applied to dense cells, that is, cells whose number of indexed data points exceeds a threshold \(\tau \), to keep the load evenly distributed across cells. This resulted in more efficient query performance on datasets with a skewed distribution of data points.
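To make the idea of grid-based neighborhood lookup concrete, here is a minimal sketch in Python. It is not the Grid-R-tree itself: the abstract does not describe the internal layout, so this example assumes a plain uniform grid whose cells are stored in a hash map, and it omits both the R-tree component and the adaptive handling of dense cells. The names `UniformGrid`, `cell_size`, and `eps_neighborhood` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import product
from math import ceil, dist, floor


class UniformGrid:
    """Hypothetical uniform grid index (an illustration, not the paper's Grid-R-tree)."""

    def __init__(self, points, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # cell coordinates -> points in that cell
        for p in points:
            self.cells[self._cell_of(p)].append(p)

    def _cell_of(self, p):
        # Map each coordinate to the integer index of the cell that contains it.
        return tuple(floor(x / self.cell_size) for x in p)

    def eps_neighborhood(self, q, eps):
        """Return all indexed points within Euclidean distance eps of q."""
        # A point within eps of q lies at most `reach` cells away on every axis.
        reach = ceil(eps / self.cell_size)
        center = self._cell_of(q)
        result = []
        for offset in product(range(-reach, reach + 1), repeat=len(center)):
            key = tuple(c + o for c, o in zip(center, offset))
            # .get avoids creating empty cells in the defaultdict.
            for p in self.cells.get(key, ()):
                if dist(p, q) <= eps:
                    result.append(p)
        return result


points = [(0.0, 0.0), (0.5, 0.5), (3.0, 3.0), (0.9, 0.0)]
grid = UniformGrid(points, cell_size=1.0)
neighbors = grid.eps_neighborhood((0.0, 0.0), eps=1.0)
```

Only the cells near the query point are scanned, so distant points such as `(3.0, 3.0)` are never compared against the query. The number of scanned cells grows exponentially with dimensionality in this naive form, which is one reason a plain grid alone would not be expected to suffice for high-dimensional data.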
Cite this article
Goyal, P., Challa, J.S., Kumar, D. et al. Grid-R-tree: a data structure for efficient neighborhood and nearest neighbor queries in data mining. Int J Data Sci Anal 10, 25–47 (2020). https://doi.org/10.1007/s41060-020-00208-2