A coordinate-oblivious index for high-dimensional distance similarity searches on the GPU

Authors:
Brian Donnelly, Northern Arizona University
Michael Gowanlock, Northern Arizona University

ICS '20: Proceedings of the 34th ACM International Conference on Supercomputing, June 2020, Article No.: 8, Pages 1–12, https://doi.org/10.1145/3392717.3392768

Published: 29 June 2020

ABSTRACT

We present COSS, an exact method for high-dimensional distance similarity self-joins on the GPU, which finds all points within a search distance ε of each point in a dataset. The similarity self-join can take advantage of the massive parallelism afforded by GPUs, as each point can be searched in parallel. Despite high GPU throughput, distance similarity self-joins exhibit irregular memory access patterns, which yield branch divergence and other performance-limiting factors. Consequently, we propose several GPU optimizations to improve self-join query throughput, including an index designed for the GPU architecture. As data dimensionality increases, the search space increases exponentially. Therefore, to find a reasonable number of neighbors for each point in the dataset, ε may need to be large. The majority of indexing strategies used to prune the ε-search rely on a spatial partition of the data points based on each point's coordinates. As dimensionality increases, this partitioning and pruning strategy yields exhaustive searches that eventually degrade to a brute-force (quadratic) search, which is the well-known curse of dimensionality. To enable index-based pruning in high-dimensional spaces, we depart from previous indexing approaches and propose an indexing strategy that does not index based on each point's coordinate values. Instead, we index based on the distances to reference points, which are arbitrary points in the coordinate space. We show that our indexing scheme is able to prune the search for nearby points in high-dimensional spaces where other approaches suffer severe performance degradation. COSS achieves speedups over CPU and GPU reference implementations of up to 17.7× and 11.8×, respectively.
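The abstract's central idea, pruning with distances to reference points rather than with coordinates, can be illustrated with a small sequential sketch. This is not the COSS index or its GPU kernels, whose details the abstract does not give. It is a generic reference-point filter built on the triangle inequality, using a single reference point and plain Python. All names here (`dist`, `make_points`, `brute_force_self_join`, `reference_point_self_join`) are hypothetical and chosen only for illustration.

```python
import bisect
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def make_points(n, dim, seed=0):
    """Hypothetical helper: n uniform random points in the unit hypercube."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(dim)) for _ in range(n)]


def brute_force_self_join(points, eps):
    """Quadratic baseline: every ordered pair (i, j), i != j, within eps."""
    n = len(points)
    return {(i, j) for i in range(n) for j in range(n)
            if i != j and dist(points[i], points[j]) <= eps}


def reference_point_self_join(points, eps, ref):
    """Self-join that prunes candidates by their distance to one reference point.

    Triangle inequality: |d(p, ref) - d(q, ref)| <= d(p, q). Any q whose
    distance to ref differs from p's by more than eps cannot be within eps
    of p, so only a window of the distance-sorted list needs refinement.
    """
    ref_dist = [dist(p, ref) for p in points]
    order = sorted(range(len(points)), key=lambda i: ref_dist[i])
    sorted_d = [ref_dist[i] for i in order]
    slack = 1e-9  # widen the window slightly to guard against rounding error

    result = set()
    compared = 0  # number of full distance calculations actually performed
    for i, p in enumerate(points):
        lo = bisect.bisect_left(sorted_d, ref_dist[i] - eps - slack)
        hi = bisect.bisect_right(sorted_d, ref_dist[i] + eps + slack)
        for j in order[lo:hi]:
            if j == i:
                continue
            compared += 1
            if dist(p, points[j]) <= eps:  # exact refinement step
                result.add((i, j))
    return result, compared
```

Because every surviving candidate is refined with an exact distance calculation, the filter returns the same pairs as the brute-force join and only reduces the number of distance calculations. How much it prunes depends on the data, on ε, and on where the reference point is placed.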

Index Terms

A coordinate-oblivious index for high-dimensional distance similarity searches on the GPU
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
      1. Massively parallel algorithms
2. Information systems
  1. Information systems applications
    1. Data mining

Published in
ICS '20: Proceedings of the 34th ACM International Conference on Supercomputing
June 2020
499 pages
ISBN:9781450379830
DOI:10.1145/3392717
General Chairs:
Eduard Ayguadé, Universitat Politècnica de Catalunya and Barcelona Supercomputing Center
Wen-mei Hwu, University of Illinois at Urbana-Champaign
Program Chairs:
Rosa M. Badia, Barcelona Supercomputing Center and Universitat Politècnica de Catalunya
H. Peter Hofstee, IBM Austin
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
GPU
high dimensional
in-memory database
multidimensional index
similarity search
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 584 of 2,055 submissions, 28%