research-article

Exact Set Similarity Joins for Large Datasets in the GPGPU paradigm

Authors:
Christos Bellas, Aristotle University of Thessaloniki, Greece
Anastasios Gounaris, Aristotle University of Thessaloniki, Greece

DaMoN'19: Proceedings of the 15th International Workshop on Data Management on New Hardware, July 2019, Article No.: 5, Pages 1–10. https://doi.org/10.1145/3329785.3329919

Published: 01 July 2019

ABSTRACT

We investigate the problem of exact set similarity joins using a CPU-GPU co-processing scheme. We focus on large instances of the problem, i.e., datasets of more than 1M entries, which may take hours to complete if not approached with care, owing to the inherent quadratic complexity of the problem. We introduce a novel co-processing scheme that performs initial filtering and indexing on the CPU and delegates final verification to the GPU. Further, we show that this scheme improves upon the state of the art in both CPU and GPU standalone solutions in several cases.
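
The filter-then-verify split described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which filter, index, or similarity function is used, so the prefix filter, the inverted index, the Jaccard measure, and every function name below are assumptions chosen for illustration. The `verify` function runs on the CPU here and only stands in for the step the paper delegates to the GPU.

```python
from collections import defaultdict
from math import ceil


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def candidate_pairs(records, threshold):
    """Filtering and indexing step (assumed technique: prefix filter).

    Each record is a duplicate-free list of tokens sorted by one global
    order. Two records with Jaccard >= threshold must share a token in
    their prefixes of length |r| - ceil(threshold * |r|) + 1.
    """
    index = defaultdict(list)  # token -> ids of records seen so far
    candidates = set()
    for rid, rec in enumerate(records):
        # The small epsilon guards against float round-up, e.g. 0.7 * 10.
        min_overlap = ceil(threshold * len(rec) - 1e-9)
        prefix_len = len(rec) - min_overlap + 1
        for token in rec[:prefix_len]:
            for other in index[token]:
                candidates.add((other, rid))
            index[token].append(rid)
    return candidates


def verify(records, candidates, threshold):
    """Verification step: exact check of every candidate pair.

    Plain-Python stand-in for the work the paper assigns to the GPU.
    """
    sets = [set(r) for r in records]
    return sorted(
        (i, j) for i, j in candidates
        if jaccard(sets[i], sets[j]) >= threshold
    )


def similarity_join(records, threshold):
    """Self-join: filter candidates first, then verify them exactly."""
    return verify(records, candidate_pairs(records, threshold), threshold)


if __name__ == "__main__":
    data = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 6], [7, 8, 9], [1, 2, 3, 4, 5]]
    print(similarity_join(data, 0.6))
```

The filter only prunes pairs that cannot reach the threshold, so the result stays exact. Each candidate pair is checked independently of the others, which is what makes the verification step a natural fit for massively parallel hardware.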

Index Terms

Exact Set Similarity Joins for Large Datasets in the GPGPU paradigm
1. Information systems
  1. Data management systems

Published in

DaMoN'19: Proceedings of the 15th International Workshop on Data Management on New Hardware
July 2019
150 pages
ISBN:9781450368018
DOI:10.1145/3329785

Copyright © 2019 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 80 of 102 submissions, 78%

Article Metrics
- Total Citations: 1
- Total Downloads: 114
- Downloads (Last 12 months): 4
- Downloads (Last 6 weeks): 1