ABSTRACT
We investigate the problem of exact set similarity joins using a co-process CPU-GPU scheme. We focus on large instances of the problem, i.e., datasets of more than 1M entries, which may take hours to complete if not approached with care because of the problem's inherent quadratic complexity. We introduce a novel CPU-GPU co-process scheme that performs the initial filtering and indexing on the CPU and delegates the final verification to the GPU. Further, we show that, in several cases, this scheme improves upon the state-of-the-art standalone CPU and GPU solutions.
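To make the filter-then-verify split concrete, the following is a minimal sketch in Python, not the scheme evaluated in the paper. The abstract does not name a similarity function or specific filters, so the sketch assumes Jaccard similarity and uses the well-known prefix and length filters over an inverted index. The names `candidate_pairs`, `verify`, and `similarity_join` are hypothetical, and the verification step runs as plain Python on the CPU, standing in for the work that the scheme delegates to the GPU.

```python
from collections import defaultdict
from math import ceil

EPS = 1e-9  # guards against floating-point error in threshold arithmetic


def candidate_pairs(sets, threshold):
    """Filtering and indexing stage (assumed: prefix filter + length filter)."""
    # Global token order: rare tokens first, which keeps index lists short.
    freq = defaultdict(int)
    for s in sets:
        for tok in s:
            freq[tok] += 1
    records = [sorted(s, key=lambda t: (freq[t], t)) for s in sets]

    # Process sets by increasing size, so indexed sets are never larger
    # than the set currently probing the index.
    order = sorted(range(len(records)), key=lambda i: len(records[i]))
    index = defaultdict(list)  # token -> ids of sets with token in prefix
    candidates = set()
    for i in order:
        rec = records[i]
        if not rec:
            continue
        n = len(rec)
        # Two sets with Jaccard >= threshold must share a prefix token.
        prefix_len = n - ceil(threshold * n - EPS) + 1
        for tok in rec[:prefix_len]:
            for j in index[tok]:
                # Length filter: a much smaller set cannot reach the threshold.
                if len(records[j]) >= threshold * n - EPS:
                    candidates.add((min(i, j), max(i, j)))
            index[tok].append(i)
    return candidates


def verify(sets, pairs, threshold):
    """Verification stage: exact Jaccard check for each candidate pair.

    Each pair is independent of the others, which is what makes this
    step a natural fit for a data-parallel device.
    """
    result = []
    for i, j in pairs:
        inter = len(sets[i] & sets[j])
        union = len(sets[i]) + len(sets[j]) - inter
        if inter >= threshold * union - EPS:
            result.append((i, j))
    return sorted(result)


def similarity_join(sets, threshold):
    """Self-join: all pairs (i, j), i < j, with Jaccard >= threshold."""
    return verify(sets, candidate_pairs(sets, threshold), threshold)
```

For example, `similarity_join([{1, 2, 3, 4}, {1, 2, 3, 5}, {6, 7, 8}], 0.6)` generates one candidate pair and confirms it, since the first two sets share 3 of 5 distinct tokens. The filters never discard a true result; they only reduce how many pairs reach the more expensive verification step.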
Index Terms
- Exact Set Similarity Joins for Large Datasets in the GPGPU paradigm