DOI: 10.1145/3329785.3329919
research-article

Exact Set Similarity Joins for Large Datasets in the GPGPU paradigm

Published: 01 July 2019

ABSTRACT

We investigate the problem of exact set similarity joins using a CPU-GPU co-processing scheme. We focus on large instances of the problem, i.e., datasets of more than 1M entries, where a join may take hours to complete if not approached with care, due to the inherent quadratic complexity of the problem. We introduce a novel CPU-GPU co-processing scheme, which performs initial filtering and indexing on the CPU and delegates final verification to the GPU. Further, we show that this scheme improves upon the state of the art among both standalone CPU and standalone GPU solutions in several cases.
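
The division of work described in the abstract can be illustrated with a minimal filter-and-verify sketch. The abstract does not state which similarity function or filters are used, so everything below is an assumption made for illustration: Jaccard similarity, a standard prefix filter over an inverted index for the CPU stage, and hypothetical function names (`generate_candidates`, `verify`, `similarity_self_join`). The verification stage is written as an ordinary Python loop standing in for the work that the scheme would hand to the GPU.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty token sets."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def generate_candidates(records, threshold):
    """Filtering and indexing stage (assumed to run on the CPU).

    `records` is a list of token lists, each sorted by one global token
    order. Returns pairs (i, j), i < j, whose prefixes share a token.
    """
    index = defaultdict(list)  # token -> ids of records with it in their prefix
    candidates = set()
    for rid, tokens in enumerate(records):
        # Prefix length for a Jaccard threshold t: |r| - ceil(t * |r|) + 1
        prefix_len = len(tokens) - math.ceil(threshold * len(tokens)) + 1
        for token in tokens[:prefix_len]:
            for other in index[token]:
                candidates.add((other, rid))
            index[token].append(rid)
    return candidates


def verify(records, candidates, threshold):
    """Verification stage (a CPU stand-in for the GPU work).

    Computes the exact similarity of every candidate pair and keeps the
    pairs that reach the threshold.
    """
    sets = [set(r) for r in records]
    return sorted(
        (i, j) for i, j in candidates if jaccard(sets[i], sets[j]) >= threshold
    )


def similarity_self_join(records, threshold):
    """Run both stages: filter on the index, then verify the survivors."""
    return verify(records, generate_candidates(records, threshold), threshold)


records = [
    [1, 2, 3, 4],
    [1, 2, 3, 5],
    [1, 2, 3, 4, 6],
    [7, 8, 9],
]
result = similarity_self_join(records, 0.7)
print(result)  # [(0, 2)]
```

In this toy input the filter keeps three of the six possible pairs and verification confirms one of them. Each candidate pair is verified independently of the others, which is what makes the verification stage a natural fit for a massively parallel device.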

    Published in

      DaMoN'19: Proceedings of the 15th International Workshop on Data Management on New Hardware
      July 2019
      150 pages
      ISBN:9781450368018
      DOI:10.1145/3329785

      Copyright © 2019 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

      Overall Acceptance Rate: 80 of 102 submissions, 78%
