Efficient set containment join

Yang, Jianye; Zhang, Wenjie; Yang, Shiyu; Zhang, Ying; Lin, Xuemin; Yuan, Long

doi:10.1007/s00778-018-0505-x

Regular Paper
Published: 11 May 2018

Volume 27, pages 471–495, (2018)

The VLDB Journal

Abstract

In this paper, we study the problem of set containment join. Given two collections \(\mathcal {R}\) and \(\mathcal {S}\) of records, the set containment join \(\mathcal {R} \bowtie _\subseteq \mathcal {S}\) retrieves all record pairs \((r,s) \in \mathcal {R}\times \mathcal {S}\) such that \(r \subseteq s\). This problem has been extensively studied in the literature and has many important applications in commercial and scientific fields. Recent research focuses on in-memory set containment join algorithms, and several techniques have been developed following either the intersection-oriented or the union-oriented computing paradigm. Nevertheless, we observe that both paradigms have their limits due to the nature of the intersection and union operators. In particular, the intersection-oriented method relies on intersecting the relevant inverted lists built on the elements of \(\mathcal {S}\). A nice property of the intersection-oriented method is that the join computation is verification free. However, the number of records explored during the join may be large because each record in \(\mathcal {S}\) has multiple replicas in the index. On the other hand, the union-oriented method generates a signature for each record in \(\mathcal {R}\), and the candidate pairs are obtained by the union of the inverted lists of the relevant signatures. The candidate size of the union-oriented method is usually small because each record contributes only one replica to the index. Unfortunately, the union-oriented method needs to verify the candidate pairs, which may be expensive, especially when the join result is large. As a matter of fact, the state-of-the-art union-oriented solution is not competitive with the intersection-oriented ones.
In this paper, we propose a new union-oriented method, namely TT-Join, which not only enhances the advantage of the previous union-oriented methods but also integrates the strengths of intersection-oriented methods by imposing a variant of the prefix tree structure. We conduct extensive experiments on 20 real-life datasets and synthetic datasets, comparing our method with 7 existing methods. The experimental results demonstrate that TT-Join significantly outperforms the existing algorithms on most of the datasets and can achieve up to two orders of magnitude speedup. Furthermore, to support large-scale datasets, we extend our techniques to distributed systems on top of the MapReduce framework. With the help of carefully designed load-aware distribution mechanisms, our distributed join algorithm can achieve up to an order of magnitude speedup over the baseline methods.
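
To make the join definition and the intersection-oriented paradigm concrete, the following minimal Python sketch contrasts a naive nested-loop join with a join that builds an inverted index on the elements of S and intersects the lists for the elements of each r. It is an illustration under assumptions, not the algorithm proposed in the paper: the function names and the toy collections are hypothetical, and records are modelled as frozensets identified by their list position.

```python
from collections import defaultdict


def naive_containment_join(R, S):
    """Reference join: test r <= s for every pair (quadratic cost)."""
    return {(i, j) for i, r in enumerate(R) for j, s in enumerate(S) if r <= s}


def intersection_join(R, S):
    """Intersection-oriented sketch: inverted index on S, no index on R."""
    # Each record s is stored once per element, i.e. |s| replicas in the index.
    index = defaultdict(set)
    for j, s in enumerate(S):
        for e in s:
            index[e].add(j)

    all_ids = set(range(len(S)))
    result = set()
    for i, r in enumerate(R):
        if not r:
            # The empty set is contained in every record of S.
            matches = all_ids
        else:
            # Intersect the shortest lists first to shrink the running result early.
            lists = sorted((index.get(e, set()) for e in r), key=len)
            matches = set(lists[0])
            for ids in lists[1:]:
                matches &= ids
                if not matches:
                    break
        # No verification is needed: every surviving id contains all of r.
        result.update((i, j) for j in matches)
    return result


# Hypothetical toy data, chosen only for illustration.
R = [frozenset({1, 2}), frozenset({3}), frozenset({2, 5})]
S = [frozenset({1, 2, 3}), frozenset({2, 3, 5}), frozenset({4})]
pairs = intersection_join(R, S)
```

On this toy input both functions return the same set of (r, s) index pairs. The sketch also shows the cost noted in the abstract: the index holds one replica of s per element, so long records in S are touched many times.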

Notes

Algorithm 1 is named RI-Join in this paper since there is no index on \(\mathcal {R}\) and an inverted index is built on \(\mathcal {S}\).
The new simple union-oriented method is named IS-Join because an inverted index is built on \(\mathcal {R}\) and there is no index on \(\mathcal {S}\).
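
The general idea of indexing \(\mathcal {R}\) and scanning \(\mathcal {S}\) without an index can be sketched as follows. This is a hedged illustration of the union-oriented paradigm, not the paper's IS-Join or TT-Join: it assumes that the signature of a record is its least frequent element in S (a choice made here only for illustration), and the names `union_join` and `signature_index` are hypothetical.

```python
from collections import Counter, defaultdict


def union_join(R, S):
    """Union-oriented sketch: inverted index on R signatures, no index on S."""
    # Assumption: the signature of r is its least frequent element in S,
    # with ties broken by element order so the result is deterministic.
    freq = Counter(e for s in S for e in s)
    signature_index = defaultdict(list)
    empty_ids = []
    for i, r in enumerate(R):
        if r:
            sig = min(r, key=lambda e: (freq[e], e))
            signature_index[sig].append(i)  # exactly one replica per record
        else:
            empty_ids.append(i)  # the empty set is contained in every s

    result = set()
    for j, s in enumerate(S):
        result.update((i, j) for i in empty_ids)
        # Candidates are the union of the lists of every element of s.
        for e in s:
            for i in signature_index.get(e, ()):
                # Verification step: a shared signature does not imply r <= s.
                if R[i] <= s:
                    result.add((i, j))
    return result


# Hypothetical toy data, chosen only for illustration.
R = [frozenset({1, 2}), frozenset({3}), frozenset({2, 5})]
S = [frozenset({1, 2, 3}), frozenset({2, 3, 5}), frozenset({4})]
union_pairs = union_join(R, S)
```

If r is contained in s, then the signature of r is an element of s, so r always appears among the candidates for s and no result is missed. The explicit subset test is the verification cost that the abstract attributes to union-oriented methods.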

Acknowledgements

Ying Zhang is supported by ARC FT170100128 and DP180103096. Wenjie Zhang is supported by ARC DP180103096. Xuemin Lin is supported by NSFC 61672235, DP170101628, and DP180103096. Shiyu Yang is sponsored by the Shanghai Sailing Program.

Author information

Authors and Affiliations

Alibaba Group, Hangzhou, China
Jianye Yang
The University of New South Wales, Sydney, Australia
Wenjie Zhang, Xuemin Lin & Long Yuan
East China Normal University, Shanghai, China
Shiyu Yang
CAI, School of Software, University of Technology Sydney, Sydney, Australia
Ying Zhang

Corresponding author

Correspondence to Jianye Yang.

Additional information

The work was partly done when the author was studying at the University of New South Wales.

About this article

Cite this article

Yang, J., Zhang, W., Yang, S. et al. Efficient set containment join. The VLDB Journal 27, 471–495 (2018). https://doi.org/10.1007/s00778-018-0505-x

Received: 21 May 2017
Revised: 17 February 2018
Accepted: 26 April 2018
Published: 11 May 2018
Issue Date: August 2018
DOI: https://doi.org/10.1007/s00778-018-0505-x
