DOI: 10.1145/2811222.2811226
research-article

S3J: A Parallel Semi-Stream Similarity Join

Published: 22 October 2015

Abstract

Semi-stream join algorithms join a continuous stream with a large disk-based relation. While there are efficient semi-stream equijoins for exact matches in the joined data, there are currently no semi-stream similarity joins for approximate matches. The existing similarity join algorithms work either offline (on datasets that are fully known) or on several streams (using a join window), and are less suitable for applications where continuous, immediate and complete similarity join results are required. To address this gap we propose S3J, the first semi-stream similarity join algorithm. To utilize disk and CPU optimally, S3J combines a disk-intensive queue-based semi-stream join approach with a CPU-intensive similarity matching algorithm. The similarity matching algorithm is based on tries to minimize the memory footprint. Moreover, it supports parallel execution to utilize modern multicore CPUs. We provide a cost model for S3J and evaluate its performance empirically.
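To make the trie-based matching idea concrete, here is a minimal sketch of an edit-distance lookup over a trie. It is an illustration only, not the S3J implementation: the names `TrieNode`, `build_trie` and `similar_keys`, the choice of edit distance as the similarity measure, and the choice of which join side is indexed are all assumptions made for the example. Keys that share a prefix share trie nodes, so both the storage and the dynamic-programming rows for that prefix are computed once.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One trie node; `word` is set only where a stored key ends."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None


def build_trie(keys: List[str]) -> TrieNode:
    """Insert every key character by character, sharing common prefixes."""
    root = TrieNode()
    for key in keys:
        node = root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.word = key
    return root


def similar_keys(root: TrieNode, query: str, max_dist: int) -> List[Tuple[str, int]]:
    """Return (key, distance) for all stored keys within max_dist edits of query."""
    results: List[Tuple[str, int]] = []
    # Row for the empty prefix: turning "" into query[:i] costs i insertions.
    first_row = list(range(len(query) + 1))

    def walk(node: TrieNode, ch: str, prev_row: List[int]) -> None:
        # Extend the Levenshtein table by one row for the trie character `ch`.
        row = [prev_row[0] + 1]
        for i in range(1, len(query) + 1):
            cost = 0 if query[i - 1] == ch else 1
            row.append(min(row[i - 1] + 1, prev_row[i] + 1, prev_row[i - 1] + cost))
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        # Prune: if every cell exceeds the threshold, no descendant can match.
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    if root.word is not None and first_row[-1] <= max_dist:
        results.append((root.word, first_row[-1]))  # empty string stored as a key
    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return sorted(results)


# Hypothetical usage: probe a small set of indexed keys with one incoming value.
trie = build_trie(["smith", "smyth", "smithe", "jones"])
matches = similar_keys(trie, "smith", max_dist=1)
```

In a semi-stream setting, a lookup like this would run once per probing tuple, and independent probes could be distributed across worker threads or processes; how S3J itself partitions that work is not described in the abstract.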

Cited By

  • (2021) An Efficient Data Access Approach With Queue and Stack in Optimized Hybrid Join. IEEE Access, 9, 41261-41274. DOI: 10.1109/ACCESS.2021.3064202. Online publication date: 2021
  • (2020) Semi-Stream Similarity Join Processing in a Distributed Environment. IEEE Access, 8, 130194-130204. DOI: 10.1109/ACCESS.2020.3009414. Online publication date: 2020
  • (2015) DOLAP 2015 Workshop Summary. Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, 1939-1940. DOI: 10.1145/2806416.2806876. Online publication date: 17-Oct-2015

Published In

DOLAP '15: Proceedings of the ACM Eighteenth International Workshop on Data Warehousing and OLAP
October 2015
108 pages
ISBN:9781450337854
DOI:10.1145/2811222

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. semi-stream join
  2. similarity search
  3. trie

Qualifiers

  • Research-article

Conference

CIKM'15

Acceptance Rates

DOLAP '15 Paper Acceptance Rate 8 of 31 submissions, 26%;
Overall Acceptance Rate 29 of 79 submissions, 37%
