research-article

S3J: A Parallel Semi-Stream Similarity Join

Authors:

Muhammad Asif Naeem,

Christof Lutteroth,

Gerald Weber

DOLAP '15: Proceedings of the ACM Eighteenth International Workshop on Data Warehousing and OLAP

Pages 49 - 57

https://doi.org/10.1145/2811222.2811226

Published: 22 October 2015

Abstract

Semi-stream join algorithms join a continuous stream with a large disk-based relation. While there are efficient semi-stream equijoins for exact matches in the joined data, there are currently no semi-stream similarity joins for approximate matches. The existing similarity join algorithms work either offline (on datasets that are fully known) or on several streams (using a join window), and are less suitable for applications where continuous, immediate and complete similarity join results are required. To address this gap we propose S3J, the first semi-stream similarity join algorithm. To utilize disk and CPU optimally, S3J combines a disk-intensive queue-based semi-stream join approach with a CPU-intensive similarity matching algorithm. The similarity matching algorithm is based on tries to minimize the memory footprint. Moreover, it supports parallel execution to utilize modern multicore CPUs. We provide a cost model for S3J and evaluate its performance empirically.
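The trie-based similarity matching mentioned in the abstract can be illustrated with a minimal sketch. This is not the S3J algorithm itself: the abstract does not give its details, so the code below uses a standard, generic technique (Levenshtein distance computed row by row while walking a trie, with pruning at a threshold). The names `Trie`, `search`, `similarity_join`, and `max_dist` are hypothetical, and the disk-based relation and the stream are stood in for by small in-memory lists.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None  # set on nodes that end a stored key


class Trie:
    """Prefix tree over join keys; shared prefixes are stored only once."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word

    def search(self, query: str, max_dist: int) -> List[Tuple[str, int]]:
        """Return (key, distance) for stored keys within max_dist edits of query."""
        results: List[Tuple[str, int]] = []
        # Row for the empty prefix: distance from "" to each prefix of query.
        self._walk(self.root, list(range(len(query) + 1)), query, max_dist, results)
        return results

    def _walk(self, node: TrieNode, row: List[int], query: str,
              max_dist: int, results: List[Tuple[str, int]]) -> None:
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        # Row minima never decrease with depth, so this subtree cannot match.
        if min(row) > max_dist:
            return
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,    # insertion
                                   row[j] + 1,            # deletion
                                   row[j - 1] + cost))    # match or substitution
            self._walk(child, new_row, query, max_dist, results)


def similarity_join(stream: Iterable[str], relation_keys: Iterable[str],
                    max_dist: int) -> Iterator[Tuple[str, str, int]]:
    """Toy join: index the relation keys once, then probe with each stream item."""
    trie = Trie()
    for key in relation_keys:
        trie.insert(key)
    for item in stream:
        for key, dist in trie.search(item, max_dist):
            yield item, key, dist


if __name__ == "__main__":
    relation = ["smith", "smyth", "smithe", "jones"]
    for match in similarity_join(["smith", "jonse"], relation, max_dist=1):
        print(match)
```

The trie keeps memory use down because keys with a common prefix share nodes, and the same sharing lets one dynamic-programming row serve every key below a node. Which side of the join S3J indexes, and how it partitions work across cores, is not stated in the abstract and is not modeled here.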

Cited By

Aziz O., Anees T., Mehmood E. (2021). An Efficient Data Access Approach With Queue and Stack in Optimized Hybrid Join. IEEE Access, 9, 41261-41274. Online publication date: 2021.
https://doi.org/10.1109/ACCESS.2021.3064202
Kim H., Lee K. (2020). Semi-Stream Similarity Join Processing in a Distributed Environment. IEEE Access, 8, 130194-130204. Online publication date: 2020.
https://doi.org/10.1109/ACCESS.2020.3009414
Garcia-Alvarado C., Ordonez C., Song I., Bailey J., Moffat A., Aggarwal C., de Rijke M., Kumar R., Murdock V., Sellis T., Yu J. (2015). DOLAP 2015 Workshop Summary. Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, 1939-1940. Online publication date: 17-Oct-2015.
https://dl.acm.org/doi/10.1145/2806416.2806876

Index Terms

S3J: A Parallel Semi-Stream Similarity Join
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)


Published In

DOLAP '15: Proceedings of the ACM Eighteenth International Workshop on Data Warehousing and OLAP

October 2015

108 pages

ISBN:9781450337854

DOI:10.1145/2811222

General Chair: Il-Yeol Song (Drexel University, USA)

Program Chairs: Carlos Garcia-Alvarado (Pivotal Software Inc., USA), Carlos Ordonez (University of Houston & ATT, USA)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CIKM'15

Sponsor:

CIKM'15: 24th ACM International Conference on Information and Knowledge Management

October 23, 2015

Melbourne, Australia

Acceptance Rates

DOLAP '15 Paper Acceptance Rate 8 of 31 submissions, 26%;

Overall Acceptance Rate 29 of 79 submissions, 37%
