The Similarity-Aware Relational Intersect Database Operator

Marri, Wadha J. Al; Malluhi, Qutaibah; Ouzzani, Mourad; Tang, Mingjie; Aref, Walid G.

doi:10.1007/978-3-319-11988-5_15

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8821)

Included in the following conference series:

International Conference on Similarity Search and Applications

Abstract

Identifying similarities in large datasets is an essential operation in many applications such as bioinformatics, pattern recognition, and data integration. To make the underlying database system similarity-aware, the core relational operators have to be extended. Several similarity-aware relational operators have been proposed that introduce similarity processing at the database engine level, e.g., similarity joins and similarity group-by. This paper extends the semantics of the set intersection operator to operate over similar values. The paper describes the semantics of the similarity-based set intersection operator and develops an efficient query processing algorithm for evaluating it. The proposed operator is implemented inside an open-source database system, namely PostgreSQL. Several queries from the TPC-H benchmark are extended to include similarity-based set intersection predicates. Performance results demonstrate a speedup of up to three orders of magnitude over equivalent queries that employ only regular operators.
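
The abstract does not spell out the operator's semantics, so the following is only an illustrative sketch of what intersecting over similar values could mean, not the paper's definition or algorithm. It assumes one-dimensional numeric values and a hypothetical threshold `eps`: a value from the first input is kept if some value in the second input lies within `eps` of it. The function name `similarity_intersect` and the sort-plus-binary-search strategy are assumptions made for this example.

```python
from bisect import bisect_left


def similarity_intersect(r, s, eps):
    """Return the distinct values of r that lie within eps of some value in s.

    Hypothetical semantics for illustration only; the paper's actual
    definition and evaluation algorithm are not given in the abstract.
    """
    s_sorted = sorted(s)
    result = []
    for value in sorted(set(r)):
        # Locate the smallest element of s that is >= value - eps.
        i = bisect_left(s_sorted, value - eps)
        # It is a match only if it also does not exceed value + eps.
        if i < len(s_sorted) and s_sorted[i] <= value + eps:
            result.append(value)
    return result


r = [10, 20, 35, 50]
s = [11, 36, 100]
print(similarity_intersect(r, s, eps=2))  # [10, 35]
print(similarity_intersect(r, s, eps=0))  # [] (no exact matches)
```

With `eps=0` the sketch degenerates to ordinary set intersection, which is the sense in which a similarity-aware operator extends the regular one.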

Author information

Authors and Affiliations

Qatar University, Doha, Qatar
Wadha J. Al Marri & Qutaibah Malluhi
Qatar Computing Research Institute, Doha, Qatar
Mourad Ouzzani
Purdue University, West Lafayette, IN, USA
Mingjie Tang & Walid G. Aref

Editor information

Editors and Affiliations

Department of Computer Science, University of São Paulo, São Carlos, Brazil
Agma Juci Machado Traina
University of Sao Paulo at Sao Carlos - USP, Av. do Trabalhador Saocarlense 400, 13566-590, Sao Carlos, Brazil
Caetano Traina Jr.
University of Sao Paulo at Sao Carlos - USP, Av. do Trabalhador Saocarlense 400, 13566-590, Sao Carlos, Brazil
Robson Leonardo Ferreira Cordeiro

Cite this paper

Marri, W.J.A., Malluhi, Q., Ouzzani, M., Tang, M., Aref, W.G. (2014). The Similarity-Aware Relational Intersect Database Operator. In: Traina, A.J.M., Traina, C., Cordeiro, R.L.F. (eds) Similarity Search and Applications. SISAP 2014. Lecture Notes in Computer Science, vol 8821. Springer, Cham. https://doi.org/10.1007/978-3-319-11988-5_15

DOI: https://doi.org/10.1007/978-3-319-11988-5_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11987-8
Online ISBN: 978-3-319-11988-5
eBook Packages: Computer Science, Computer Science (R0)
