research-article

Estimating the selectivity of tf-idf based cosine similarity predicates

Authors: Sandeep Tata, Jignesh M. Patel

ACM SIGMOD Record, Volume 36, Issue 4

Pages 75 - 80

https://doi.org/10.1145/1361348.1361351

Published: 01 December 2007


Abstract

An increasing number of database applications today require sophisticated approximate string matching capabilities. Examples of such application areas include data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it is increasingly being used in complex queries. An immediate challenge faced by current database optimizers is to find accurate and efficient methods for estimating the selectivity of cosine similarity predicates. To the best of our knowledge, there are no known methods for this problem. In this paper, we present the first approach for estimating the selectivity of tf.idf based cosine similarity predicates. We evaluate our approach on three different real datasets and show that our method often produces estimates that are within 40% of the actual selectivity.
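To make the quantity being estimated concrete, the sketch below computes the exact selectivity of a tf-idf cosine similarity predicate by brute force: the fraction of strings in a collection whose cosine similarity to a query string meets a threshold. This is not the estimation method of the paper, which is not described in the abstract; it is only the ground truth that such an estimator would be compared against. The whitespace tokenization, the smoothed idf formula, and all function names (`build_idf`, `vectorize`, `cosine`, `exact_selectivity`) are illustrative assumptions.

```python
import math
from collections import Counter


def build_idf(strings):
    """Smoothed inverse document frequency for every token in the collection.

    Assumption: tokens are lowercase whitespace-separated words, and
    idf = log(1 + N / df). Other tokenizers (e.g. q-grams) and idf
    variants are equally plausible.
    """
    n = len(strings)
    df = Counter(tok for s in strings for tok in set(s.lower().split()))
    return {tok: math.log(1.0 + n / count) for tok, count in df.items()}


def vectorize(text, idf):
    """Unit-length tf-idf vector as a dict; tokens unseen in the collection are ignored."""
    tf = Counter(tok for tok in text.lower().split() if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm > 0 else {}


def cosine(u, v):
    """Dot product of two sparse vectors that are already unit length."""
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def exact_selectivity(query, strings, threshold):
    """Fraction of strings whose cosine similarity to the query is >= threshold."""
    if not strings:
        return 0.0
    idf = build_idf(strings)
    q = vectorize(query, idf)
    hits = sum(1 for s in strings if cosine(q, vectorize(s, idf)) >= threshold)
    return hits / len(strings)


if __name__ == "__main__":
    names = ["sandeep tata", "jignesh patel", "sandeep tata", "query optimizer"]
    print(exact_selectivity("sandeep tata", names, threshold=0.9))
```

The brute-force scan touches every string, which is the cost a query optimizer cannot afford at planning time; an estimator has to approximate the same fraction from summary statistics.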




Published In

ACM SIGMOD Record, Volume 36, Issue 4, December 2007, 58 pages

ISSN: 0163-5808

DOI: 10.1145/1361348

Publisher

Association for Computing Machinery, New York, NY, United States
