Estimating the selectivity of tf-idf based cosine similarity predicates

Published: 01 June 2007

Abstract

An increasing number of database applications today require sophisticated approximate string matching capabilities. Examples of such application areas include data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it is increasingly being used in complex queries. An immediate challenge faced by current database optimizers is to find accurate and efficient methods for estimating the selectivity of cosine similarity predicates. To the best of our knowledge, there are no known methods for this problem. In this paper, we present the first approach for estimating the selectivity of tf.idf based cosine similarity predicates. We evaluate our approach on three different real datasets and show that our method often produces estimates that are within 40% of the actual selectivity.
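
To make the estimated quantity concrete, the sketch below computes the exact selectivity of a cosine similarity predicate by brute force: the fraction of strings in a collection whose tf-idf cosine similarity to a query string meets a threshold. This is not the estimation method of the paper, only the ground truth that such an estimator would approximate. The whitespace tokenization, the idf formula log(1 + N/df), and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(s):
    # Assumption: lowercase whitespace tokens; q-grams would work the same way.
    return s.lower().split()


def build_idf(strings):
    # Document frequency counts each token once per string.
    n = len(strings)
    df = Counter()
    for s in strings:
        df.update(set(tokenize(s)))
    # Assumption: idf(t) = log(1 + N / df(t)); other variants exist.
    return {t: math.log(1 + n / d) for t, d in df.items()}


def tfidf_vector(s, idf):
    # Sparse tf-idf vector scaled to unit length; unknown tokens are dropped.
    tf = Counter(tokenize(s))
    vec = {t: c * idf[t] for t, c in tf.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def cosine(u, v):
    # Dot product of two unit-length sparse vectors.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def exact_selectivity(query, strings, threshold):
    # Fraction of strings whose similarity to the query is >= threshold.
    idf = build_idf(strings)
    q = tfidf_vector(query, idf)
    hits = sum(1 for s in strings if cosine(q, tfidf_vector(s, idf)) >= threshold)
    return hits / len(strings)


strings = ["data cleaning", "data integration", "query optimizer"]
print(exact_selectivity("data cleaning", strings, 0.9))
```

The brute-force scan touches every string for every query, which is the cost an optimizer cannot afford at planning time and the reason an estimator is needed.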

Published In

ACM SIGMOD Record  Volume 36, Issue 2
June 2007
38 pages
ISSN:0163-5808
DOI:10.1145/1328854

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Column
