research-article

Estimating the selectivity of tf-idf based cosine similarity predicates

Published: 01 December 2007

Abstract

An increasing number of database applications today require sophisticated approximate string matching capabilities. Examples of such application areas include data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it is increasingly being used in complex queries. An immediate challenge faced by current database optimizers is to find accurate and efficient methods for estimating the selectivity of cosine similarity predicates. To the best of our knowledge, there are no known methods for this problem. In this paper, we present the first approach for estimating the selectivity of tf.idf based cosine similarity predicates. We evaluate our approach on three different real datasets and show that our method often produces estimates that are within 40% of the actual selectivity.
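To make the quantity being estimated concrete, the sketch below computes the exact selectivity of a cosine similarity predicate by brute force: the fraction of strings in a collection whose tf-idf cosine similarity to a query meets a threshold. This is not the estimation method proposed in the paper; it is only the ground truth that such an estimator would approximate. The whitespace tokenization, the idf variant `log(1 + N/df)`, and all function names are illustrative assumptions.

```python
import math
from collections import Counter


def _normalize(weights):
    """Scale a sparse vector (dict of token -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()} if norm else {}


def tfidf_vectors(strings):
    """Return unit-length tf-idf vectors for each string, plus the idf table.

    Assumption: tokens are lowercase whitespace-separated words, and
    idf(t) = log(1 + N / df(t)), one of several common variants.
    """
    docs = [Counter(s.lower().split()) for s in strings]
    n = len(docs)
    df = Counter(tok for d in docs for tok in d)  # document frequency per token
    idf = {tok: math.log(1 + n / df[tok]) for tok in df}
    vectors = [_normalize({tok: tf * idf[tok] for tok, tf in d.items()}) for d in docs]
    return vectors, idf


def query_vector(query, idf):
    """Weight the query with the collection's idf; unseen tokens are dropped."""
    tf = Counter(query.lower().split())
    return _normalize({tok: c * idf[tok] for tok, c in tf.items() if tok in idf})


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def exact_selectivity(query, strings, threshold):
    """Fraction of strings whose cosine similarity to the query is >= threshold."""
    vectors, idf = tfidf_vectors(strings)
    q = query_vector(query, idf)
    matches = sum(1 for v in vectors if cosine(q, v) >= threshold)
    return matches / len(strings)


if __name__ == "__main__":
    names = ["john smith", "jon smith", "jane doe", "acme corp"]
    print(exact_selectivity("john smith", names, 0.9))
    print(exact_selectivity("john smith", names, 0.2))
```

Computing this value requires scanning every string, which is exactly the cost a query optimizer cannot afford at planning time; hence the need for an estimator.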


Published In

ACM SIGMOD Record, Volume 36, Issue 4
December 2007
58 pages
ISSN: 0163-5808
DOI: 10.1145/1361348

Publisher

Association for Computing Machinery

New York, NY, United States

