Abstract
An increasing number of database applications today require sophisticated approximate string matching capabilities. Examples of such application areas include data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it is increasingly being used in complex queries. An immediate challenge faced by current database optimizers is to find accurate and efficient methods for estimating the selectivity of cosine similarity predicates. To the best of our knowledge, there are no known methods for this problem. In this paper, we present the first approach for estimating the selectivity of tf.idf based cosine similarity predicates. We evaluate our approach on three different real datasets and show that our method often produces estimates that are within 40% of the actual selectivity.
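To make the estimated quantity concrete, the following minimal sketch computes the exact selectivity of a cosine similarity predicate by brute force: the fraction of rows whose tf-idf cosine similarity to the query string meets a threshold. This is not the estimation method the paper presents; it is only the ground truth that such an estimator would try to approximate without scanning every row. The whitespace tokenization, the idf formula, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens; the paper's tokenization may differ.
    return text.lower().split()


def build_idf(rows):
    # Document frequency: number of rows in which each token appears.
    n = len(rows)
    df = Counter()
    for row in rows:
        df.update(set(tokenize(row)))
    # Assumption: idf = log(1 + n / df), one common variant among several.
    return {tok: math.log(1.0 + n / count) for tok, count in df.items()}


def tfidf_vector(text, idf):
    # Sparse, L2-normalized tf-idf vector; tokens unseen in the table are dropped.
    tf = Counter(tokenize(text))
    weights = {tok: freq * idf[tok] for tok, freq in tf.items() if tok in idf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return {}
    return {tok: w / norm for tok, w in weights.items()}


def cosine(u, v):
    # Dot product of two unit-length sparse vectors; iterate over the smaller one.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def exact_selectivity(query, rows, threshold):
    # Fraction of rows satisfying cosine(query, row) >= threshold.
    if not rows:
        return 0.0
    idf = build_idf(rows)
    q = tfidf_vector(query, idf)
    hits = sum(1 for row in rows if cosine(q, tfidf_vector(row, idf)) >= threshold)
    return hits / len(rows)


if __name__ == "__main__":
    table = ["acme corp", "acme corporation", "globex inc", "initech llc"]
    print(exact_selectivity("acme corp", table, 0.9))
```

The brute-force scan costs one similarity computation per row, which is exactly the work an optimizer cannot afford at planning time and the reason a cheap estimate is needed.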