Estimating the selectivity of tf-idf based cosine similarity predicates

Published: 01 December 2007

Abstract

An increasing number of database applications today require sophisticated approximate string matching capabilities. Examples of such application areas include data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it is increasingly being used in complex queries. An immediate challenge faced by current database optimizers is to find accurate and efficient methods for estimating the selectivity of cosine similarity predicates. To the best of our knowledge, there are no known methods for this problem. In this paper, we present the first approach for estimating the selectivity of tf.idf based cosine similarity predicates. We evaluate our approach on three different real datasets and show that our method often produces estimates that are within 40% of the actual selectivity.
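To make the estimated quantity concrete, the following is a minimal sketch of the exact selectivity of a cosine similarity predicate, computed by brute force over a small collection. It is not the estimation method proposed in the paper. Whitespace tokenization, the idf variant log(1 + N/df), and all function names (`tokenize`, `build_idf`, `tfidf_vector`, `cosine`, `exact_selectivity`) are assumptions made for illustration only.

```python
import math
from collections import Counter


def tokenize(s):
    # Assumption: lowercase whitespace-separated words are the tokens.
    return s.lower().split()


def build_idf(strings):
    # Document frequency counts each token at most once per string.
    n = len(strings)
    df = Counter()
    for s in strings:
        df.update(set(tokenize(s)))
    # Assumption: one common idf variant, log(1 + N / df).
    return {t: math.log(1 + n / d) for t, d in df.items()}


def tfidf_vector(s, idf):
    # Sparse tf-idf vector, normalized to unit length.
    tf = Counter(tokenize(s))
    vec = {t: c * idf[t] for t, c in tf.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def cosine(u, v):
    # Dot product of two unit-length sparse vectors.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def exact_selectivity(query, strings, threshold):
    # Fraction of strings whose cosine similarity to the query
    # is at least the threshold (full scan, no estimation).
    if not strings:
        return 0.0
    idf = build_idf(strings)
    q = tfidf_vector(query, idf)
    matches = sum(
        1 for s in strings if cosine(q, tfidf_vector(s, idf)) >= threshold
    )
    return matches / len(strings)


strings = ["john smith", "jon smith", "mary jones", "john smyth"]
high = exact_selectivity("john smith", strings, 0.9)
low = exact_selectivity("john smith", strings, 0.3)
```

A full scan like this costs time linear in the collection size for every query, which is why an optimizer needs a cheap estimate of the same fraction instead.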

Published in

ACM SIGMOD Record, Volume 36, Issue 4
December 2007
58 pages
ISSN: 0163-5808
DOI: 10.1145/1361348

Copyright © 2007 Authors

Publisher: Association for Computing Machinery, New York, NY, United States
