DOI: 10.1145/1807167.1807177
research-article
Results Reproduced / v1.1

Sampling dirty data for matching attributes

Published: 06 June 2010

ABSTRACT

We investigate the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes. Our focus is on identifying attribute pairs whose value sets overlap, a precondition for typical joins over such attributes. However, real-world data sets are often 'dirty', especially when integrating data from different sources. To deal with this issue, we propose new similarity measures between sets of strings, which consider not only set-based similarity but also similarity between individual string instances. To make the measures effective, we develop efficient algorithms for distributed sample creation and similarity computation. Test results show that, for dirty data, our measures estimate value overlap more accurately than existing sample-based methods, but we also observe a clear tradeoff between accuracy and speed. This motivates a two-stage filtering approach, with both measures operating on the same samples.
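
To illustrate why exact set overlap can fail on dirty data, here is a minimal Python sketch that contrasts exact containment with a "soft" containment that also accepts near-matching strings. It is not the measure defined in the paper, which the abstract does not spell out: the q-gram Jaccard string similarity, the function names (`qgrams`, `string_sim`, `exact_containment`, `soft_containment`), and the 0.5 threshold are all assumptions chosen for illustration.

```python
def qgrams(s, q=2):
    """Return the set of q-grams of s (lowercased, padded with '#')."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def string_sim(a, b, q=2):
    """Jaccard similarity of the q-gram sets of two strings (assumed measure)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)


def exact_containment(a_vals, b_vals):
    """Fraction of distinct values in a_vals that appear verbatim in b_vals."""
    a, b = set(a_vals), set(b_vals)
    return len(a & b) / len(a) if a else 0.0


def soft_containment(a_vals, b_vals, threshold=0.5):
    """Fraction of distinct values in a_vals with a similar string in b_vals.

    The threshold is a hypothetical tuning knob, not a value from the paper.
    """
    a, b = set(a_vals), set(b_vals)
    if not a:
        return 0.0
    matched = sum(
        1 for x in a if any(string_sim(x, y) >= threshold for y in b)
    )
    return matched / len(a)


if __name__ == "__main__":
    clean = ["New York", "Boston", "Chicago"]
    dirty = ["new york", "Bostn", "Chicago ", "Denver"]
    print("exact:", exact_containment(clean, dirty))
    print("soft: ", soft_containment(clean, dirty))
```

On these two small columns, exact containment finds no overlap because of case, spelling, and whitespace differences, while the soft variant matches all three values. The soft variant compares every pair of strings, which hints at the accuracy versus speed tradeoff the abstract mentions.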

    • Published in

      SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
      June 2010
      1286 pages
      ISBN: 9781450300322
      DOI: 10.1145/1807167

      Copyright © 2010 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 785 of 4,003 submissions, 20%
