DOI:10.1145/2213836.2213962
research-article

Finding related tables

Published: 20 May 2012

ABSTRACT

We consider the problem of finding related tables in a large corpus of heterogeneous tables. Detecting related tables gives users a powerful tool for enhancing their tables with additional data and enables effective reuse of available public data. Our first contribution is a framework that captures several types of relatedness, including tables that are candidates for joins and tables that are candidates for union. Our second contribution is a set of algorithms for detecting related tables that can be either unioned or joined. We describe a set of experiments demonstrating that our algorithms produce highly related tables. We also show that we can often improve the results of table search by pulling up tables that are ranked much lower, based on their relatedness to top-ranked tables. Finally, we describe how to scale up our algorithms and show the results of running them on a corpus of over a million tables extracted from Wikipedia.
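The abstract distinguishes two kinds of relatedness: tables that could be unioned and tables that could be joined. The sketch below is a toy illustration of that distinction only, not the paper's framework or algorithms. It assumes a table is a plain dictionary with a `columns` list and a `rows` list, and it scores relatedness with Jaccard overlap; the function names, the table layout, and the choice of Jaccard are all hypothetical choices made for this example.

```python
# Toy illustration only: NOT the algorithms from the paper.
# Assumption: a table is {"columns": [...], "rows": [ {col: value}, ... ]}.

def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def union_score(table_a, table_b):
    """Schema overlap: high when the two tables share most column names."""
    return jaccard(table_a["columns"], table_b["columns"])


def join_score(table_a, table_b, key_a, key_b):
    """Value overlap between one candidate key column from each table."""
    values_a = [row[key_a] for row in table_a["rows"]]
    values_b = [row[key_b] for row in table_b["rows"]]
    return jaccard(values_a, values_b)


capitals = {
    "columns": ["country", "capital"],
    "rows": [
        {"country": "France", "capital": "Paris"},
        {"country": "Japan", "capital": "Tokyo"},
    ],
}
more_capitals = {
    "columns": ["country", "capital"],
    "rows": [{"country": "Brazil", "capital": "Brasília"}],
}
continents = {
    "columns": ["country", "continent"],
    "rows": [
        {"country": "France", "continent": "Europe"},
        {"country": "Japan", "continent": "Asia"},
        {"country": "Brazil", "continent": "South America"},
    ],
}

# Same schema, different rows: a union candidate.
print(union_score(capitals, more_capitals))
# Different schema, shared key values: a join candidate.
print(join_score(capitals, continents, "country", "country"))
```

In this toy setup, `capitals` and `more_capitals` share a schema but not rows, so they score highly as a union candidate, while `capitals` and `continents` share key values but add different attributes, so they score highly as a join candidate.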

    Published in

      SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
      May 2012
      886 pages
      ISBN:9781450312479
      DOI:10.1145/2213836

      Copyright © 2012 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      SIGMOD '12 paper acceptance rate: 48 of 289 submissions, 17%. Overall acceptance rate: 785 of 4,003 submissions, 20%.
