research-article

Finding related tables

Authors:
Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, Cong Yu

Google, Mountain View, CA, USA

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, May 2012, Pages 817–828
https://doi.org/10.1145/2213836.2213962

Published: 20 May 2012

ABSTRACT

We consider the problem of finding related tables in a large corpus of heterogeneous tables. Detecting related tables gives users a powerful tool for enhancing their tables with additional data and enables effective reuse of available public data. Our first contribution is a framework that captures several types of relatedness, including tables that are candidates for joins and tables that are candidates for union. Our second contribution is a set of algorithms for detecting related tables that can be either unioned or joined. We describe a set of experiments that demonstrate that our algorithms produce highly related tables. We also show that we can often improve the results of table search by using relatedness to top-ranked tables to pull up tables that were originally ranked much lower. Finally, we describe how to scale up our algorithms and show the results of running them on a corpus of over a million tables extracted from Wikipedia.
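The two kinds of relatedness named in the abstract (union candidates and join candidates) can be illustrated with a small sketch. This is not the paper's algorithm: the functions `union_score` and `join_score`, the use of Jaccard similarity, and the toy tables are all assumptions made here for illustration only.

```python
from typing import Dict, List

# A table is modeled as a mapping from column name to its cell values.
Table = Dict[str, List[str]]


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def union_score(t1: Table, t2: Table) -> float:
    """Hypothetical union-candidate score: overlap of normalized column names.

    Tables with the same schema but different rows score high.
    """
    names1 = {c.strip().lower() for c in t1}
    names2 = {c.strip().lower() for c in t2}
    return jaccard(names1, names2)


def join_score(t1: Table, t2: Table) -> float:
    """Hypothetical join-candidate score: best value overlap over all column pairs.

    Tables that share many values in some column score high.
    """
    best = 0.0
    for v1 in t1.values():
        for v2 in t2.values():
            best = max(best, jaccard(set(v1), set(v2)))
    return best


# Toy tables with invented placeholder values.
cities_a = {"City": ["Oslo", "Lima", "Cairo"], "Mayor": ["m1", "m2", "m3"]}
cities_b = {"City": ["Quito", "Hanoi"], "Mayor": ["m4", "m5"]}
climate = {"City": ["Oslo", "Lima", "Perth"], "Climate": ["c1", "c2", "c3"]}

print(union_score(cities_a, cities_b))  # same schema, disjoint rows
print(join_score(cities_a, climate))    # shared values in the City column
```

In this sketch, `cities_a` and `cities_b` look like a union candidate (identical column names, no shared rows), while `cities_a` and `climate` look like a join candidate (overlapping `City` values, different second column). A real system would need more robust signals than exact string matches on names and values.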


Index Terms

Finding related tables
1. Information systems


Published in
SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
May 2012
886 pages
ISBN:9781450312479
DOI:10.1145/2213836
General Chairs: K. Selçuk Candan (Arizona State University), Yi Chen (Arizona State University), Richard Snodgrass (University of Arizona)
Program Chair: Luis Gravano (Columbia University)
Publications Chair: Ariel Fuxman (Microsoft Research)
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data integration
related tables
web tables
Acceptance Rates
SIGMOD '12 Paper Acceptance Rate: 48 of 289 submissions, 17%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%
Article Metrics
- Total Citations: 105
- Total Downloads: 1,095
- Downloads (Last 12 months): 67
- Downloads (Last 6 weeks): 8