ABSTRACT
We consider the problem of finding related tables in a large corpus of heterogeneous tables. Detecting related tables gives users a powerful tool for enhancing their tables with additional data and enables effective reuse of available public data. Our first contribution is a framework that captures several types of relatedness, including tables that are candidates for joins and tables that are candidates for union. Our second contribution is a set of algorithms for detecting related tables that can be either unioned or joined. We describe a set of experiments demonstrating that our algorithms produce highly related tables. We also show that we can often improve the results of table search by pulling up tables that were ranked much lower, based on their relatedness to top-ranked tables. Finally, we describe how to scale up our algorithms and show the results of running them on a corpus of over a million tables extracted from Wikipedia.
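To make the two kinds of relatedness concrete, the following is a minimal sketch and not the algorithms of this paper. It assumes a table is a plain dictionary with `columns` and `rows`, and it uses Jaccard overlap as a stand-in measure: overlap of column names as a rough signal for union candidates, and overlap of the values in a chosen column as a rough signal for join candidates. The function names and the toy tables are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def union_score(t1, t2):
    """Illustrative union signal: overlap of the two tables' column names."""
    return jaccard(t1["columns"], t2["columns"])


def join_score(t1, t2, key1, key2):
    """Illustrative join signal: overlap of values in one column of each table."""
    i = t1["columns"].index(key1)
    j = t2["columns"].index(key2)
    return jaccard((row[i] for row in t1["rows"]),
                   (row[j] for row in t2["rows"]))


# Toy tables (hypothetical data layout, chosen only for illustration).
asia = {"columns": ["country", "capital"],
        "rows": [("Japan", "Tokyo"), ("India", "New Delhi")]}
europe = {"columns": ["country", "capital"],
          "rows": [("France", "Paris"), ("Spain", "Madrid")]}
currency = {"columns": ["country", "currency"],
            "rows": [("Japan", "yen"), ("India", "rupee")]}

# Same schema, different entities: a union candidate, not a join candidate.
print(union_score(asia, europe), join_score(asia, europe, "country", "country"))

# Same entities, different attributes: a join candidate.
print(join_score(asia, currency, "country", "country"))
```

In this sketch the two signals separate cleanly on toy data. Real Web tables have noisy or missing headers and inconsistent entity names, which is why a practical system needs more robust measures than exact set overlap.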