ABSTRACT
Cross-domain knowledge bases such as DBpedia, YAGO, or the Google Knowledge Graph have gained increasing attention in recent years and are starting to be deployed in various use cases. However, the content of such knowledge bases is far from complete, is not always correct, and suffers from deprecation (e.g., population numbers become outdated after some time). Hence, there are efforts to leverage various types of Web data to complement, update, and extend such knowledge bases. One source of Web data that potentially provides very wide coverage is the millions of relational HTML tables found on the Web. The existing work on using data from Web tables to augment cross-domain knowledge bases reports only aggregated performance numbers. The actual content of the Web tables and the topical areas of the knowledge bases that can be complemented using the tables remain unclear. In this paper, we match a large, publicly available Web table corpus to the DBpedia knowledge base. Based on the matching results, we profile the potential of Web tables for augmenting different parts of cross-domain knowledge bases and report detailed statistics about the classes, properties, and instances for which missing values can be filled using Web table data as evidence. To estimate the potential quality of the new values, we empirically examine the Local Closed World Assumption and use it to determine the maximal number of correct facts that an ideal data fusion strategy could generate. Using this as ground truth, we compare three data fusion strategies and conclude that knowledge-based trust outperforms PageRank- and voting-based fusion.
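The two ideas named at the end of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `lcwa_label` and `fuse_by_voting`, the dictionary layout of the knowledge base, and the toy triples are all assumptions made for illustration. Under the Local Closed World Assumption, a candidate value is treated as correct if the knowledge base already contains it for that subject and property, as incorrect if the knowledge base contains only other values for that slot, and as unknown if the slot is empty. Voting-based fusion, the simplest of the three strategies compared, selects for each slot the value asserted by the most tables.

```python
from collections import Counter, defaultdict


def lcwa_label(kb, subject, prop, value):
    """Label a candidate fact under the Local Closed World Assumption.

    kb is a hypothetical dict mapping (subject, property) to a set of known values.
    Returns True (correct), False (assumed incorrect), or None (slot unknown).
    """
    known = kb.get((subject, prop))
    if not known:
        return None  # the KB has no value for this slot, so nothing can be judged
    return value in known


def fuse_by_voting(candidates):
    """Pick, per (subject, property) slot, the value with the most votes.

    candidates is an iterable of (subject, property, value, source) tuples;
    each tuple counts as one vote from one table.
    """
    votes = defaultdict(Counter)
    for subject, prop, value, _source in candidates:
        votes[(subject, prop)][value] += 1
    return {slot: counter.most_common(1)[0][0] for slot, counter in votes.items()}


# Toy data, invented for illustration only.
kb = {("Berlin", "country"): {"Germany"}}
candidates = [
    ("Berlin", "country", "Germany", "table1"),
    ("Berlin", "country", "Germany", "table2"),
    ("Berlin", "country", "France", "table3"),
    ("Berlin", "leader", "Some Person", "table1"),
]

fused = fuse_by_voting(candidates)
labels = {slot: lcwa_label(kb, slot[0], slot[1], value) for slot, value in fused.items()}
```

In this sketch the fused value for `("Berlin", "country")` can be checked against the knowledge base, whereas the fused value for `("Berlin", "leader")` is labeled unknown because the knowledge base has no value for that slot. Such unknown slots are exactly where a Web table could contribute a new fact.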
Index Terms
- Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases