DOI: 10.1145/2808194.2809453

Context Retrieval for Web Tables

Published: 27 September 2015

Abstract

Many modern knowledge bases are built by extracting information from millions of web pages. Though existing extraction methods primarily focus on web pages' main text, a huge amount of information is embedded within other web structures, such as web tables. Previous studies have shown that linking web page tables and textual context is beneficial for extracting more information from web pages. However, using the text surrounding each table without carefully assessing its relevance introduces noise in the extracted information, degrading its accuracy. To the best of our knowledge, we provide the first systematic study of the problem of table-related context retrieval: given a table and the sentences within the same web page, determine for each sentence whether it is relevant to the table. We define the concept of relevance and introduce a Table-Related Context Retrieval system (TRCR) in this paper. We experiment with different machine learning algorithms, including a recently developed algorithm that is robust to biases in the training data, and show that our system retrieves table-related context with F1=0.735.
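
The task stated in the abstract can be read as per-sentence binary classification, evaluated with F1. The following is a minimal sketch under that reading, not the TRCR system itself: the token-overlap heuristic, the function names (`tokens`, `overlap_score`, `predict`, `f1`), and the 0.3 threshold are all assumptions made for illustration.

```python
import re
from typing import List, Sequence, Set


def tokens(text: str) -> Set[str]:
    """Return the set of lowercase alphanumeric tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(table_cells: Sequence[str], sentence: str) -> float:
    """Fraction of the sentence's distinct tokens that also occur in the table."""
    table_vocab: Set[str] = set()
    for cell in table_cells:
        table_vocab |= tokens(cell)
    sent = tokens(sentence)
    return len(sent & table_vocab) / len(sent) if sent else 0.0


def predict(table_cells: Sequence[str], sentences: Sequence[str],
            threshold: float = 0.3) -> List[bool]:
    """Label each sentence as relevant (True) or not (False).

    The threshold is a hypothetical value, not one taken from the paper.
    """
    return [overlap_score(table_cells, s) >= threshold for s in sentences]


def f1(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """F1 of the positive (relevant) class: harmonic mean of precision and recall."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A learned classifier would replace `overlap_score` with a trained model over richer features. The sketch only fixes the input and output shape of the problem: one table, many sentences, one relevance label per sentence.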

Cited By

  • (2020) Web Table Extraction, Retrieval, and Augmentation. ACM Transactions on Intelligent Systems and Technology 11:2 (1-35). DOI: 10.1145/3372117. Online publication date: 25-Jan-2020
  • (2016) Table Topic Models for Hidden Unit Estimation. Information Retrieval Technology (302-307). DOI: 10.1007/978-3-319-48051-0_23. Online publication date: 15-Oct-2016

    Published In

    ICTIR '15: Proceedings of the 2015 International Conference on The Theory of Information Retrieval
    September 2015
    402 pages
    ISBN:9781450338332
    DOI:10.1145/2808194

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. context retrieval
    2. covariate shift
    3. web tables

    Qualifiers

    • Research-article

    Conference

    ICTIR '15

    Acceptance Rates

    ICTIR '15 Paper Acceptance Rate 29 of 57 submissions, 51%;
    Overall Acceptance Rate 235 of 527 submissions, 45%
