poster

Web information extraction using Markov logic networks

Published: 21 August 2011

Abstract

In this paper, we consider the problem of extracting structured data from web pages, taking into account both the content of individual attributes and the structure of pages and sites. We use Markov Logic Networks (MLNs) to capture content and structural features in a single unified framework, which enables more accurate inference. MLNs allow us to model a wide range of rich structural features, such as proximity, precedence, alignment, and contiguity, using first-order clauses. We show that inference in our information extraction scenario reduces to solving an instance of the maximum weight subgraph problem. We develop specialized procedures for solving the maximum subgraph variants that are far more efficient than previously proposed inference methods for MLNs, which solve variants of MAX-SAT. Experiments with real-life datasets demonstrate the effectiveness of our MLN-based approach compared to existing state-of-the-art extraction methods.
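To make the idea of weighted first-order clauses concrete, the following is a minimal sketch, not the paper's model or its inference procedure. It assumes the standard MLN scoring rule, in which an assignment is scored by the sum over clauses of weight times the number of true groundings, and finds the best labeling by brute force instead of the specialized maximum weight subgraph procedures the abstract describes. The toy page (`NODES`), the label set, the three clauses, and their weights are all hypothetical.

```python
from itertools import product

LABELS = ("name", "phone", "other")

# Hypothetical evidence: one dict per text node, listed in document order.
NODES = [
    {"bold": True,  "digits": False},   # e.g. a business name
    {"bold": False, "digits": False},   # e.g. filler text
    {"bold": False, "digits": True},    # e.g. a phone number
]

def n_bold_name(y):
    # True groundings of the content clause: Bold(n) <=> Label(n, name)
    return sum(1 for node, lab in zip(NODES, y)
               if node["bold"] == (lab == "name"))

def n_digits_phone(y):
    # True groundings of the content clause: Digits(n) <=> Label(n, phone)
    return sum(1 for node, lab in zip(NODES, y)
               if node["digits"] == (lab == "phone"))

def n_name_before_phone(y):
    # True groundings of the precedence clause, over ordered pairs i != j:
    # Label(i, name) ^ Label(j, phone) => i < j
    return sum(1 for i, a in enumerate(y) for j, b in enumerate(y)
               if i != j and (not (a == "name" and b == "phone") or i < j))

# (weight, grounding counter) pairs; the weights are made up for illustration.
CLAUSES = [(1.0, n_bold_name), (2.0, n_digits_phone), (1.5, n_name_before_phone)]

def score(y):
    # Unnormalized log-probability of labeling y: sum_i w_i * n_i(y).
    return sum(w * n(y) for w, n in CLAUSES)

def map_labeling():
    # Exhaustive MAP search; only feasible for a toy page like this one.
    return max(product(LABELS, repeat=len(NODES)), key=score)

if __name__ == "__main__":
    best = map_labeling()
    print(best, score(best))
```

Exhaustive search grows as the number of labels raised to the number of nodes, which is why real pages call for specialized inference of the kind the abstract refers to.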


      Published In

      KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
      August 2011
      1446 pages
      ISBN:9781450308137
      DOI:10.1145/2020408
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Markov logic networks
      2. information extraction
      3. machine learned models
      4. probabilistic models

      Qualifiers

      • Poster

      Conference

      KDD '11
      Acceptance Rates

      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

      Cited By

      • (2019) Synthesis and machine learning for heterogeneous extraction. Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 301-315. DOI: 10.1145/3314221.3322485. Online publication date: 8-Jun-2019
      • (2018) News events prediction using Markov logic networks. Journal of Information Science, 44:1, pp. 91-109. DOI: 10.1177/0165551516673285. Online publication date: 1-Feb-2018
      • (2018) Extraction of Data from Mass Media Web Sites. Programming and Computing Software, 44:5, pp. 344-352. DOI: 10.1134/S0361768818050092. Online publication date: 1-Sep-2018
      • (2018) Cost-effective conceptual design using taxonomies. The VLDB Journal - The International Journal on Very Large Data Bases, 27:3, pp. 369-394. DOI: 10.1007/s00778-018-0501-1. Online publication date: 1-Jun-2018
      • (2017) Cost-Effective Conceptual Design Over Taxonomies. Proceedings of the 20th International Workshop on the Web and Databases, pp. 35-40. DOI: 10.1145/3068839.3068841. Online publication date: 14-May-2017
      • (2017) Scaling Up Markov Logic Probabilistic Inference for Social Graphs. IEEE Transactions on Knowledge and Data Engineering, 29:2, pp. 433-445. DOI: 10.1109/TKDE.2016.2625251. Online publication date: 1-Feb-2017
      • (2016) A survey of methods for the extraction of information from Web resources. Programming and Computing Software, 42:5, pp. 279-291. DOI: 10.1134/S0361768816050078. Online publication date: 1-Sep-2016
      • (2014) Database principles in information extraction. Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 156-163. DOI: 10.1145/2594538.2594563. Online publication date: 18-Jun-2014
      • (2013) Heuristic Approach to Automatic Wrapper Generation for Social Media Websites. New Trends in Databases and Information Systems, pp. 273-284. DOI: 10.1007/978-3-642-32518-2_26. Online publication date: 2013
      • (2012) Markov logic networks for situated incremental natural language understanding. Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 314-323. DOI: 10.5555/2392800.2392854. Online publication date: 5-Jul-2012