Information Systems

Volume 23, Issue 8, December 1998, Pages 521-538
Generating finite-state transducers for semi-structured data extraction from the Web

https://doi.org/10.1016/S0306-4379(98)00027-1

Abstract

Integrating a large number of Web information sources may significantly increase the utility of the World-Wide Web. A promising approach to this integration is a Web information mediator that provides seamless, transparent access for clients. Information mediators need wrappers to access a Web source as a structured database, but building wrappers by hand is impractical. Previous work on wrapper induction is too restrictive to handle the many Web pages that contain tuples with missing attributes, multiple values, variant attribute permutations, exceptions, and typos. This paper presents SoftMealy, a novel wrapper representation formalism based on a finite-state transducer (FST) and contextual rules. This approach can wrap a wide range of semistructured Web pages because FSTs can encode each different attribute permutation as a path. A SoftMealy wrapper can be induced from a handful of labeled examples using our generalization algorithm. We have implemented this approach in a prototype system and tested it on real Web pages. The performance statistics show that the sizes of the induced wrappers, as well as the required training effort, are linear with regard to the structural variance of the test pages. Our experiment also shows that the induced wrappers can generalize over unseen pages.
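
To make the idea of "each attribute permutation as a path" concrete, here is a minimal sketch of a toy transducer. It is not the SoftMealy formalism or its induction algorithm: the token format, the `TRANSITIONS` table, the dummy state `"_"`, and the `extract` function are all hypothetical, and the transition rules are written by hand here, whereas the paper induces its contextual rules from labeled examples.

```python
# Toy finite-state transducer for tuple extraction (illustrative assumption,
# not the paper's implementation).
# States are attribute names plus a dummy state "_" that skips separators.
# Each (state, token) pair below stands in for a contextual rule that
# triggers a transition; any other token is emitted into the current attribute.
TRANSITIONS = {
    ("_", "<b>"): "name",
    ("name", "</b>"): "_",
    ("_", "<i>"): "email",
    ("email", "</i>"): "_",
}


def extract(tokens):
    """Walk the token sequence and collect the text emitted in each state."""
    state = "_"
    collected = {}
    for tok in tokens:
        nxt = TRANSITIONS.get((state, tok))
        if nxt is not None:
            state = nxt  # a rule fired: follow the edge, emit nothing
        elif state != "_":
            collected.setdefault(state, []).append(tok)
    return {attr: " ".join(parts) for attr, parts in collected.items()}


# Three hypothetical records: two attribute orders and one missing attribute.
name_first = ["<b>", "Alice", "</b>", "<i>", "a@x.org", "</i>"]
email_first = ["<i>", "b@y.org", "</i>", "<b>", "Bob", "</b>"]
name_only = ["<b>", "Carol", "</b>"]

for record in (name_first, email_first, name_only):
    print(extract(record))
```

Each record follows a different path through the same small automaton, so a single transition table covers both attribute orders and the record with a missing attribute.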



Recommended by Gottfried Vossen.
