Wrapper verification

World Wide Web

Abstract

Many Internet information-management applications (e.g., information integration systems) require a library of wrappers, specialized information extraction procedures that translate a source's native format into a structured representation suitable for further application-specific processing. Maintaining wrappers is tedious and error-prone, because the formatting regularities on which wrappers rely change frequently on the decentralized and dynamic Internet. The wrapper verification problem is to determine whether a wrapper is operating correctly. Standard regression testing approaches are inappropriate, because both the formatting regularities on which wrappers rely and the source's underlying content may change. We introduce RAPTURE, a fully implemented, domain-independent wrapper verification algorithm. RAPTURE computes a probabilistic similarity measure between a wrapper's expected and observed output, where similarity is defined in terms of simple numeric features (e.g., the length, or the fraction of punctuation characters) of the extracted strings. Experiments with numerous actual Internet sources demonstrate that RAPTURE performs substantially better than standard regression testing.
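
To make the feature-based comparison concrete, the following is a minimal sketch and not RAPTURE itself: the abstract does not specify the probabilistic model, so the normal-distribution test on feature means used here is an assumption, as are the function names `features`, `fit`, and `similarity` and the sample price strings. It only illustrates how expected and observed output might be compared through the string length and the fraction of punctuation characters.

```python
import math
import string
from statistics import mean, stdev


def features(text):
    """Numeric features of one extracted string (two of the kinds the abstract names)."""
    punct = sum(ch in string.punctuation for ch in text)
    return {
        "length": float(len(text)),
        "punct_fraction": punct / len(text) if text else 0.0,
    }


def fit(expected):
    """Estimate each feature's mean and standard deviation from known-good output."""
    rows = [features(s) for s in expected]
    model = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        # Floor the deviation so a constant feature cannot cause division by zero.
        model[name] = (mean(values), max(stdev(values), 1e-6))
    return model


def similarity(model, observed):
    """Score in [0, 1]: product of two-sided normal tail probabilities.

    Assumed model (not taken from the paper): each feature's observed mean is
    compared with the expected mean, scaled by the standard error.
    """
    rows = [features(s) for s in observed]
    score = 1.0
    for name, (mu, sigma) in model.items():
        observed_mean = mean(row[name] for row in rows)
        standard_error = sigma / math.sqrt(len(rows))
        z = abs(observed_mean - mu) / standard_error
        score *= math.erfc(z / math.sqrt(2))
    return score


# Hypothetical data: a wrapper that is expected to extract prices.
expected = ["$12.99", "$7.50", "$103.00", "$45.25"]
model = fit(expected)

# The content changed but the wrapper still extracts prices.
still_working = similarity(model, ["$9.99", "$15.00", "$88.10"])

# The page layout changed and the wrapper now extracts markup fragments.
broken = similarity(
    model,
    ["<b>Price</b>: see details below", "<td>Click here for offers</td>"],
)

print(still_working > broken)
```

Under these assumptions, a change in content alone leaves the score comparatively high, while a change in formatting that makes the wrapper extract the wrong strings drives it toward zero. A verifier built this way would flag the wrapper when the score falls below a threshold, which would have to be chosen empirically.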

Cite this article

Kushmerick, N. Wrapper verification. World Wide Web 3, 79–94 (2000). https://doi.org/10.1023/A:1019229612909

  • DOI: https://doi.org/10.1023/A:1019229612909
