Wrapper verification

Kushmerick, Nicholas

doi:10.1023/A:1019229612909

Published: October 2000

World Wide Web, Volume 3, pages 79–94 (2000)

Abstract

Many Internet information-management applications (e.g., information integration systems) require a library of wrappers, specialized information extraction procedures that translate a source's native format into a structured representation suitable for further application-specific processing. Maintaining wrappers is tedious and error-prone, because the formatting regularities on which wrappers rely change frequently on the decentralized and dynamic Internet. The wrapper verification problem is to determine whether a wrapper is operating correctly. Standard regression testing approaches are inappropriate, because both the formatting regularities on which wrappers rely and the source's underlying content may change. We introduce RAPTURE, a fully implemented, domain-independent wrapper verification algorithm. RAPTURE computes a probabilistic similarity measure between a wrapper's expected and observed output, where similarity is defined in terms of simple numeric features (e.g., the length, or the fraction of punctuation characters) of the extracted strings. Experiments with numerous actual Internet sources demonstrate that RAPTURE performs substantially better than standard regression testing.
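The feature-based comparison described in the abstract can be illustrated with a minimal sketch. It is not RAPTURE itself: the abstract does not give the algorithm's details, so the Gaussian model, the independence assumption across features, and the names `FEATURES`, `fit`, and `similarity` are all assumptions made for illustration. Only the two features (string length and fraction of punctuation characters) come from the abstract.

```python
import math
import string
from statistics import mean, stdev

# The two features named in the abstract. The dict and its keys are hypothetical.
FEATURES = {
    "length": lambda s: float(len(s)),
    "punct_fraction": lambda s: (
        sum(c in string.punctuation for c in s) / len(s) if s else 0.0
    ),
}


def fit(expected):
    """Estimate each feature's mean and standard deviation on known-good output.

    Needs at least two strings, because stdev() requires two data points.
    """
    model = {}
    for name, feature in FEATURES.items():
        values = [feature(s) for s in expected]
        # Guard against zero variance so the z-score below stays finite.
        model[name] = (mean(values), max(stdev(values), 1e-6))
    return model


def similarity(model, observed):
    """Score in [0, 1]: product of two-sided Gaussian tail probabilities.

    Assumption (not from the paper): features are independent and the mean
    feature value of the observed strings is approximately normal.
    """
    n = len(observed)
    score = 1.0
    for name, feature in FEATURES.items():
        mu, sigma = model[name]
        observed_mean = mean(feature(s) for s in observed)
        z = abs(observed_mean - mu) / (sigma / math.sqrt(n))
        score *= math.erfc(z / math.sqrt(2))  # P(|Z| >= z) for standard normal Z
    return score


# Made-up example data: prices extracted while the wrapper was known to work.
model = fit(["$12.99", "$8.50", "$101.00", "$7.25"])
still_ok = similarity(model, ["$9.99", "$14.50", "$20.00"])
broken = similarity(model, ["<td align=right><b>", "<a href='x.html'>Details</a>"])
print(still_ok > broken)
```

In this sketch the source's content changes (different prices) but the feature values stay close to what was seen before, so the score stays high. When the wrapper starts extracting markup instead of prices, both features shift sharply and the score collapses. An exact-match regression test would flag both cases as failures.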

About this article

Cite this article

Kushmerick, N. Wrapper verification. World Wide Web 3, 79–94 (2000). https://doi.org/10.1023/A:1019229612909
