Abstract
Many Internet information-management applications (e.g., information integration systems) require a library of wrappers, specialized information extraction procedures that translate a source's native format into a structured representation suitable for further application-specific processing. Maintaining wrappers is tedious and error-prone, because the formatting regularities on which wrappers rely change frequently on the decentralized and dynamic Internet. The wrapper verification problem is to determine whether a wrapper is operating correctly. Standard regression testing approaches are inappropriate, because both the formatting regularities on which wrappers rely and the source's underlying content may change. We introduce RAPTURE, a fully implemented, domain-independent wrapper verification algorithm. RAPTURE computes a probabilistic similarity measure between a wrapper's expected and observed output, where similarity is defined in terms of simple numeric features (e.g., the length, or the fraction of punctuation characters) of the extracted strings. Experiments with numerous actual Internet sources demonstrate that RAPTURE performs substantially better than standard regression testing.
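The feature-based comparison described in the abstract can be pictured with a small sketch. This is not the RAPTURE algorithm itself, whose details are given in the full paper. It only illustrates the general idea under loudly stated assumptions: each feature is treated as normally distributed, features are treated as independent, and the function names (`features`, `fit`, `similarity`) and the sample strings are hypothetical.

```python
import math
import string
from statistics import mean, stdev


def features(s):
    """Two simple numeric features of an extracted string (from the abstract)."""
    punct = sum(ch in string.punctuation for ch in s)
    return {
        "length": float(len(s)),
        "punct_fraction": punct / len(s) if s else 0.0,
    }


def fit(expected):
    """Estimate (mean, std dev) per feature from known-good wrapper output."""
    rows = [features(s) for s in expected]
    model = {}
    for name in rows[0]:
        values = [r[name] for r in rows]
        sigma = stdev(values) if len(values) > 1 else 0.0
        # Assumption: a small floor keeps the later division well defined.
        model[name] = (mean(values), max(sigma, 1e-3))
    return model


def similarity(model, observed):
    """Score in [0, 1]: how plausible the observed output is under the model.

    Assumption: for each feature, take the two-sided normal tail probability
    of the observed average, then multiply across features as if independent.
    """
    rows = [features(s) for s in observed]
    score = 1.0
    for name, (mu, sigma) in model.items():
        avg = mean(r[name] for r in rows)
        std_err = sigma / math.sqrt(len(rows))
        z = abs(avg - mu) / std_err
        score *= math.erfc(z / math.sqrt(2))
    return score


# Hypothetical data: prices extracted while the wrapper was known to work.
model = fit(["$12.99", "$8.50", "$101.00", "$7.25"])

# New content, same format: the score stays high.
still_ok = similarity(model, ["$15.49", "$9.99", "$23.00"])

# The page layout changed and the wrapper now extracts markup: the score collapses.
broken = similarity(model, ['<td align="right"><b>Price</b></td>',
                            '<td align="right"><i>n/a</i></td>'])
```

Because the comparison uses features of the strings and not the strings themselves, new prices do not trigger a failure, whereas a formatting change that makes the wrapper extract markup does. That is the property exact-match regression testing lacks when the source's content changes. A verifier built on this sketch would flag the wrapper when the score falls below some chosen cutoff; any such cutoff is an assumption here, not a value taken from the paper.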
Cite this article
Kushmerick, N. Wrapper verification. World Wide Web 3, 79–94 (2000). https://doi.org/10.1023/A:1019229612909
DOI: https://doi.org/10.1023/A:1019229612909