Schema-based Web wrapping

Fazzinga, Bettina; Flesca, Sergio; Tagarelli, Andrea

doi:10.1007/s10115-009-0275-2

Regular Paper
Published: 08 December 2009

Volume 26, pages 127–173, (2011)
Knowledge and Information Systems

Abstract

Wrappers are an effective way to automate information extraction from Web pages. A wrapper associates a Web page with an XML document that represents part of the information in that page in a machine-readable format. Most existing wrapping approaches have focused on how to generate extraction rules, while ignoring the potential benefits of using the schema of the extracted information during wrapper evaluation. In this paper, we investigate how the schema of the extracted information can be used effectively in both the design and the evaluation of a Web wrapper. We define a clean declarative semantics for schema-based wrappers by introducing the notion of (preferred) extraction model, which is essential for computing a valid XML document containing the information extracted from a Web page. We developed the SCRAP (SChema-based wRAPper for web data) system to implement the proposed schema-based wrapping approach; the system also provides visual support tools for the wrapper designer. Moreover, we present a wrapper generalization framework that speeds up the design of schema-based wrappers. Experimental evaluation has shown that SCRAP wrappers not only successfully extract the required data but are also robust to changes that may occur in the source Web pages.
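The general idea of a wrapper whose output is checked against a schema can be illustrated with a minimal Python sketch. This is not the SCRAP system and does not implement the paper's extraction-model semantics; the toy schema `SCHEMA`, the helper names `CellCollector`, `wrap` and `is_valid`, and the sample page are all hypothetical, chosen only to show a page being mapped to an XML document and that document being tested for validity.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical toy schema (an assumption for this sketch):
# every <book> must contain exactly these children, in this order.
SCHEMA = {"book": ["title", "price"]}


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def wrap(html):
    """Map an HTML page to an XML document (the wrapper's output)."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    root = ET.Element("books")
    for cells in parser.rows:
        book = ET.SubElement(root, "book")
        # Cells are assigned to schema fields positionally.
        for name, text in zip(SCHEMA["book"], cells):
            ET.SubElement(book, name).text = text.strip()
    return root


def is_valid(root):
    """Check that every <book> has exactly the children the toy schema requires."""
    return all(
        [child.tag for child in book] == SCHEMA["book"]
        for book in root.findall("book")
    )


# Hypothetical sample page: the second row lacks a price cell.
PAGE = "<table><tr><td>Dune</td><td>9.99</td></tr><tr><td>Emma</td></tr></table>"
doc = wrap(PAGE)
print(ET.tostring(doc, encoding="unicode"))
print(is_valid(doc))
```

In this sketch the second row yields a `<book>` without a `<price>`, so the validity check fails; a schema-aware wrapper can use that kind of signal to reject or repair an extraction instead of emitting an incomplete document.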

Author information

Authors and Affiliations

Department of Electronics, Computer and Systems Sciences, University of Calabria, 87036, Arcavacata di Rende, CS, Italy
Bettina Fazzinga, Sergio Flesca & Andrea Tagarelli

Corresponding author

Correspondence to Andrea Tagarelli.

About this article

Cite this article

Fazzinga, B., Flesca, S. & Tagarelli, A. Schema-based Web wrapping. Knowl Inf Syst 26, 127–173 (2011). https://doi.org/10.1007/s10115-009-0275-2

Received: 06 February 2008
Revised: 16 October 2009
Accepted: 17 October 2009
Published: 08 December 2009
Issue Date: January 2011
DOI: https://doi.org/10.1007/s10115-009-0275-2
