Abstract:
We propose a novel schema-guided approach to wrapper maintenance, called SG-WRAM. SG-WRAP can generate a wrapper that extracts data from an HTML document and produces an XML document conforming to a user-defined schema. Maintenance proceeds in four sequential steps. First, syntactic features, data patterns, and notation are obtained from the schema, the previous extraction rule, and the previously extracted results. Second, these features are used to recognize the data items. Third, the recognized items are grouped according to the given schema, so that each group is an instance of that schema. Finally, representative instances are selected to re-induce the extraction rule. We call these four steps features discovery, item recovery, block configuration, and wrapper reparation, respectively. The system to be demonstrated is implemented in Java. We also discuss the major algorithms used in SG-WRAM.
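The four steps can be illustrated with a small Java sketch. This is not the SG-WRAM implementation: the class `MaintenanceSketch`, its method names, the two-field schema (`title`, `price`), and the regular expressions are all hypothetical, chosen only to show how data flows from one step to the next. In particular, features are hard-coded here rather than derived from the schema, the previous rule, and the extracted results, and the final step only selects complete instances; it does not re-induce a rule.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class MaintenanceSketch {

    /** A recognized data item: the schema field it matched and its text. */
    public static final class Item {
        public final String field;
        public final String text;
        public Item(String field, String text) {
            this.field = field;
            this.text = text;
        }
    }

    // Step 1 (features discovery), heavily simplified: one hand-written data
    // pattern per schema field. Assumption: a real system would derive these.
    public static Map<String, Pattern> discoverFeatures() {
        Map<String, Pattern> features = new LinkedHashMap<>();
        features.put("title", Pattern.compile("[A-Z][A-Za-z ]+"));
        features.put("price", Pattern.compile("\\$\\d+(\\.\\d{2})?"));
        return features;
    }

    // Step 2 (item recovery): label each text fragment with the first field
    // whose pattern matches it entirely; unmatched fragments are dropped.
    public static List<Item> recoverItems(List<String> texts, Map<String, Pattern> features) {
        List<Item> items = new ArrayList<>();
        for (String text : texts) {
            for (Map.Entry<String, Pattern> f : features.entrySet()) {
                if (f.getValue().matcher(text).matches()) {
                    items.add(new Item(f.getKey(), text));
                    break;
                }
            }
        }
        return items;
    }

    // Step 3 (block configuration): group items into schema instances. A new
    // instance starts whenever a field repeats within the current one.
    public static List<Map<String, String>> configureBlocks(List<Item> items) {
        List<Map<String, String>> instances = new ArrayList<>();
        Map<String, String> current = new LinkedHashMap<>();
        for (Item item : items) {
            if (current.containsKey(item.field)) {
                instances.add(current);
                current = new LinkedHashMap<>();
            }
            current.put(item.field, item.text);
        }
        if (!current.isEmpty()) {
            instances.add(current);
        }
        return instances;
    }

    // Step 4 (wrapper reparation), selection part only: keep instances that
    // fill every schema field. Rule re-induction itself is not sketched.
    public static List<Map<String, String>> selectRepresentatives(
            List<Map<String, String>> instances, int fieldCount) {
        List<Map<String, String>> complete = new ArrayList<>();
        for (Map<String, String> instance : instances) {
            if (instance.size() == fieldCount) {
                complete.add(instance);
            }
        }
        return complete;
    }

    public static void main(String[] args) {
        // Hypothetical text fragments taken from a changed page.
        List<String> texts = List.of("Database Systems", "$59.99", "next page >>",
                "Web Mining", "$42.00", "Orphan Title");
        Map<String, Pattern> features = discoverFeatures();
        List<Item> items = recoverItems(texts, features);
        List<Map<String, String>> instances = configureBlocks(items);
        System.out.println(selectRepresentatives(instances, features.size()));
    }
}
```

In this toy run the navigation text matches no pattern and is discarded, the trailing title forms an incomplete instance, and only the two complete title and price instances are kept as candidates for re-induction.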
Date of Conference: 05-08 March 2003
Date Added to IEEE Xplore: 21 January 2004
Print ISBN: 0-7803-7665-X