Abstract:
Web data extraction is an essential task for web data integration. Most research focuses on extracting data from list pages by detecting the data-rich section and segmenting record boundaries. Detail pages, however, each contain all-inclusive product information, so the number of data attributes that need to be aligned is much larger. In this paper, we formulate the data extraction problem as the alignment of leaf nodes from DOM trees and propose AFIS, Annotation-Free Induction of Full Schema for detail pages. AFIS applies Divide-and-Conquer and Longest Increasing Sequence (LIS) algorithms to mine landmarks from the input. The experiments show that AFIS outperforms RoadRunner, FivaTech and TEX (F1 0.990) in terms of selected data. For full schema evaluation (all data), AFIS also achieves the highest average performance (F1 0.937) compared with TEX and RoadRunner.
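To illustrate how an LIS computation can support leaf-node alignment, the following is a minimal sketch and not the AFIS algorithm itself. It assumes each page has already been flattened into a list of leaf-node text strings, and it treats text shared by two pages as landmark candidates, keeping the largest subset that appears in the same order on both pages. The function names `longest_increasing_subsequence` and `order_preserving_matches`, and the sample leaf lists, are hypothetical.

```python
from bisect import bisect_left


def longest_increasing_subsequence(values):
    """Return indices of one longest strictly increasing subsequence."""
    tail_idx = []    # tail_idx[k]: index of the smallest tail of a run of length k + 1
    tail_val = []    # tail_val[k]: the value at tail_idx[k], kept sorted for bisect
    prev = [-1] * len(values)  # back-pointers used to rebuild the subsequence
    for i, v in enumerate(values):
        k = bisect_left(tail_val, v)
        if k == len(tail_val):
            tail_idx.append(i)
            tail_val.append(v)
        else:
            tail_idx[k] = i
            tail_val[k] = v
        prev[i] = tail_idx[k - 1] if k > 0 else -1
    # Walk the back-pointers from the tail of the longest run.
    result = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:
        result.append(i)
        i = prev[i]
    return result[::-1]


def order_preserving_matches(leaves_a, leaves_b):
    """Keep shared leaf texts that occur in the same relative order on both pages.

    Assumption (not from the paper): a text is matched to its first
    occurrence in leaves_b, which is only adequate for a toy example.
    """
    first_pos_b = {}
    for j, text in enumerate(leaves_b):
        first_pos_b.setdefault(text, j)
    shared = [(i, first_pos_b[t]) for i, t in enumerate(leaves_a) if t in first_pos_b]
    keep = longest_increasing_subsequence([j for _, j in shared])
    return [leaves_a[shared[k][0]] for k in keep]


# Hypothetical leaf sequences from two product detail pages.
page_a = ["Brand", "Acme", "Price", "$10", "Weight", "1kg"]
page_b = ["Brand", "Zed", "Price", "$12", "Color", "Red", "Weight", "2kg"]
landmarks = order_preserving_matches(page_a, page_b)
print(landmarks)
```

In this toy input the shared labels already appear in the same order, so all three are kept as landmark candidates; when two pages list attributes in a different order, the LIS step drops the out-of-order matches.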
Date of Conference: 25-27 November 2016
Date Added to IEEE Xplore: 20 March 2017
ISBN Information:
Electronic ISSN: 2376-6824