Abstract:
Web pages carry semantics that matter to both their authors and their readers. For various reasons, authors may also insert content that is irrelevant to the underlying semantics at different positions on the page, such as advertisements, guide bars, and links. As a result, modeling a page's semantics from all of its data may not give good results. In this paper, we propose a framework that extracts the real semantic content from web pages via repeated structures in the HTML data. Our algorithm first detects the real semantic blocks in web pages via repeated structure segmentation, then extracts the real semantic content of the pages from those blocks.
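
To make the two-step idea concrete, the following is a minimal Python sketch and not the paper's algorithm, whose details the abstract does not give. It assumes that a "repeated structure" is a set of sibling elements with identical tag-structure signatures, and that the "real semantic block" is the repeated group carrying the most text. Both assumptions, and all names used (`TreeBuilder`, `signature`, `repeated_groups`, `extract_content`), are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


class Node:
    """Minimal DOM node holding child nodes and text in document order."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.items = []  # mix of Node objects and text strings

    @property
    def children(self):
        return [i for i in self.items if isinstance(i, Node)]


class TreeBuilder(HTMLParser):
    """Builds a simple tree; assumes reasonably well-formed HTML."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.items.append(node)
        if tag not in self.VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Only close the element that is currently open; ignore stray end tags.
        if self.current.parent is not None and self.current.tag == tag:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():
            self.current.items.append(data.strip())


def signature(node):
    """Structural signature: the tag plus the signatures of its child elements."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def full_text(node):
    """Concatenate all text below a node, in document order."""
    parts = (full_text(i) if isinstance(i, Node) else i for i in node.items)
    return " ".join(p for p in parts if p)


def repeated_groups(node, min_repeat=2):
    """Step 1 (assumed): yield groups of siblings sharing one signature."""
    by_sig = defaultdict(list)
    for child in node.children:
        by_sig[signature(child)].append(child)
    for group in by_sig.values():
        if len(group) >= min_repeat:
            yield group
    for child in node.children:
        yield from repeated_groups(child, min_repeat)


def extract_content(html, min_repeat=2):
    """Step 2 (assumed heuristic): return the text of the text-heaviest group."""
    builder = TreeBuilder()
    builder.feed(html)
    groups = list(repeated_groups(builder.root, min_repeat))
    if not groups:
        return []
    best = max(groups, key=lambda g: sum(len(full_text(n)) for n in g))
    return [full_text(n) for n in best]
```

Under these assumptions, a page with a short repeated list of navigation links and a run of long repeated paragraphs would return the paragraphs, since that group carries more text. A real system would need a more tolerant notion of structural similarity than exact signature equality.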
Date of Conference: 15-19 July 2013
Date Added to IEEE Xplore: 03 October 2013
Electronic ISBN: 978-1-4799-1604-7