Extracting the semantic content of web pages via repeated structures


Abstract:

Web pages may carry semantics that matter to both their authors and their readers. For various reasons, authors also insert content that is irrelevant to a page's underlying semantics at different positions on the page, such as advertisements, navigation bars, and links. As a result, modeling a page's semantics from all of its data may give poor results. In this paper, we propose a framework that extracts the real semantic content from web pages via repeated structures in the HTML data. Our algorithm first detects the real semantic blocks in web pages via repeated structure segmentation, then extracts the real semantic content of the pages from those blocks.
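
The abstract does not spell out the segmentation procedure, so the following is only a minimal sketch of the general two-step idea (find repeated structures, then pull text from the chosen block), not the authors' algorithm. Everything in it is an assumption made for illustration: the names `TreeBuilder`, `signature`, `repeated_groups` and `extract_content` are hypothetical, "repeated structure" is taken to mean sibling elements with the same tag and the same child tags, and the "real semantic block" is guessed to be the repeated group with the most text outside links.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # elements with no end tag


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.content = []  # mixed list of child Node objects and text strings

    @property
    def children(self):
        return [c for c in self.content if isinstance(c, Node)]


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree; not a full HTML5 parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.content.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.content.append(data.strip())


def signature(node):
    # ASSUMPTION: two siblings are "structurally repeated" when their tag
    # and the sequence of their direct child tags are identical.
    return (node.tag, tuple(child.tag for child in node.children))


def text_stats(node, inside_link=False):
    """Return (total_chars, chars_inside_links) for the subtree of node."""
    inside_link = inside_link or node.tag == "a"
    total = linked = 0
    for item in node.content:
        if isinstance(item, Node):
            t, l = text_stats(item, inside_link)
        else:
            t, l = len(item), (len(item) if inside_link else 0)
        total += t
        linked += l
    return total, linked


def collect_text(node):
    """Return the text fragments of the subtree in document order."""
    parts = []
    for item in node.content:
        parts.extend(collect_text(item) if isinstance(item, Node) else [item])
    return parts


def repeated_groups(node, min_repeat=2):
    """Yield groups of sibling elements that share a structural signature."""
    groups = {}
    for child in node.children:
        groups.setdefault(signature(child), []).append(child)
    for members in groups.values():
        if len(members) >= min_repeat:
            yield members
    for child in node.children:
        yield from repeated_groups(child, min_repeat)


def extract_content(html, min_repeat=2):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best, best_score = [], 0
    for group in repeated_groups(builder.root, min_repeat):
        stats = [text_stats(member) for member in group]
        # ASSUMPTION: score a group by its amount of non-link text.
        score = sum(total - linked for total, linked in stats)
        if score > best_score:
            best, best_score = group, score
    return [" ".join(collect_text(member)) for member in best]


sample = """<body>
<ul><li><a href="/">Home</a></li><li><a href="/ads">Ads</a></li></ul>
<div><p>First paragraph of the article.</p><p>Second paragraph of the article.</p></div>
</body>"""
print(extract_content(sample))
```

On this sample the repeated link-only list items score zero, so the two paragraphs are returned. A real page would need a more tolerant notion of structural similarity than exact signature equality.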
Date of Conference: 15-19 July 2013
Date Added to IEEE Xplore: 03 October 2013
Electronic ISBN: 978-1-4799-1604-7
Conference Location: San Jose, CA
