Title identification of web article pages using HTML and visual features
7 February 2011
Jian Fan, Ping Luo, Parag Joshi
Proceedings Volume 7879, Imaging and Printing in a Web 2.0 World II; 78790K (2011) https://doi.org/10.1117/12.876708
Event: IS&T/SPIE Electronic Imaging, 2011, San Francisco Airport, California, United States
Abstract
Extracting informative content from Web article pages has many applications, such as printing and content reuse. The title is a significant and unique component of an article. However, identifying the true title is not easy, even for human readers. In this paper, we present a title identification method that takes into account several features, including the title field of the HTML page and the HTML tag of a DOM node, as well as font size and horizontal alignment. We tested our method on a ground-truth data set consisting of 1993 pages from 98 web sites and achieved 97.5% accuracy, about 20% above a baseline method based only on font size.
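The abstract names four features (the HTML title field, the DOM node's tag, font size, and horizontal alignment) and a font-size-only baseline, but does not say how they are combined. The following is a minimal illustrative sketch, not the authors' method: the `Candidate` fields, the tag weights, the linear weighting, and the reading of "horizontal alignment" as closeness to the page center are all assumptions made for this example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Candidate:
    text: str             # visible text of the DOM node
    tag: str              # HTML tag name, e.g. "h1"
    font_size: float      # rendered font size in pixels
    center_offset: float  # |node center - page center| / page width, in [0, 0.5]


# Hypothetical tag weights; the abstract does not give actual values.
TAG_WEIGHT = {"h1": 1.0, "h2": 0.7, "h3": 0.4}


def score(c, page_title, max_font):
    """Combine the four features with assumed (not published) weights."""
    # Similarity between the node text and the HTML <title> field.
    title_sim = SequenceMatcher(None, c.text.lower(), page_title.lower()).ratio()
    tag_score = TAG_WEIGHT.get(c.tag, 0.0)
    font_score = c.font_size / max_font
    # Assumption: nodes closer to the horizontal center score higher.
    align_score = 1.0 - 2.0 * c.center_offset
    return 0.4 * title_sim + 0.2 * tag_score + 0.3 * font_score + 0.1 * align_score


def pick_title(candidates, page_title):
    """Return the candidate with the highest combined score."""
    max_font = max(c.font_size for c in candidates)
    return max(candidates, key=lambda c: score(c, page_title, max_font))


def pick_by_font_size(candidates):
    """Baseline: the candidate with the largest font size."""
    return max(candidates, key=lambda c: c.font_size)


# Made-up page: a large site banner and a smaller article headline.
page_title = "Storm hits coast - Daily News"
candidates = [
    Candidate("Daily News", "div", 36.0, 0.0),
    Candidate("Storm hits coast", "h1", 28.0, 0.1),
    Candidate("Posted by staff", "span", 12.0, 0.3),
]

print(pick_by_font_size(candidates).text)
print(pick_title(candidates, page_title).text)
```

On this made-up page the baseline selects the site banner because it has the largest font, while the combined score prefers the headline, which matches the HTML title field more closely and sits in an `h1` tag. This is the kind of case where font size alone can mislead.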
© (2011) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Jian Fan, Ping Luo, and Parag Joshi "Title identification of web article pages using HTML and visual features", Proc. SPIE 7879, Imaging and Printing in a Web 2.0 World II, 78790K (7 February 2011); https://doi.org/10.1117/12.876708
CITATIONS
Cited by 8 scholarly publications.
KEYWORDS
Visualization
Information visualization
Printing
Neptunium
Feature extraction
Machine learning
Roads