Title identification of web article pages using HTML and visual features

Jian Fan; Ping Luo; Parag Joshi

doi:10.1117/12.876708

7 February 2011

Proceedings Volume 7879, Imaging and Printing in a Web 2.0 World II; 78790K (2011) https://doi.org/10.1117/12.876708
Event: IS&T/SPIE Electronic Imaging, 2011, San Francisco Airport, California, United States

Abstract

Extracting informative content from Web article pages has many applications, such as printing and content reuse. The title is a significant and unique component of an article. However, identifying the true title is not easy, even for human readers. In this paper, we present a title identification method that takes into account several features, including the title field of the HTML page and the HTML tag of a DOM node, as well as font size and horizontal alignment. We tested our method on a ground truth data set consisting of 1993 pages from 98 web sites and achieved 97.5% accuracy, about 20% above a baseline method based only on font size.
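The abstract names four features but does not say how they are combined. As a rough illustration only, the sketch below scores each candidate DOM node with a weighted sum of those features and picks the highest scorer. Everything specific here is an assumption and not the authors' method: the `Candidate` record, the `TAG_SCORE` table, the weights, the use of string similarity against the HTML title field, and the reading of "horizontal alignment" as closeness to the page center.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Candidate:
    """Hypothetical record for one rendered DOM text node."""
    text: str
    tag: str          # HTML tag name, e.g. "h1" or "div"
    font_size: float  # rendered font size in pixels
    center_x: float   # horizontal center of the node's box in pixels


# Assumed prior for heading tags; values are illustrative, not from the paper.
TAG_SCORE = {"h1": 1.0, "h2": 0.7, "h3": 0.4}


def score(c: Candidate, title_field: str, max_font: float, page_width: float) -> float:
    """Weighted sum of the four features named in the abstract (weights assumed)."""
    # Similarity between the node text and the page's <title> field.
    sim = SequenceMatcher(None, c.text.lower(), title_field.lower()).ratio()
    tag = TAG_SCORE.get(c.tag, 0.0)
    # Font size relative to the largest candidate on the page.
    font = c.font_size / max_font
    # Assumed alignment feature: 1.0 at the page center, 0.0 at either edge.
    half = page_width / 2
    align = max(0.0, 1.0 - abs(c.center_x - half) / half)
    return 0.4 * sim + 0.2 * tag + 0.3 * font + 0.1 * align


def pick_title(cands: list[Candidate], title_field: str, page_width: float) -> Candidate:
    """Return the highest-scoring candidate."""
    max_font = max(c.font_size for c in cands)
    return max(cands, key=lambda c: score(c, title_field, max_font, page_width))


def pick_by_font_size(cands: list[Candidate]) -> Candidate:
    """Font-size-only baseline, in the spirit of the one the abstract mentions."""
    return max(cands, key=lambda c: c.font_size)


if __name__ == "__main__":
    page = [
        Candidate("Daily News", "div", 40.0, 500.0),       # site banner
        Candidate("Storm hits coast", "h1", 28.0, 500.0),  # article headline
        Candidate("Most read", "h3", 18.0, 900.0),         # sidebar heading
    ]
    print(pick_title(page, "Storm hits coast - Daily News", 1000.0).text)
    print(pick_by_font_size(page).text)
```

In this made-up page the site banner has the largest font, so a font-size-only rule selects it, while the combined score favors the headline because it also matches the title field and uses a heading tag. That is the kind of case where extra features could help, though the example says nothing about the accuracy figures reported above.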

Citation

Jian Fan, Ping Luo, and Parag Joshi "Title identification of web article pages using HTML and visual features", Proc. SPIE 7879, Imaging and Printing in a Web 2.0 World II, 78790K (7 February 2011); https://doi.org/10.1117/12.876708
