research-article

Web article extraction for web printing: a DOM+visual based approach

Authors:
Ping Luo, HP Labs, Beijing, China
Jian Fan, HP Labs, Palo Alto, CA, USA
Sam Liu, HP Labs, Palo Alto, CA, USA
Fen Lin, HP Labs, Beijing, China
Yuhong Xiong, HP Labs, Beijing, China
Jerry Liu, HP Labs, Palo Alto, CA, USA

DocEng '09: Proceedings of the 9th ACM symposium on Document engineering, September 2009, Pages 66–69, https://doi.org/10.1145/1600193.1600208

Published: 16 September 2009

ABSTRACT

This work studies the problem of extracting articles from Web pages for better printing. Compared with existing article-extraction approaches, Web printing poses several unique requirements: 1) Identifying only the boundary surrounding the text body is not sufficient; it is highly desirable to also filter out uninformative links and advertisements within this boundary. 2) Paragraphs, which may not be readily separated as DOM nodes, must be identified so that the article can be laid out well. 3) Performance should be independent of content domain, written language, and Web page template. Toward these goals, we propose a novel article-extraction method that uses both DOM (Document Object Model) and visual features. Its main components are: 1) a text segment/paragraph identification algorithm based on line-breaking features; 2) a global optimization method, Maximum Scoring Subsequence, applied to text segments to identify the boundary of the article body; and 3) an outlier elimination step based on the left or right alignment of text segments with the article body. Our experiments show that the proposed method is effective in terms of precision and recall at the level of text segments.
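The abstract does not say how text segments are scored or how alignment is measured, so the following is only a minimal sketch of components 2 and 3 under stated assumptions. The `Segment` fields, the per-segment `score` values, and the pixel tolerance `tol` are all hypothetical. The boundary search is a simple single-best contiguous run (Kadane-style), not necessarily the exact variant used in the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    text: str
    score: float  # hypothetical: positive means "looks like article text"
    left: int     # hypothetical: left edge of the rendered box, in pixels
    right: int    # hypothetical: right edge of the rendered box, in pixels


def max_scoring_subsequence(scores: List[float]) -> Tuple[int, int]:
    """Return the half-open range (start, end) of the contiguous run with
    the largest total score, or (0, 0) if no run has a positive total."""
    best_sum, best = 0.0, (0, 0)
    run_sum, run_start = 0.0, 0
    for i, s in enumerate(scores):
        if run_sum <= 0:
            # A non-positive prefix never helps, so restart the run here.
            run_sum, run_start = s, i
        else:
            run_sum += s
        if run_sum > best_sum:
            best_sum, best = run_sum, (run_start, i + 1)
    return best


def drop_misaligned(body: List[Segment], tol: int = 5) -> List[Segment]:
    """Keep segments whose left OR right edge is within `tol` pixels of the
    most common left/right edge in the body (assumed outlier rule)."""
    if not body:
        return []
    left = Counter(s.left for s in body).most_common(1)[0][0]
    right = Counter(s.right for s in body).most_common(1)[0][0]
    return [s for s in body
            if abs(s.left - left) <= tol or abs(s.right - right) <= tol]


def extract_article(segments: List[Segment]) -> List[Segment]:
    """Find the article boundary, then remove misaligned segments inside it."""
    start, end = max_scoring_subsequence([s.score for s in segments])
    return drop_misaligned(segments[start:end])


page = [
    Segment("Home | News", -2.0, 0, 900),
    Segment("First paragraph", 3.0, 100, 700),
    Segment("Sponsored link", -0.5, 400, 650),
    Segment("Second paragraph", 4.0, 100, 700),
    Segment("Third paragraph", 2.0, 100, 700),
    Segment("Copyright footer", -3.0, 0, 900),
]
article = [s.text for s in extract_article(page)]
```

In this toy page the boundary search keeps the run from the first to the third paragraph, which still contains the sponsored link; the alignment step then drops that link because neither of its edges matches the dominant edges of the body. This illustrates why a boundary alone is not enough for the printing use case.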

References

J. Pasternack and D. Roth. Extracting article text from the web with maximum subsequence segmentation. In Proceedings of the 18th WWW, 2009.
W. Ruzzo and M. Tompa. A linear time algorithm for finding all maximal scoring subsequences. In Proceedings of ISMB, 1999.
J. Wang, X. He, C. Wang, J. Pei, J. Bu, C. Chen, Z. Guan, and W. V. Zhang. Can we learn a template-independent wrapper for news article extraction from a single training site? In Proceedings of the 15th SIGKDD, 2009.


Published in
DocEng '09: Proceedings of the 9th ACM symposium on Document engineering
September 2009
264 pages
ISBN: 9781605585758
DOI: 10.1145/1600193
General Chair: Uwe M. Borghoff, Universität der Bundeswehr München, Germany
Program Chair: Boris Chidlovskii, Xerox Research Centre Europe, France
Copyright © 2009 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
article extraction
maximal scoring subsequence
Acceptance Rates
Overall Acceptance Rate: 178 of 537 submissions, 33%