ABSTRACT
Printing web pages is usually a thankless task: the result is often a document with poorly used pages and a bad layout. Besides the actual content, superfluous web elements such as menus and links are often present, and in a printed version they are commonly perceived as an annoyance. A solution for obtaining cleaner versions for printing is therefore to detect the parts of the page that the reader wants to consume, eliminating unnecessary elements and filtering out the "true" content of the web page. The same solution may also be used online to present cleaner versions of web pages, discarding any elements that the user wishes to avoid.
In this paper we present a novel approach to implement such filtering. The method is interactive at first: the user samples items that are to be preserved on the page, and thereafter everything that is not similar to the samples is removed. This is achieved by comparing the paths of all elements in the DOM representation of the page with the paths of the elements sampled by the user, and preserving only elements whose paths are "similar" to the samples. The introduction of a similarity measure adds an important degree of adaptability to the needs of different users and applications.
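As a minimal sketch of the path-similarity idea, the following Python fragment treats an element's path as the sequence of tag names from the root to the node and keeps an element when its path is within an edit-distance threshold of any user-sampled path. The function names, path representation, and threshold are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of DOM-path similarity filtering.
# A "path" is the sequence of tag names from the root to an element,
# e.g. ["html", "body", "div", "p"]. An element is preserved when its
# path is within an edit-distance threshold of any sampled path.

def edit_distance(a, b):
    """Levenshtein distance between two sequences of tag names."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def keep_element(path, sampled_paths, threshold=1):
    """Preserve an element whose path is 'similar' to a sampled path."""
    return any(edit_distance(path, s) <= threshold for s in sampled_paths)
```

With a sample path `["html", "body", "div", "p"]` and threshold 1, a sibling `["html", "body", "div", "span"]` is preserved, while an unrelated navigation path such as `["html", "body", "ul", "li", "a"]` is removed. Raising the threshold makes the filter more permissive, which is one way to realize the adaptability the similarity measure provides.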
This approach is quite general and may be applied to any XML tree with labeled nodes. We use HTML as a case study and present a Google Chrome extension that implements the approach, as well as a user study comparing our results with those of commercial tools.