ABSTRACT
Printing web pages is usually a thankless task: the result is often a document with poorly used pages and a bad layout. Besides the actual content, superfluous web elements such as menus and links are often present, and in a printed version they are commonly perceived as an annoyance. A solution for obtaining cleaner versions for printing is therefore to detect the parts of the page that the reader wants to consume, eliminating unnecessary elements and filtering out the "true" content of the web page. The same solution may also be used online to present cleaner versions of web pages, discarding any elements that the user wishes to avoid.
In this paper we present a novel approach to implement such filtering. The method is interactive at first: the user samples items that are to be preserved on the page, and thereafter everything that is not similar to the samples is removed. This is achieved by comparing the paths of all elements in the DOM representation of the page with the paths of the elements sampled by the user, and preserving only elements whose paths are "similar" to the samples. The introduction of a similarity measure adds an important degree of adaptability to the needs of different users and applications.
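As a minimal sketch of the path-similarity idea, the following Python fragment treats an element's path as the sequence of tag names from the root to the node and keeps an element when its path is within an edit-distance threshold of any user-sampled path. The function names, path representation, and threshold are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of DOM-path similarity filtering.
# A "path" is the sequence of tag names from the root to an element,
# e.g. ["html", "body", "div", "p"]. An element is preserved when its
# path is within an edit-distance threshold of any sampled path.

def edit_distance(a, b):
    """Levenshtein distance between two sequences of tag names."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def keep_element(path, sampled_paths, threshold=1):
    """Preserve an element whose path is 'similar' to a sampled path."""
    return any(edit_distance(path, s) <= threshold for s in sampled_paths)
```

With a sample path `["html", "body", "div", "p"]` and threshold 1, a sibling `["html", "body", "div", "span"]` is preserved, while an unrelated navigation path such as `["html", "body", "ul", "li", "a"]` is removed. Raising the threshold makes the filter more permissive, which is one way to realize the adaptability the similarity measure provides.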
This approach is quite general and may be applied to any XML tree with labeled nodes. We use HTML as a case study and present a Google Chrome extension that implements the approach, as well as a user study comparing our results with those of commercial tools.