DOI: 10.1145/2644866.2644871
research-article

Extracting web content for personalized presentation

Published: 16 September 2014

ABSTRACT

Printing web pages is usually a thankless task: the result is often a document with many badly used pages and a poor layout. Besides the actual content, superfluous web elements such as menus and links are often present, and in a printed version they are commonly perceived as an annoyance. One way to obtain cleaner versions for printing is therefore to detect the parts of the page that the reader wants to consume, eliminating unnecessary elements and keeping only the "true" content of the web page. The same solution may also be used online to present cleaner versions of web pages, discarding any elements that the user wishes to avoid.

In this paper we present a novel approach to implementing such filtering. The method is interactive at first: the user samples items that are to be preserved on the page, and thereafter everything that is not similar to the samples is removed. This is achieved by comparing the path of every element in the DOM representation of the page with the paths of the elements sampled by the user, and preserving only the elements whose path is "similar" to that of a sample. The similarity measure adds an important degree of adaptability to the needs of different users and applications.
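
The path comparison described above can be illustrated with a minimal sketch. The abstract does not state which similarity measure is used, so this example assumes a normalized edit distance over the sequence of tag names from the root to each element. The function names, the threshold value, and the toy page are all hypothetical and are not taken from the paper or its Chrome extension.

```python
from typing import List, Sequence

Path = List[str]  # tag names from the document root down to an element


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two label sequences (assumed measure, not the paper's)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x by y
        prev = curr
    return prev[-1]


def path_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Similarity in [0, 1]; 1.0 means the two paths are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def filter_paths(paths: List[Path], samples: List[Path], threshold: float) -> List[Path]:
    """Keep only the paths that are similar enough to at least one sampled path."""
    return [p for p in paths
            if any(path_similarity(p, s) >= threshold for s in samples)]


# Toy page: each entry is the root-to-element path of one element.
page = [
    ["html", "body", "div", "article", "p"],
    ["html", "body", "div", "article", "h1"],
    ["html", "body", "nav", "ul", "li", "a"],
    ["html", "body", "footer", "p"],
]
samples = [["html", "body", "div", "article", "p"]]  # element picked by the user

# Hypothetical threshold; a lower value keeps more of the page.
kept = filter_paths(page, samples, threshold=0.75)
for path in kept:
    print("/".join(path))
```

With this threshold the paragraph and heading inside the article are kept, while the navigation link and the footer paragraph are dropped. Lowering the threshold keeps more of the page, which is one way the adaptability mentioned above could be exposed to the user.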

This approach is quite general and may be applied to any XML tree that has labeled nodes. We use HTML as a case study and present a Google Chrome extension that implements the approach, as well as a user study comparing our results with those of commercial tools.

Published in

DocEng '14: Proceedings of the 2014 ACM symposium on Document engineering
September 2014
226 pages
ISBN: 9781450329491
DOI: 10.1145/2644866

Copyright © 2014 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

DocEng '14 paper acceptance rate: 15 of 41 submissions, 37%
Overall acceptance rate: 178 of 537 submissions, 33%
