research-article

DOM based content extraction via text density

Authors:
Fei Sun, Dandan Song, Lejian Liao
Beijing Institute of Technology, Beijing, China

SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, July 2011, Pages 245–254
https://doi.org/10.1145/2009916.2009952

Published: 24 July 2011

ABSTRACT

In addition to the main content, most web pages also contain navigation panels, advertisements, and copyright and disclaimer notices. This additional content, also known as noise, is typically unrelated to the main subject and may hamper the performance of web data mining, so it needs to be removed properly. In this paper, we present Content Extraction via Text Density (CETD), a fast, accurate and general method for extracting content from diverse web pages that uses DOM (Document Object Model) node text density to preserve the original structure. For this purpose, we introduce two concepts to measure the importance of nodes: Text Density and Composite Text Density. In order to extract content intact, we propose a technique called DensitySum to replace Data Smoothing. The approach was evaluated on the CleanEval benchmark and on randomly selected pages from well-known websites, covering various web domains and styles. The average F1-scores with our method were 8.79% higher than the best scores among several alternative methods.
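
The abstract names Text Density without giving its formula. As a rough illustration only, the sketch below assumes that a node's text density is the number of text characters in its subtree divided by the number of tags in that subtree (with a tag count of zero treated as one). It is not the authors' implementation; the names `Node`, `TreeBuilder`, `build_tree` and `annotate` are hypothetical, and Composite Text Density and DensitySum are not modelled here.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Text inside these elements is code, not readable content.
NON_TEXT_TAGS = {"script", "style"}


class Node:
    """Hypothetical minimal DOM node used only for this sketch."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.own_chars = 0   # characters directly inside this node
        self.chars = 0       # characters in the whole subtree
        self.tags = 0        # number of descendant tags
        self.density = 0.0   # assumed definition: chars / max(tags, 1)


class TreeBuilder(HTMLParser):
    """Builds a simple element tree from HTML using the standard library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Walk up to the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag in NON_TEXT_TAGS:
            return
        # Collapse runs of whitespace so indentation does not inflate the count.
        self.current.own_chars += len(" ".join(data.split()))


def annotate(node):
    """Fill in chars, tags and density for every node, bottom-up."""
    chars, tags = node.own_chars, 0
    for child in node.children:
        annotate(child)
        chars += child.chars
        tags += 1 + child.tags
    node.chars, node.tags = chars, tags
    node.density = chars / max(tags, 1)


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root)
    return builder.root


if __name__ == "__main__":
    page = ('<div id="nav"><a>Home</a><a>News</a><a>Blog</a></div>'
            '<div id="main"><p>A long paragraph of article text goes here.</p></div>')
    for block in build_tree(page).children:
        print(block.tag, block.chars, block.tags, block.density)
```

Under this assumed definition, a navigation block made of many short links scores low because its few characters are spread over many tags, while a paragraph-heavy block scores high.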

Index Terms

DOM based content extraction via text density
1. Information systems
  1. Information retrieval
    1. Document representation
    2. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction
2. Theory of computation
  1. Semantics and reasoning
    1. Program reasoning
      1. Abstraction

Published in
SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
July 2011
1374 pages
ISBN: 9781450307574
DOI: 10.1145/2009916
General Chairs:
Wei-Ying Ma (Microsoft Research Asia, China), Jian-Yun Nie (University of Montreal, Canada)
Program Chairs:
Ricardo Baeza-Yates (Yahoo! Research, Spain), Tat-Seng Chua (National University of Singapore), W. Bruce Croft (University of Massachusetts, Amherst, USA)
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
composite text density
content extraction
densitysum
text density
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%