research-article

Web news extraction via path ratios

Authors:

Xindong Wu

CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management

Pages 2059 - 2068

https://doi.org/10.1145/2505515.2505558

Published: 27 October 2013

Abstract

In addition to the news content, most web news pages also contain navigation panels, advertisements, related news links, etc. These non-news items exist not only outside the news region but also inside the news content region. Effectively extracting the news content and filtering out the noise is important for follow-up content management and analysis. Our extensive case studies indicate that web content layouts are potentially related to their tag paths. Based on this observation, we design two tag path features to measure the importance of nodes, Text to tag Path Ratio (TPR) and Extended Text to tag Path Ratio (ETPR), and describe how TPR is calculated by traversing the parse tree of a web news page. In this paper, we present Content Extraction via Path Ratios (CEPR), a fast, accurate and general on-line method that distinguishes news content from non-news content using the TPR/ETPR histogram. To improve the ability of CEPR to extract short texts, we propose a Gaussian smoothing method weighted by a tag path edit distance. This approach can enhance the importance of internal-link nodes while ignoring noise nodes that exist in the news content. Experimental results on the CleanEval datasets and on web news pages randomly selected from well-known websites show that CEPR can extract content across multiple sources, styles, and languages. The average F and average score of CEPR are 8.69% and 14.25% higher, respectively, than those of CETR, which demonstrates better web news extraction performance than most existing methods.
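The abstract does not give the TPR formula, so the following is only a minimal sketch under a stated assumption: the ratio for a tag path is taken to be the total number of text characters found under that path divided by the number of text nodes sharing it. The names `TagPathCollector` and `text_to_path_ratios`, the dotted path notation, and the use of Python's standard `html.parser` are illustrative choices, not the authors' implementation. ETPR and the edit-distance-weighted Gaussian smoothing are not covered.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags with no closing tag; they never enter the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# Tags whose text is not visible page content.
SKIP_TAGS = {"script", "style"}


class TagPathCollector(HTMLParser):
    """Records (tag_path, text_length) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, root first
        self.nodes = []  # (tag_path, number_of_characters), in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (self.stack and self.stack[-1] in SKIP_TAGS):
            self.nodes.append((".".join(self.stack), len(text)))


def text_to_path_ratios(html):
    """Return (nodes, ratio_by_path) under the ASSUMED definition:
    ratio = total text characters under a path / text nodes with that path."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    chars = defaultdict(int)
    count = defaultdict(int)
    for path, length in collector.nodes:
        chars[path] += length
        count[path] += 1
    ratios = {path: chars[path] / count[path] for path in chars}
    return collector.nodes, ratios
```

Under this assumed definition, a path that carries long paragraphs would tend to get a larger ratio than a path that carries many short link labels. Assigning each text node the ratio of its path, in document order, gives a sequence of values that could serve as input to a histogram-based separation step like the one the abstract describes.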


Index Terms

Web news extraction via path ratios
1. Information systems
  1. Information retrieval
    1. Document representation
    2. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction

Published In

CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management

October 2013

2612 pages

ISBN: 9781450322638

DOI: 10.1145/2505515

General Chairs: Qi He (LinkedIn, USA), Arun Iyengar (IBM T.J. Watson Research Center, USA)

Program Chairs: Wolfgang Nejdl (L3S Research Center, Germany), Jian Pei (Simon Fraser University, Canada), Rajeev Rastogi (Amazon, India)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CIKM'13: 22nd ACM International Conference on Information and Knowledge Management

October 27 - November 1, 2013

San Francisco, California, USA

Acceptance Rates

CIKM '13 Paper Acceptance Rate 143 of 848 submissions, 17%;

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

Zhong L, Wu J, Li Q, Peng H, Wu X (2023). A Comprehensive Survey on Automatic Knowledge Graph Construction. ACM Computing Surveys 56:4 (1-62). Online publication date: 30-Nov-2023.
https://dl.acm.org/doi/10.1145/3618295
Gong P, Li S (2023). Adaptive Behavior-aware Driven Intelligent Method for Detecting News Webpage Structure. 2023 8th International Conference on Information Systems Engineering (ICISE) (115-120). Online publication date: 23-Jun-2023.
https://doi.org/10.1109/ICISE60366.2023.00030
Aslam N, Tahir B, Shafiq H, Mehmood M (2019). Web-AM: An Efficient Boilerplate Removal Algorithm for Web Articles. 2019 International Conference on Frontiers of Information Technology (FIT) (287-2875). Online publication date: Dec-2019.
https://doi.org/10.1109/FIT47737.2019.00061
Yuliang W, Qi Z, Fang L, Xixian H, Guodong X, Bailing W (2019). A novel approach for Web page modeling in personal information extraction. World Wide Web 22:2 (603-620). Online publication date: 1-Mar-2019.
https://dl.acm.org/doi/10.1007/s11280-018-0631-9
Utiu N, Ionescu V (2018). Learning Web Content Extraction with DOM Features. 2018 IEEE 14th International Conference on Intelligent Computer Communication and Processing (ICCP) (5-11). Online publication date: Sep-2018.
https://doi.org/10.1109/ICCP.2018.8516632
Tan Z, He C, Fang Y, Ge B, Xiao W (2018). Title-Based Extraction of News Contents for Text Mining. IEEE Access 6 (64085-64095). Online publication date: 2018.
https://doi.org/10.1109/ACCESS.2018.2877592
Zhang J, Wang Q, Yang Q, Zhou R, Zhang Y (2018). Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content. Data Science and Engineering 3:2 (101-114). Online publication date: 7-Jun-2018.
https://doi.org/10.1007/s41019-018-0067-3
Zhang K, Zhang C, Chen X, Tan J (2018). Automatic Web News Extraction Based on DS Theory Considering Content Topics. Computational Science – ICCS 2018 (194-207). Online publication date: 11-Jun-2018.
https://dl.acm.org/doi/10.1007/978-3-319-93698-7_15
Sun J, Tang L, Liao D, Chang V (2017). An efficient method for extracting web news content. 2017 International Conference on Engineering and Technology (ICET) (1-5). Online publication date: Aug-2017.
https://doi.org/10.1109/ICEngTechnol.2017.8308202
Wang L, Wu G (2017). A Method Study of Online Publication Time Extraction for Chinese Web News. 2017 IEEE International Conference on Big Knowledge (ICBK) (242-247). Online publication date: Aug-2017.
https://doi.org/10.1109/ICBK.2017.57
