Web news extraction via path ratios

Published: 27 October 2013

Abstract

In addition to the news content, most web news pages contain navigation panels, advertisements, related news links, and similar items. These non-news items appear not only outside the news region but also inside the news content region. Extracting the news content effectively and filtering out the noise matters for the downstream tasks of content management and analysis. Our extensive case studies indicate a potential correlation between web content layouts and their tag paths. Based on this observation, we design two tag path features to measure the importance of nodes, the Text to tag Path Ratio (TPR) and the Extended Text to tag Path Ratio (ETPR), and describe how TPR is calculated by traversing the parse tree of a web news page. In this paper, we present Content Extraction via Path Ratios (CEPR), a fast, accurate, and general on-line method that distinguishes news content from non-news content using the TPR/ETPR histogram. To improve the ability of CEPR to extract short texts, we propose a Gaussian smoothing method weighted by a tag path edit distance. This smoothing raises the importance of internal-link nodes while still ignoring noise nodes within the news content. Experimental results on the CleanEval datasets and on web news pages randomly selected from well-known websites show that CEPR extracts content across multiple sources, styles, and languages. The average F-measure and average score of CEPR are 8.69% and 14.25% higher, respectively, than those of CETR, which demonstrates better web news extraction performance than most existing methods.
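As a rough illustration of the TPR idea, the sketch below walks an HTML document, records the tag path of every text node, and computes one ratio per distinct path. The abstract does not state the formula, so the definition used here (total text characters at a path divided by the number of text nodes sharing that path) is an assumption, and the names `TagPathCollector` and `text_to_tag_path_ratios` are hypothetical. It is not the authors' implementation and omits ETPR and the weighted Gaussian smoothing.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are kept off the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Records (tag_path, text_length) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags, root first
        self.text_nodes = []   # (tag path string, character count) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        # Skip whitespace-only nodes and the contents of script/style.
        if text and self.stack and self.stack[-1] not in ("script", "style"):
            self.text_nodes.append((".".join(self.stack), len(text)))


def text_to_tag_path_ratios(html):
    """Assumed TPR: characters of text at a tag path / text nodes at that path."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    chars = defaultdict(int)
    nodes = defaultdict(int)
    for path, length in collector.text_nodes:
        chars[path] += length
        nodes[path] += 1
    return {path: chars[path] / nodes[path] for path in chars}


if __name__ == "__main__":
    page = ("<html><body><div><p>" + "a" * 40 + "</p><p>" + "b" * 60 + "</p></div>"
            "<ul><li><a>x</a></li><li><a>yz</a></li></ul></body></html>")
    for path, ratio in text_to_tag_path_ratios(page).items():
        print(path, ratio)
```

Under this assumed definition, paths that carry long paragraphs receive high ratios, while paths shared by many short link labels receive low ones, which is the kind of separation a histogram-based threshold could exploit.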

References

[1]
Gibson, D., Punera, K. and Tomkins, A. 2005. The volume and evolution of web page templates. In Proceedings of WWW '05. New York, NY, USA, ACM Press, 830--839.
[2]
Wu, X., Wu, G.-Q., Xie, F., Zhu, Z., Hu, X.-G., Lu, H., and Li, H. 2010. News filtering and summarization on the web. IEEE Intelligent Systems. 25(5): 68--76.
[3]
Doddington, G., Mitchell, A., Przybocki, M., Ramshaw, L., Strassel, S., and Weischedel, R. 2004. The Automatic Content Extraction (ACE) program--tasks, data, and evaluation. In Proceedings of LREC '04. 837--840.
[4]
Crescenzi, V. and Mecca, G. 1998. Grammars have exceptions. Information Systems. December 1998, 23(8): 539--565.
[5]
Arocena, G.O. and Mendelzon, A.O. 1998. WebOQL: Restructuring documents, databases, and webs, In Proceedings of ICDE '98. Orlando, Florida, USA, Feb 23-27, 1998, 24--33.
[6]
Sahuguet, A. and Azavant, F. 2001. Building intelligent web applications using lightweight wrappers. Data and Knowledge Engineering. March 2001, 36(3): 283--316.
[7]
Liu, L., Pu, C., and Han, W. 2000. XWRAP: An XML-enabled wrapper construction system for web information sources, In Proceedings of ICDE '00. San Diego, California, USA, February 28-March 03, 2000, 611--621.
[8]
Soderland, S. 1999. Learning information extraction rules for semi-structured and free text. Journal of Machine Learning. February 1999, 34(1-3): 233--272.
[9]
Laender, A.H.F., Ribeiro-Neto, B., and Silva, A.S. 2002. DEByE - Data extraction by example. Data and Knowledge Engineering. February 2002, 40(2): 121--154.
[10]
Hsu, C.N. and Dung, M.T. 1998. Generating finite-state transducers for semi-structured data extraction from the web. Journal of Information Systems. 1998, 23(8): 521--538.
[11]
Freitag, D. 1998. Information extraction from HTML: Application of a general learning approach, In Proceedings of AAAI '98. Madison, Wisconsin, USA, July 26-30, 1998, 517--523.
[12]
Wu, G. and Wu, X. 2012. Extracting Web News Using Tag Path Patterns. In Proceedings of WI-IAT '12, Macau, China, December 4-7, 2012, 588--595.
[13]
Wu, X., Xie, F., Wu, G.-Q., and Ding, W. 2011. Personalized news filtering and summarization on the web, In Proceedings of ICTAI '11. Boca Raton, Florida, USA, November 07-09, 2011.
[14]
Chang, C.H. and Lui, S.C. 2001. IEPAD: Information extraction based on pattern discovery. In Proceedings of WWW '01. Hong-Kong, China, May 01-05, 2001, 223--231.
[15]
Chang, C.H. and Kuo, S.C. 2004. OLERA: A semi-supervised approach for web data extraction with visual support. IEEE Intelligent Systems. November 2004, 19(6): 56--64.
[16]
Hogue, A. and Karger, D. 2005. Thresher: automating the unwrapping of semantic content from the World Wide Web. In Proceedings of WWW '05. ACM, New York, NY, USA, 86--95.
[17]
Crescenzi, V., Mecca, G., and Merialdo, P. 2001. RoadRunner: Towards automatic data extraction from large web sites. In Proceedings of VLDB '01. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 109--118.
[18]
Arasu, A. and Garcia-Molina, H. 2003. Extracting structured data from web pages. In Proceedings of SIGMOD '03. ACM, New York, NY, USA, 337--348.
[19]
Wang, J. and Lochovsky, F.H. 2003. Data extraction and label assignment for web databases. In Proceedings of WWW '03. Budapest, Hungary, May 20-24, 2003, 187--196.
[20]
Zhai, Y. and Liu, B. 2005. Web data extraction based on partial tree alignment. In Proceedings of WWW '05. Japan, 2005, 76--85.
[21]
Liu, B. and Zhai, Y. 2005. NET - A system for extracting web data from flat and nested data records, In Proceedings of WISE '05. 487--495.
[22]
Cai, D., He, X., Wen, J.R., and Ma, W.Y. 2004. Block-level link analysis. In Proceedings of SIGIR '04. Sheffield, UK, July 25-29, 2004, 440--447.
[23]
Zheng, S., Song, R., and Wen, J.R. 2007. Template-independent news extraction based on visual consistency. In Proceedings of AAAI '07. Vancouver, British Columbia, Anthony Cohn (Ed.), Vol. 2. AAAI Press, 1507--1512.
[24]
Wang, J., Chen, C., Wang, C., Pei, J., Bu, J., Guan, Z., and Zhang, W.V. 2009. Can we learn a template-independent wrapper for news article extraction from a single training site? In Proceedings of KDD '09. Paris, France, 1345--1354.
[25]
Baroni, M., Chantree, F., Kilgarriff, A., and Sharoff, S. 2008. Cleaneval: a competition for cleaning web pages. In Proceedings of LREC '08. Marrakech, Morocco, May 28-30, 2008, 638--643.
[26]
Gottron, T. 2008. Content code blurring: a new approach to content extraction. In Proceedings of DEXA '08. IEEE Computer Society, Washington, DC, USA, 29--33.
[27]
Weninger, T., Hsu, W.H., and Han, J. 2010. CETR: content extraction via tag ratios. In Proceedings of WWW '10. Raleigh, North Carolina, USA, April 26-30, 2010, 971--980.
[28]
Gulhane, P., Madaan, A., Mehta, R., Ramamirtham, J., Rastogi, R., Satpal, S., Sengamedu, S. H., Tengli, A., and Tiwari, C. 2011. Web-scale information extraction with vertex. In Proceedings of ICDE '11. Hannover, Apr 11-16, 2011, 1209--1220.
[29]
Levenshtein V.I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8): 707--710.

Published In

CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management
October 2013
2612 pages
ISBN:9781450322638
DOI:10.1145/2505515
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. content extraction
  2. text to tag path ratio
  3. web news
  4. weighted gaussian smoothing

Qualifiers

  • Research-article

Conference

CIKM'13: 22nd ACM International Conference on Information and Knowledge Management
October 27 - November 1, 2013
San Francisco, California, USA

Acceptance Rates

CIKM '13 Paper Acceptance Rate 143 of 848 submissions, 17%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

  • (2023) A Comprehensive Survey on Automatic Knowledge Graph Construction. ACM Computing Surveys 56:4 (1-62). DOI: 10.1145/3618295. Online publication date: 30-Nov-2023
  • (2023) Adaptive Behavior-aware Driven Intelligent Method for Detecting News Webpage Structure. 2023 8th International Conference on Information Systems Engineering (ICISE) (115-120). DOI: 10.1109/ICISE60366.2023.00030. Online publication date: 23-Jun-2023
  • (2019) Web-AM: An Efficient Boilerplate Removal Algorithm for Web Articles. 2019 International Conference on Frontiers of Information Technology (FIT) (287-2875). DOI: 10.1109/FIT47737.2019.00061. Online publication date: Dec-2019
  • (2019) A novel approach for Web page modeling in personal information extraction. World Wide Web 22:2 (603-620). DOI: 10.1007/s11280-018-0631-9. Online publication date: 1-Mar-2019
  • (2018) Learning Web Content Extraction with DOM Features. 2018 IEEE 14th International Conference on Intelligent Computer Communication and Processing (ICCP) (5-11). DOI: 10.1109/ICCP.2018.8516632. Online publication date: Sep-2018
  • (2018) Title-Based Extraction of News Contents for Text Mining. IEEE Access 6 (64085-64095). DOI: 10.1109/ACCESS.2018.2877592. Online publication date: 2018
  • (2018) Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content. Data Science and Engineering 3:2 (101-114). DOI: 10.1007/s41019-018-0067-3. Online publication date: 7-Jun-2018
  • (2018) Automatic Web News Extraction Based on DS Theory Considering Content Topics. Computational Science – ICCS 2018 (194-207). DOI: 10.1007/978-3-319-93698-7_15. Online publication date: 11-Jun-2018
  • (2017) An efficient method for extracting web news content. 2017 International Conference on Engineering and Technology (ICET) (1-5). DOI: 10.1109/ICEngTechnol.2017.8308202. Online publication date: Aug-2017
  • (2017) A Method Study of Online Publication Time Extraction for Chinese Web News. 2017 IEEE International Conference on Big Knowledge (ICBK) (242-247). DOI: 10.1109/ICBK.2017.57. Online publication date: Aug-2017
