DOI: 10.1145/1458082.1458237
Research article

A densitometric approach to web page segmentation

Published: 26 October 2008

Abstract

Web page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication, and full-text search. In this paper we describe a new approach to segmenting HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from Computer Vision. We use the notion of text density as a measure to identify the individual text segments of a web page, reducing the problem to a 1D partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.
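
The abstract does not spell out how text density is computed or how the 1D partitioning is solved, so the following is only an illustrative sketch under stated assumptions, not the paper's algorithm. It assumes density is measured as tokens per word-wrapped line, and that adjacent blocks are greedily merged when their relative density difference is small. The names `text_density` and `fuse_blocks`, the wrap width of 80 characters, and the threshold `max_delta` are all hypothetical choices made for this example.

```python
import textwrap


def text_density(text, wrap_width=80):
    """Tokens per wrapped line (an assumed definition of text density)."""
    lines = textwrap.wrap(text, width=wrap_width)
    if not lines:
        return 0.0
    return len(text.split()) / len(lines)


def fuse_blocks(blocks, max_delta=0.4, wrap_width=80):
    """Greedy 1D partitioning: merge neighbouring blocks of similar density.

    `blocks` is the page's text in document order, one string per block.
    `max_delta` is a hypothetical threshold on the relative density difference.
    """
    segments = []
    for block in blocks:
        if not block.strip():
            continue  # skip empty blocks; they carry no text
        if segments:
            d_prev = text_density(segments[-1], wrap_width)
            d_cur = text_density(block, wrap_width)
            # Relative difference in [0, 1]; both densities are > 0 here.
            delta = abs(d_prev - d_cur) / max(d_prev, d_cur)
            if delta <= max_delta:
                segments[-1] = segments[-1] + " " + block
                continue
        segments.append(block)
    return segments


if __name__ == "__main__":
    paragraph = " ".join(["word"] * 40)
    for segment in fuse_blocks(["Home", paragraph, paragraph]):
        print(round(text_density(segment), 2), segment[:30])
```

In this sketch a short navigation label such as "Home" has a much lower density than a full paragraph, so it stays in its own segment, while consecutive paragraphs of similar density are fused into one.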

    Published In

    CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management
    October 2008
    1562 pages
    ISBN:9781595939913
    DOI:10.1145/1458082
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. full-text extraction
    2. noise removal
    3. template detection
    4. web page segmentation

    Qualifiers

    • Research-article

    Conference

    CIKM08: Conference on Information and Knowledge Management
    October 26 - 30, 2008
    Napa Valley, California, USA

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Cited By

    • (2024) Internet Web page content block dataset and solutions for its data labelling simplification. DOI: 10.20334/2024-032-M. Online publication date: 2024
    • (2023) Web Page Content Block Identification with Extended Block Properties. Applied Sciences 13:9 (5680). DOI: 10.3390/app13095680. Online publication date: 5-May-2023
    • (2023) Understanding and Detecting Abused Image Hosting Modules as Malicious Services. Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (3213-3227). DOI: 10.1145/3576915.3623143. Online publication date: 15-Nov-2023
    • (2023) Web Page Segmentation: A DOM-Structural Cohesion Analysis Approach. Web Information Systems Engineering – WISE 2023 (319-333). DOI: 10.1007/978-981-99-7254-8_25. Online publication date: 21-Oct-2023
    • (2022) Multimodal Web Page Segmentation Using Self-organized Multi-objective Clustering. ACM Transactions on Information Systems 40:3 (1-49). DOI: 10.1145/3480966. Online publication date: 7-Mar-2022
    • (2022) Extracting the Main Content of Web Pages Using the First Impression Area. IEEE Access 10 (129958-129969). DOI: 10.1109/ACCESS.2022.3229080. Online publication date: 2022
    • (2021) Boilerplate Detection via Semantic Classification of TextBlocks. 2021 International Joint Conference on Neural Networks (IJCNN) (1-8). DOI: 10.1109/IJCNN52387.2021.9534308. Online publication date: 18-Jul-2021
    • (2021) Web Content Extraction by Weighing the Fundamental Contextual Rules. 2021 7th International Conference on Signal Processing and Intelligent Systems (ICSPIS) (01-08). DOI: 10.1109/ICSPIS54653.2021.9729342. Online publication date: 29-Dec-2021
    • (2021) Postal address extraction from the web: a comprehensive survey. Artificial Intelligence Review. DOI: 10.1007/s10462-021-09983-1. Online publication date: 14-Mar-2021
    • (2020) A Semantic Focused Web Crawler Based on a Knowledge Representation Schema. Applied Sciences 10:11 (3837). DOI: 10.3390/app10113837. Online publication date: 31-May-2020
