DOI: 10.1145/2695664.2695786
research-article

Web page segmentation evaluation

Published: 13 April 2015

ABSTRACT

In this paper, we present a framework for evaluating segmentation algorithms for Web pages. Web page segmentation consists of dividing a Web page into coherent fragments, called blocks. Each block represents one distinct information element in the page. We define an evaluation model that includes different metrics to evaluate the quality of a segmentation obtained with a given algorithm. These metrics compute the distance between the obtained segmentation and a manually built segmentation that serves as a ground truth. We apply our framework to four state-of-the-art segmentation algorithms (BOM, Block Fusion, VIPS and JVIPS) on several categories (types) of Web pages. Results show that the tested algorithms usually perform rather well for text extraction, but can have serious problems extracting geometry. They also show that the relative quality of a segmentation algorithm depends on the category of the segmented page.
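The abstract does not spell out the metrics themselves, so the following is only an illustrative sketch of what comparing an obtained segmentation with a ground truth could look like. It assumes that blocks are axis-aligned rectangles and that a ground-truth block counts as correctly found when some computed block overlaps it strongly. The `Rect` type, the `iou` and `matched_fraction` helpers, and the 0.8 threshold are all hypothetical choices made for this example, not the paper's definitions.

```python
from typing import List, Tuple

# Hypothetical block representation: (left, top, right, bottom) in pixels.
Rect = Tuple[float, float, float, float]


def area(r: Rect) -> float:
    """Area of a rectangle; degenerate or inverted rectangles have area 0."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def iou(a: Rect, b: Rect) -> float:
    """Intersection-over-union of two rectangles, in [0, 1]."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter_area = area(inter)
    union = area(a) + area(b) - inter_area
    return inter_area / union if union > 0 else 0.0


def matched_fraction(computed: List[Rect], truth: List[Rect],
                     threshold: float = 0.8) -> float:
    """Fraction of ground-truth blocks matched by at least one computed block.

    The threshold is an assumed tolerance, not a value taken from the paper.
    """
    if not truth:
        return 1.0 if not computed else 0.0
    matched = sum(
        1 for t in truth if any(iou(t, c) >= threshold for c in computed)
    )
    return matched / len(truth)


# Ground truth: a header block and one large content block.
truth = [(0, 0, 100, 50), (0, 50, 100, 200)]
# The algorithm finds the header but splits the content block in two.
computed = [(0, 0, 100, 50), (0, 50, 100, 120), (0, 120, 100, 200)]

print(matched_fraction(computed, truth))  # 0.5
```

In this toy page the header is recovered exactly, while the content block is over-segmented, so only half of the ground-truth blocks are matched. A geometry-oriented score of this kind can be low even when the extracted text is entirely correct.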


Published in

SAC '15: Proceedings of the 30th Annual ACM Symposium on Applied Computing
April 2015
2418 pages
ISBN: 9781450331968
DOI: 10.1145/2695664

Copyright © 2015 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SAC '15 Paper Acceptance Rate: 291 of 1,211 submissions, 24%
Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
