ABSTRACT
In this paper, we present a framework for evaluating Web page segmentation algorithms. Web page segmentation consists of dividing a Web page into coherent fragments, called blocks, each representing one distinct information element of the page. We define an evaluation model with several metrics that assess the quality of the segmentation produced by a given algorithm. These metrics compute the distance between that segmentation and a manually built segmentation that serves as ground truth. We apply our framework to four state-of-the-art segmentation algorithms (BOM, Block Fusion, VIPS and JVIPS) on several categories (types) of Web pages. The results show that the tested algorithms usually perform rather well for text extraction but may have serious problems extracting geometry. They also show that the relative quality of a segmentation algorithm depends on the category of the segmented page.
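The idea of measuring how far a computed segmentation lies from a ground-truth segmentation can be illustrated with a minimal sketch. It is not the evaluation model of the paper: it assumes that each block is an axis-aligned rectangle, uses intersection-over-union (IoU) as a stand-in similarity measure, and introduces the hypothetical names `Rect`, `iou`, and `matched_blocks` and an arbitrary threshold of 0.5 purely for illustration.

```python
from typing import List, Tuple

# Assumption: a block is an axis-aligned rectangle (x, y, width, height).
Rect = Tuple[float, float, float, float]


def area(r: Rect) -> float:
    """Area of a rectangle; negative sizes are treated as zero."""
    return max(r[2], 0.0) * max(r[3], 0.0)


def overlap_area(a: Rect, b: Rect) -> float:
    """Area of the intersection of two rectangles (0 if they are disjoint)."""
    w = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    h = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def iou(a: Rect, b: Rect) -> float:
    """Intersection-over-union: 1.0 for identical blocks, 0.0 for disjoint ones."""
    inter = overlap_area(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def matched_blocks(truth: List[Rect], computed: List[Rect],
                   threshold: float = 0.5) -> int:
    """Count ground-truth blocks covered by some computed block with IoU >= threshold.

    The threshold is an illustrative choice, not a value taken from the paper.
    """
    return sum(1 for t in truth if any(iou(t, c) >= threshold for c in computed))


# Hypothetical page with two ground-truth blocks (e.g. a header and a body).
truth = [(0, 0, 100, 50), (0, 50, 100, 150)]
close_segmentation = [(0, 0, 100, 60), (0, 60, 100, 140)]
coarse_segmentation = [(0, 0, 100, 200)]  # the whole page as a single block

print(matched_blocks(truth, close_segmentation))
print(matched_blocks(truth, coarse_segmentation))
```

In this toy setting, a segmentation whose block boundaries are slightly shifted still matches both ground-truth blocks, whereas a segmentation that merges the whole page into one block matches only the larger one. A geometry-oriented metric of this kind says nothing about text extraction, which would need a separate measure.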
Index Terms
- Web page segmentation evaluation