Abstract
Semantic block identification is an approach to retrieving information from Web pages and applications. As Website design evolves, however, traditional methodologies no longer perform well. This paper proposes a new model that merges Web page content into semantic blocks by simulating human perception. A “layer tree” is constructed to remove hierarchical inconsistencies between the DOM tree representation and the visual layout of the Web page. The Gestalt laws of grouping are then interpreted as rules for semantic block detection. To operationalize these laws, the normalized Hausdorff distance, the CIE-Lab color difference, the normalized compression distance, and a range of visual information are used. Finally, a classifier is trained to combine the operationalized laws into a unified rule for identifying semantic blocks in a Web page. Experiments compare the performance of the model with that of a state-of-the-art algorithm, VIPS. The first experiment shows that the proposed GLM model generates more “true positives” and fewer “false negatives” than VIPS. The second experiment, on a large-scale test set, yields an average precision of 90.53 % and a recall of 90.85 %, approximately 25 % better than VIPS.
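To make the three distance measures named in the abstract concrete, the following is a minimal Python sketch of their textbook forms. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the Hausdorff distance is normalized, which CIE-Lab difference formula is used, or which compressor backs the compression distance. Here the plain (unnormalized) symmetric Hausdorff distance, the simple CIE76 difference, and zlib are used as stand-ins, and all function names are hypothetical.

```python
import math
import zlib


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets.

    Assumption: plain, unnormalized form; the paper's normalization
    is not described in the abstract.
    """
    def directed(p, q):
        # Largest distance from a point in p to its nearest point in q.
        return max(min(math.dist(u, v) for v in q) for u in p)

    return max(directed(a, b), directed(b, a))


def delta_e_cie76(lab1, lab2):
    """CIE76 colour difference: Euclidean distance in CIE-Lab space.

    Assumption: CIE76 is used here as the simplest CIE-Lab difference.
    """
    return math.dist(lab1, lab2)


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance with zlib as the compressor.

    Assumption: zlib is an illustrative choice of compressor.
    """
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Corner points of two hypothetical page elements.
    print(hausdorff([(0, 0), (1, 0)], [(0, 0), (4, 0)]))
    # Two hypothetical colours given as (L, a, b) triples.
    print(delta_e_cie76((50, 0, 0), (50, 3, 4)))
    # Markup of one element compared with itself and with unrelated bytes.
    markup = b"<div class='item'>news</div>" * 50
    print(ncd(markup, markup), ncd(markup, bytes(range(256)) * 4))
```

In each case a smaller value means the two elements are closer in position, color, or content, which is the kind of evidence a classifier could combine into a single grouping decision.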






Notes
http://www.alexa.com/topsites. The top sites were retrieved on April 4, 2014.
Acknowledgments
The authors thank the China Scholarship Council (CSC) for its financial support.
About this article
Cite this article
Xu, Z., Miller, J. Identifying semantic blocks in Web pages using Gestalt laws of grouping. World Wide Web 19, 957–978 (2016). https://doi.org/10.1007/s11280-015-0370-0
DOI: https://doi.org/10.1007/s11280-015-0370-0