Identifying semantic blocks in Web pages using Gestalt laws of grouping

World Wide Web

Abstract

Semantic block identification is an approach to retrieving information from Web pages and applications. As Website design evolves, however, traditional methodologies no longer perform well. This paper proposes a new model that merges Web page content into semantic blocks by simulating human perception. A “layer tree” is constructed to remove hierarchical inconsistencies between the DOM tree representation and the visual layout of the Web page. Subsequently, the Gestalt laws of grouping are interpreted as rules for semantic block detection. In this interpretation, the normalized Hausdorff distance, the CIE-Lab color difference, the normalized compression distance, and a series of visual information are proposed to operationalize the Gestalt laws. Finally, a classifier is trained to combine the operationalized laws into a unified rule for identifying semantic blocks in a Web page. Experiments compare the performance of the model with that of a state-of-the-art algorithm, VIPS. The first experiment shows that the proposed model (GLM) generates more “true positives” and fewer “false negatives” than VIPS. The second experiment, on a large-scale test set, yields an average precision of 90.53% and a recall of 90.85%, approximately 25% better than VIPS.
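To make the three distance measures named in the abstract more concrete, the following Python sketch shows one common textbook form of each. It is an illustration under stated assumptions, not the implementation used in the paper: the function names are hypothetical, the normalization of the Hausdorff distance by a page diagonal is assumed, the color difference uses the simple CIE76 formula (Euclidean distance in CIE-Lab) rather than a more elaborate variant, and zlib stands in for whichever compressor the authors chose.

```python
import math
import zlib


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty sets of 2-D points."""
    def directed(p, q):
        # Largest distance from any point of p to its nearest neighbour in q.
        return max(min(math.dist(u, v) for v in q) for u in p)
    return max(directed(a, b), directed(b, a))


def normalized_hausdorff(a, b, diagonal):
    # ASSUMPTION: normalize by the page diagonal so the value is comparable
    # across pages of different sizes; the paper's normalization may differ.
    return hausdorff(a, b) / diagonal


def delta_e_cie76(lab1, lab2):
    # CIE76 color difference: Euclidean distance between two (L*, a*, b*) triples.
    # ASSUMPTION: chosen for brevity; other CIE-Lab difference formulas exist.
    return math.dist(lab1, lab2)


def ncd(x, y):
    """Normalized compression distance between two byte strings."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical corner points of two rendered elements, in pixels.
    box_a = [(0, 0), (100, 0), (0, 40), (100, 40)]
    box_b = [(0, 50), (100, 50), (0, 90), (100, 90)]
    print(normalized_hausdorff(box_a, box_b, diagonal=math.hypot(1280, 800)))

    # Two hypothetical background colors in CIE-Lab.
    print(delta_e_cie76((50.0, 2.0, -3.0), (52.0, 2.5, -1.0)))

    # Markup of two hypothetical sibling elements.
    print(ncd(b"<li><a href='/a'>Home</a></li>", b"<li><a href='/b'>News</a></li>"))
```

Small values of each measure suggest that two elements are close together, similarly colored, or similar in content, respectively. In a pipeline like the one the abstract describes, such values would serve as input features to the trained classifier rather than as decisions on their own.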




Notes

  1. http://getbootstrap.com/.

  2. https://github.com/tpopela/vips_java.

  3. http://www.alexa.com/topsites. The top sites were retrieved on April 4, 2014.


Acknowledgments

The authors give thanks to China Scholarship Council (CSC) for their financial support.

Author information


Corresponding author

Correspondence to James Miller.


About this article


Cite this article

Xu, Z., Miller, J. Identifying semantic blocks in Web pages using Gestalt laws of grouping. World Wide Web 19, 957–978 (2016). https://doi.org/10.1007/s11280-015-0370-0


  • DOI: https://doi.org/10.1007/s11280-015-0370-0
