Indexing and querying segmented web pages: the BlockWeb Model

World Wide Web

Abstract

We present a model for indexing and querying web pages based on the hierarchical decomposition of pages into blocks. Splitting a page into blocks has several advantages for page design, indexing and querying: (i) the blocks of a page most similar to a query may be returned instead of the page as a whole; (ii) the importance of a block can be taken into account; and (iii) so can the permeability of a block to its neighbor blocks. A block b is said to be permeable to a block b′ in the same page if the content of b′ (text, image, etc.) can be (partially) inherited by b upon indexing. We describe an engine implementing this model, including the transformation of web pages into block hierarchies, the definition of a dedicated language for expressing indexing rules, and the storage of indexed blocks in an XML repository. The model is assessed on a dataset of electronic news and on a dataset drawn from web pages of the ImagEval campaign, where it improves the mean average precision by 16% over the baseline.
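The idea of permeability can be illustrated with a minimal sketch. The abstract does not give the indexing formulas, so everything below is an assumption made for illustration: blocks are kept in a flat list rather than a hierarchy, a block's index is its own term frequencies scaled by a hypothetical `importance` weight plus the term frequencies of each block it is permeable to scaled by a permeability coefficient, and blocks are ranked by cosine similarity to the query. The names `Block`, `index_blocks` and `rank` are hypothetical and are not taken from the BlockWeb engine.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    text: str
    importance: float = 1.0  # assumed weight applied to the block's own terms


def term_frequencies(text):
    """Raw term counts of a whitespace-tokenized, lower-cased text."""
    tf = defaultdict(float)
    for term in text.lower().split():
        tf[term] += 1.0
    return tf


def index_blocks(blocks, permeability):
    """Build one term vector per block.

    permeability[(b, b2)] is the assumed fraction of b2's content
    that block b inherits when it is indexed.
    """
    local = {b.name: term_frequencies(b.text) for b in blocks}
    index = {}
    for b in blocks:
        vec = defaultdict(float)
        for term, tf in local[b.name].items():
            vec[term] += b.importance * tf
        for (dst, src), coeff in permeability.items():
            if dst == b.name:
                for term, tf in local[src].items():
                    vec[term] += coeff * tf
        index[b.name] = dict(vec)
    return index


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(index, query):
    """Return (block name, score) pairs, best match first."""
    q = dict(term_frequencies(query))
    scores = [(name, cosine(vec, q)) for name, vec in index.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# An image block with almost no text of its own inherits half of its caption.
blocks = [
    Block("image", "photo"),
    Block("caption", "sunset over marseille harbour"),
]
index = index_blocks(blocks, {("image", "caption"): 0.5})
ranking = rank(index, "marseille sunset")
```

In this toy page the image block matches the query only through the terms it inherits from its caption; with an empty permeability mapping its score would be zero. Returning `ranking` block by block, rather than scoring the page as a whole, corresponds to advantage (i) above.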

Corresponding author

Correspondence to Emmanuel Bruno.

About this article

Cite this article

Bruno, E., Faessel, N., Glotin, H. et al. Indexing and querying segmented web pages: the BlockWeb Model. World Wide Web 14, 623–649 (2011). https://doi.org/10.1007/s11280-011-0124-6

  • DOI: https://doi.org/10.1007/s11280-011-0124-6
