Abstract
This paper presents a model for indexing and querying web pages based on the hierarchical decomposition of pages into blocks. Splitting a page into blocks has several advantages for page design, indexing and querying: (i) the blocks of a page most similar to a query can be returned instead of the whole page; (ii) the importance of each block can be taken into account; and (iii) so can the permeability of a block to its neighbor blocks, where a block b is said to be permeable to a block b′ in the same page if the content of b′ (text, image, etc.) can be (partially) inherited by b upon indexing. An engine implementing this model is described, including the transformation of web pages into block hierarchies, the definition of a dedicated language for expressing indexing rules, and the storage of indexed blocks in an XML repository. The model is assessed on a dataset of electronic news and on a dataset drawn from web pages of the ImagEval campaign, where it improves the mean average precision by 16% over the baseline.
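The notion of permeability can be illustrated with a minimal sketch. It is not the BlockWeb engine or its rule language: the function name `index_block`, the per-neighbor coefficient in [0, 1], and the simple additive weighting of term counts are all assumptions made here for illustration only.

```python
from collections import Counter


def index_block(own_terms, neighbor_terms, permeability):
    """Build a term-weight index for one block (hypothetical helper).

    own_terms:      list of terms found in the block itself
    neighbor_terms: dict mapping a neighbor block name to its list of terms
    permeability:   dict mapping a neighbor block name to an assumed
                    coefficient in [0, 1]; 0 means nothing is inherited
    """
    # Start from the block's own term counts, with full weight.
    index = {term: float(count) for term, count in Counter(own_terms).items()}

    # Add a fraction of each neighbor's term counts (partial inheritance).
    for name, terms in neighbor_terms.items():
        alpha = permeability.get(name, 0.0)
        if alpha == 0.0:
            continue
        for term, count in Counter(terms).items():
            index[term] = index.get(term, 0.0) + alpha * count
    return index


# An image block with no text of its own inherits half the weight
# of the terms in a neighboring caption block.
image_index = index_block(
    own_terms=[],
    neighbor_terms={"caption": ["eiffel", "tower"]},
    permeability={"caption": 0.5},
)
print(image_index)
```

In this sketch an image block, which has no text of its own, becomes retrievable through the terms of its caption, which is the kind of effect the abstract attributes to permeability.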
Cite this article
Bruno, E., Faessel, N., Glotin, H. et al. Indexing and querying segmented web pages: the BlockWeb Model. World Wide Web 14, 623–649 (2011). https://doi.org/10.1007/s11280-011-0124-6
DOI: https://doi.org/10.1007/s11280-011-0124-6
Keywords
- web page segmentation
- block importance
- block permeability
- web image indexing
- document indexing
- document retrieval