Abstract
Focused crawling is the task of crawling the web to find pages relevant to a set of desired topics. In this paper we propose a focused crawling method based on the vision-based page segmentation (VIPS) algorithm. VIPS partitions a web page into groups of related content, called page blocks. The proposed method treats the text of a block as the link context of every link contained in that block. Link contexts are the terms that appear around a hyperlink within the text of a web page. Because VIPS uses visual cues in the segmentation process and is independent of the page's HTML structure, it can identify link contexts accurately. Our empirical study shows that the proposed focused crawling method achieves higher performance than existing state-of-the-art results.
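The idea of using block text as link context can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the page has already been segmented (VIPS itself is not implemented here), and the `Block` class, the `rank_links` function, and the cosine-similarity scoring against a topic string are all hypothetical stand-ins chosen for illustration; the abstract does not say how link contexts are scored.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Block:
    """A page block as a segmenter such as VIPS might return it (hypothetical shape)."""
    text: str
    links: list = field(default_factory=list)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def link_contexts(blocks):
    """Use the whole text of a block as the context of every link inside it.

    If the same URL occurs in several blocks, the last block wins in this sketch.
    """
    return {url: block.text for block in blocks for url in block.links}


def rank_links(blocks, topic):
    """Order links by the similarity of their block-level context to the topic.

    Assumption: cosine similarity on raw term counts stands in for whatever
    relevance model a real focused crawler would use.
    """
    topic_vec = Counter(tokenize(topic))
    scored = [
        (cosine(Counter(tokenize(context)), topic_vec), url)
        for url, context in link_contexts(blocks).items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(url, score) for score, url in scored]


if __name__ == "__main__":
    blocks = [
        Block("Cheap flights and hotel deals", ["http://example.org/travel"]),
        Block("Focused crawler tutorials and web crawling research",
              ["http://example.org/crawling"]),
    ]
    for url, score in rank_links(blocks, "focused web crawling"):
        print(f"{score:.3f}  {url}")
```

A crawler built along these lines would fetch the highest-scoring links first, so links sitting in an on-topic block are visited before links in unrelated blocks such as navigation bars or advertisements.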
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Naghibi, M., Rahmani, A.T. (2012). Focused Crawling Using Vision-Based Page Segmentation. In: Dua, S., Gangopadhyay, A., Thulasiraman, P., Straccia, U., Shepherd, M., Stein, B. (eds) Information Systems, Technology and Management. ICISTM 2012. Communications in Computer and Information Science, vol 285. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29166-1_1
DOI: https://doi.org/10.1007/978-3-642-29166-1_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29165-4
Online ISBN: 978-3-642-29166-1
eBook Packages: Computer Science, Computer Science (R0)