Focused Crawling Using Vision-Based Page Segmentation

  • Conference paper
Information Systems, Technology and Management (ICISTM 2012)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 285)

Abstract

Crawling the web to find pages relevant to desired topics is called focused crawling. In this paper we propose a focused crawling method based on the vision-based page segmentation (VIPS) algorithm. VIPS identifies related parts of a web page, which are called page blocks. The proposed method treats the text of a block as the link context of the links contained in that block. Link contexts are terms that appear around hyperlinks within the text of a web page. Because VIPS uses visual cues in the segmentation process and is independent of the HTML structure of the page, it can identify link contexts accurately. Our empirical study shows that the proposed focused crawling method achieves higher performance than existing state-of-the-art results.
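
The idea of using block text as link context can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the page has already been segmented into blocks (the VIPS step itself is not shown), and the block dictionaries, the cosine scoring over raw term counts, the example URLs, and the function names `score_links` and `build_frontier` are all hypothetical choices made for illustration.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def score_links(blocks, topic):
    """Score every link by the similarity of its block's text to the topic.

    `blocks` stands in for the output of a page segmenter: each block is a
    dict with the block's visible text and the URLs it contains (assumed
    format, not defined by the paper's abstract).
    """
    topic_vec = Counter(tokenize(topic))
    scores = {}
    for block in blocks:
        context_vec = Counter(tokenize(block["text"]))  # block text = link context
        s = cosine(context_vec, topic_vec)
        for url in block["links"]:
            # If a URL appears in several blocks, keep its best score.
            scores[url] = max(s, scores.get(url, 0.0))
    return scores


def build_frontier(scores):
    """Build a max-priority crawl frontier (heapq is a min-heap, so negate)."""
    heap = [(-s, url) for url, s in scores.items()]
    heapq.heapify(heap)
    return heap


# Hypothetical segmented page: one topical block and one navigation block.
blocks = [
    {"text": "Focused crawler research: topical web crawling papers",
     "links": ["http://example.org/crawling"]},
    {"text": "Site navigation: home, contact, privacy policy",
     "links": ["http://example.org/contact"]},
]
scores = score_links(blocks, "focused web crawling")
frontier = build_frontier(scores)
neg_score, best_url = heapq.heappop(frontier)
print(best_url, -neg_score)
```

In this sketch the link inside the topical block is popped from the frontier first, while the navigation link scores zero because its block shares no terms with the topic. A crawler following this pattern would fetch the highest-scoring URL next, segment the fetched page, and push its links back onto the same queue.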

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Naghibi, M., Rahmani, A.T. (2012). Focused Crawling Using Vision-Based Page Segmentation. In: Dua, S., Gangopadhyay, A., Thulasiraman, P., Straccia, U., Shepherd, M., Stein, B. (eds) Information Systems, Technology and Management. ICISTM 2012. Communications in Computer and Information Science, vol 285. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29166-1_1

  • DOI: https://doi.org/10.1007/978-3-642-29166-1_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29165-4

  • Online ISBN: 978-3-642-29166-1

  • eBook Packages: Computer Science, Computer Science (R0)
