Combining Text and Link Analysis for Focused Crawling

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 3686)

Abstract

The number of vertical search engines and portals has increased rapidly in recent years, making the importance of topic-driven (focused) crawlers evident. In this paper, we develop a latent semantic indexing classifier that combines link analysis with text content in order to retrieve and index domain-specific web documents. We compare its efficiency with that of other well-known web information retrieval techniques. Our implementation presents a different approach to focused crawling and aims to remove the need to provide initial training data while maintaining a high recall/precision ratio.
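
To make the idea of combining text and links in a latent semantic space concrete, the following is a minimal sketch and not the authors' classifier. The abstract does not say how the two signals are merged, so stacking term rows and link rows into one matrix is an assumption, as are the function name `lsi_scores`, the `link_weight` parameter, and the toy corpus. The query is folded into the truncated SVD space in the usual latent semantic indexing manner and documents are ranked by cosine similarity.

```python
import numpy as np


def lsi_scores(text, links, query_terms, rank=2, link_weight=1.0):
    """Cosine score of each document against a term query in a rank-`rank` latent space.

    text:  (n_terms, n_docs) term counts.
    links: (n_targets, n_docs), links[i, j] = 1 if document j links to target i.
    query_terms: (n_terms,) weights describing the topic.
    """
    # Assumption: stack term rows and out-link rows so every document
    # column carries both signals; link_weight balances the two.
    combined = np.vstack([text, link_weight * links]).astype(float)

    # Truncated SVD: keep the `rank` largest singular triplets.
    u, s, vt = np.linalg.svd(combined, full_matrices=False)
    u_k, s_k, docs_latent = u[:, :rank], s[:rank], vt[:rank, :].T

    # Fold the query in: pad the link rows with zeros, then q_hat = q^T U_k S_k^{-1}.
    query = np.concatenate([query_terms, np.zeros(links.shape[0])])
    q_latent = (query @ u_k) / s_k

    # Cosine similarity between the folded-in query and each document row.
    eps = 1e-12  # guards against a zero-length latent vector
    norms = np.linalg.norm(docs_latent, axis=1) * np.linalg.norm(q_latent)
    return (docs_latent @ q_latent) / (norms + eps)


# Hypothetical toy corpus. Term rows: crawler, spider, recipe, oven.
text = np.array([
    [2, 0, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
])
# Two hypothetical link targets: documents 0 and 1 link to the first,
# documents 2 and 3 link to the second.
links = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])
query = np.array([1, 0, 0, 0])  # topic described by the single term "crawler"

scores = lsi_scores(text, links, query)
print(np.argsort(-scores))  # document indices, best match first
```

In this toy corpus document 1 never contains the query term "crawler", yet it scores as high as document 0 because it shares the term "spider" and a link target with it, while the two cooking documents score near zero. A focused crawler could use such scores to decide which fetched pages to index and which outgoing links to follow first.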

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Almpanidis, G., Kotropoulos, C. (2005). Combining Text and Link Analysis for Focused Crawling. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds) Pattern Recognition and Data Mining. ICAPR 2005. Lecture Notes in Computer Science, vol 3686. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551188_30

  • DOI: https://doi.org/10.1007/11551188_30

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28757-5

  • Online ISBN: 978-3-540-28758-2

  • eBook Packages: Computer Science, Computer Science (R0)
