Adaptive Topical Web Crawling for Domain-Specific Resource Discovery Guided by Link-Context

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4293)

Abstract

Topical web crawling is an important technology for domain-specific resource discovery. By restricting themselves to web pages in a specific domain, topical crawlers achieve good recall as well as good precision. Intuitively, the text surrounding a link on an HTML page, its link-context, is a good summary of the target page. Motivated by this intuition, this paper investigates several alternative methods and argues that the link-context derived from the referring page's HTML tag tree provides rich guidance for keeping the crawler on a domain-specific topic. So that the crawler can obtain enough guidance from link-context, we first find referring pages by traversing backward from the seed URLs, and then build an initial term-based feature set by parsing the link-contexts extracted from those referring pages. The feature set is used to measure the similarity of crawled pages' link-contexts, and it is adaptively trained during crawling on link-contexts that lead to relevant pages. The paper also presents several metrics and an evaluation function for ranking URLs by page relevance. Comprehensive experiments show that this approach outperforms the Best-First and Breadth-First algorithms in both harvest rate and efficiency.
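The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the metrics or the evaluation function, so the sketch assumes raw term counts as feature weights, cosine similarity as the score, and a fixed additive update for the adaptive step. Link-contexts are assumed to be already extracted as plain strings, and all names (`build_features`, `reinforce`, `Frontier`, the example URLs) are hypothetical.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a link-context string and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def build_features(seed_contexts):
    """Initial term-based feature set: term counts over the seed link-contexts."""
    features = Counter()
    for context in seed_contexts:
        features.update(tokenize(context))
    return features


def cosine(features, context):
    """Cosine similarity between the feature weights and one link-context."""
    counts = Counter(tokenize(context))
    dot = sum(features.get(term, 0.0) * n for term, n in counts.items())
    norm_f = math.sqrt(sum(w * w for w in features.values()))
    norm_c = math.sqrt(sum(n * n for n in counts.values()))
    if norm_f == 0 or norm_c == 0:
        return 0.0
    return dot / (norm_f * norm_c)


def reinforce(features, context, rate=0.5):
    """Adaptive step (assumed form): boost the terms of a link-context
    whose target page turned out to be relevant."""
    for term in tokenize(context):
        features[term] += rate


class Frontier:
    """Priority queue of unvisited URLs, highest score first."""

    def __init__(self):
        self._heap = []

    def push(self, url, score):
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score


# Seed link-contexts, standing in for those gathered by backward traversal.
features = build_features([
    "machine learning course notes",
    "learning algorithms tutorial",
])

# Hypothetical outgoing links of a crawled page, with their link-contexts.
links = {
    "http://example.org/ml": "notes on machine learning algorithms",
    "http://example.org/shop": "buy cheap shoes online",
}

frontier = Frontier()
for url, context in links.items():
    frontier.push(url, cosine(features, context))

best_url, best_score = frontier.pop()

# If the fetched page is judged relevant, its link-context trains the features.
reinforce(features, links[best_url])
```

In this sketch the on-topic link is popped first because its context shares terms with the seed contexts, while the off-topic link scores zero. A real crawler would also need the tag-tree extraction of link-contexts and a relevance judgment for fetched pages, both of which are left out here.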

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Peng, T., He, F., Zuo, W., Zhang, C. (2006). Adaptive Topical Web Crawling for Domain-Specific Resource Discovery Guided by Link-Context. In: Gelbukh, A., Reyes-Garcia, C.A. (eds) MICAI 2006: Advances in Artificial Intelligence. MICAI 2006. Lecture Notes in Computer Science, vol 4293. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11925231_92

  • DOI: https://doi.org/10.1007/11925231_92

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-49026-5

  • Online ISBN: 978-3-540-49058-6

  • eBook Packages: Computer Science, Computer Science (R0)
