Abstract
Topical web crawling is an important technology for domain-specific resource discovery. By restricting themselves to web pages from a specific domain, topical crawlers achieve good recall as well as good precision. There is an intuition that the text surrounding a link, that is, the link-context on the HTML page, is a good summary of the target page. Motivated by this, the paper investigates several alternative methods and argues that the link-context derived from the referring page's HTML tag tree can provide a wealth of guidance for steering a crawler to stay on a domain-specific topic. So that the crawler can acquire enough guidance from link-context, we first find referring pages by traversing backward from the seed URLs, and then build an initial term-based feature set by parsing the link-contexts extracted from those referring pages. The feature set, which is used to measure the similarity of the link-contexts on crawled pages, can be adaptively trained during crawling with link-contexts that point to relevant pages. This paper also presents several important metrics and an evaluation function for ranking URLs by page relevance. A comprehensive experiment was conducted, and the results show clearly that this approach outperforms the Best-First and Breadth-First algorithms in both harvest rate and efficiency.
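The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the metrics, the evaluation function, or the training rule, so everything here is an assumption made for illustration. In particular, the names `link_contexts`, `cosine`, and `update_features`, the use of cosine similarity, the `levels` parameter (how many ancestors to climb in the tag tree), and the additive `rate` update are all hypothetical.

```python
import heapq
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not become "current".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class Node:
    """One element of a simplified HTML tag tree."""
    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs or []), parent
        self.children = []  # child Nodes and text strings, in document order

    def text(self):
        return " ".join(c if isinstance(c, str) else c.text()
                        for c in self.children)

class TagTreeBuilder(HTMLParser):
    """Builds the tag tree and remembers every <a href=...> node."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag == "a" and node.attrs.get("href"):
            self.anchors.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray closes.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def link_contexts(html, levels=1):
    """Return (href, terms) pairs; terms come from the subtree rooted
    `levels` ancestors above each anchor (levels=0 is the anchor text)."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    contexts = []
    for anchor in builder.anchors:
        node = anchor
        for _ in range(levels):
            if node.parent is not None and node.parent.tag != "#root":
                node = node.parent
        contexts.append((anchor.attrs["href"], tokenize(node.text())))
    return contexts

def cosine(features, terms):
    """Cosine similarity between a weighted feature set and a bag of terms."""
    counts = Counter(terms)
    dot = sum(w * counts[t] for t, w in features.items())
    norm = (math.sqrt(sum(w * w for w in features.values()))
            * math.sqrt(sum(c * c for c in counts.values())))
    return dot / norm if norm else 0.0

def update_features(features, terms, rate=0.1):
    """Assumed training rule: reinforce terms seen in a relevant link-context."""
    for t in terms:
        features[t] = features.get(t, 0.0) + rate

page = """
<html><body>
<div><p>Focused crawler research: <a href="/crawl">topical crawling survey</a></p></div>
<div><p>Weekend recipes: <a href="/soup">tomato soup</a></p></div>
</body></html>
"""
features = {"crawler": 1.0, "crawling": 1.0, "topical": 1.0}
frontier = []  # min-heap on negated score, so the best URL pops first
for url, terms in link_contexts(page, levels=1):
    heapq.heappush(frontier, (-cosine(features, terms), url))
ranked = [heapq.heappop(frontier)[1] for _ in range(len(frontier))]
print(ranked)
```

In this toy run the link whose surrounding paragraph shares terms with the feature set is ranked ahead of the unrelated one. A crawler following the abstract's idea would then call something like `update_features` with the link-context of each page judged relevant, so that the feature set adapts as the crawl proceeds.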
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Peng, T., He, F., Zuo, W., Zhang, C. (2006). Adaptive Topical Web Crawling for Domain-Specific Resource Discovery Guided by Link-Context. In: Gelbukh, A., Reyes-Garcia, C.A. (eds) MICAI 2006: Advances in Artificial Intelligence. MICAI 2006. Lecture Notes in Computer Science, vol 4293. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11925231_92
DOI: https://doi.org/10.1007/11925231_92
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49026-5
Online ISBN: 978-3-540-49058-6
eBook Packages: Computer Science, Computer Science (R0)