Abstract
Topical web crawling is an important technology for domain-specific resource discovery. By restricting themselves to web pages from a specific domain, topical crawlers achieve good recall as well as good precision. Intuitively, the text surrounding a link on an HTML page, its link-context, is a good summary of the target page. Motivated by this intuition, this paper investigates several alternative methods and argues that the link-context derived from the referring page's HTML tag tree provides rich guidance for keeping a crawler on a domain-specific topic. So that the crawler has enough guidance from link-context at the outset, we first find referring pages by traversing backward from the seed URLs, and then build an initial term-based feature set by parsing the link-contexts extracted from those referring pages. The feature set, which is used to measure the similarity of link-contexts found on crawled pages, is adaptively trained during crawling on link-contexts that point to relevant pages. This paper also presents several metrics and an evaluation function for ranking URLs by page relevance. A comprehensive experiment shows that this approach outperforms the Best-First and Breadth-First algorithms in both harvest rate and efficiency.
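The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact link-context rule, similarity measure, or training update, so everything here is an assumption made for illustration. The sketch climbs the tag tree from each anchor until the enclosing element holds at least `min_terms` words (a hypothetical threshold), scores the resulting link-context against a term-weight feature set with cosine similarity (an assumed measure), and reinforces the feature set with a simple additive update (an assumed rule). All names (`Node`, `TreeBuilder`, `link_contexts`, `rank_links`, `adapt`) are hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the tree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of a simplified HTML tag tree."""

    def __init__(self, tag, parent=None, href=None):
        self.tag, self.parent, self.href = tag, parent, href
        self.children = []  # child Nodes and raw text strings, in document order

    def text(self):
        parts = [c if isinstance(c, str) else c.text() for c in self.children]
        return " ".join(p for p in parts if p.strip())


class TreeBuilder(HTMLParser):
    """Builds the tag tree and remembers every <a href=...> node."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current, dict(attrs).get("href"))
        self.current.children.append(node)
        self.current = node
        if tag == "a" and node.href:
            self.anchors.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def link_contexts(html, min_terms=5):
    """Map each href to the terms of its link-context.

    Assumed rule: climb from the anchor to the first ancestor whose text
    contains at least `min_terms` terms (or to the root).
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    contexts = {}
    for anchor in builder.anchors:
        node = anchor
        while len(tokenize(node.text())) < min_terms and node.parent is not None:
            node = node.parent
        contexts[anchor.href] = tokenize(node.text())
    return contexts


def cosine(weights, tokens):
    """Cosine similarity between the feature weights and a bag of terms."""
    counts = Counter(tokens)
    dot = sum(weights.get(t, 0.0) * c for t, c in counts.items())
    norm = (math.sqrt(sum(w * w for w in weights.values()))
            * math.sqrt(sum(c * c for c in counts.values())))
    return dot / norm if norm else 0.0


def rank_links(html, weights, min_terms=5):
    """Return (href, score) pairs, most promising link first."""
    scored = [(href, cosine(weights, terms))
              for href, terms in link_contexts(html, min_terms).items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def adapt(weights, tokens, rate=0.1):
    """Assumed update: reinforce terms seen in a relevant page's link-context."""
    for term, count in Counter(tokens).items():
        weights[term] = weights.get(term, 0.0) + rate * count
    return weights
```

In a crawler built along these lines, `rank_links` would order the frontier for each fetched page, and `adapt` would be called with the link-context of any link whose target turned out to be relevant, so that later pages are scored against an updated feature set.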
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Peng, T., He, F., Zuo, W., Zhang, C. (2006). Adaptive Topical Web Crawling for Domain-Specific Resource Discovery Guided by Link-Context. In: Gelbukh, A., Reyes-Garcia, C.A. (eds) MICAI 2006: Advances in Artificial Intelligence. MICAI 2006. Lecture Notes in Computer Science, vol 4293. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11925231_92
DOI: https://doi.org/10.1007/11925231_92
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49026-5
Online ISBN: 978-3-540-49058-6
eBook Packages: Computer Science (R0)