Adaptive Topical Web Crawling for Domain-Specific Resource Discovery Guided by Link-Context

Peng, Tao; He, Fengling; Zuo, Wanli; Zhang, Changli

doi:10.1007/11925231_92

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4293)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

Topical web crawling is important for domain-specific resource discovery. By restricting themselves to web pages from a specific domain, topical crawlers achieve good recall as well as good precision. Intuitively, the text surrounding a link on an HTML page, its link-context, is a good summary of the target page. Motivated by this intuition, this paper investigates several alternative methods and argues that the link-context derived from the referring page's HTML tag tree provides rich guidance for keeping a crawler on a domain-specific topic. So that the crawler can obtain enough guidance from link-context, we first find referring pages by traversing backward from the seed URLs, and then build an initial term-based feature set by parsing the link-contexts extracted from those referring pages. The feature set is used to measure the similarity of link-contexts found on crawled pages, and it is adaptively trained during crawling on link-contexts that point to relevant pages. The paper also presents several metrics and an evaluation function for ranking URLs by page relevance. Experiments show that this approach outperforms the Best-First and Breadth-First algorithms in both harvest rate and efficiency.
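To make the idea in the abstract concrete, the following is a minimal Python sketch of ranking outgoing links by the similarity of their link-context to a term-based feature set that can be updated during the crawl. It is not the authors' implementation. Everything in it is an assumption made for illustration: the names `LinkContextParser`, `FeatureSet`, and `rank_links` are hypothetical, the link-context is approximated as the text of the nearest enclosing block element (`p`, `li`, `td`, `div`) rather than the tag-tree derivation the paper uses, and plain term counts with cosine similarity stand in for the paper's metrics and evaluation function.

```python
import heapq
import math
import re
from collections import Counter
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collects (href, context) pairs. The context is the text of the nearest
    enclosing block element, a simplification of tag-tree link-context."""

    BLOCK_TAGS = {"p", "li", "td", "div"}  # assumed context boundaries

    def __init__(self):
        super().__init__()
        self.text = []    # text pieces seen in the current block
        self.hrefs = []   # links seen in the current block
        self.links = []   # finished (href, context) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self.text.append(data)

    def _flush(self):
        # Attach the block's normalized text to every link found inside it.
        context = " ".join(" ".join(self.text).split())
        self.links.extend((href, context) for href in self.hrefs)
        self.text, self.hrefs = [], []

    def close(self):
        super().close()
        self._flush()


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class FeatureSet:
    """Term-count feature set built from seed link-contexts."""

    def __init__(self, seed_contexts):
        self.weights = Counter()
        for context in seed_contexts:
            self.update(context)

    def update(self, context):
        # Adaptive step: fold in the link-context of a page judged relevant.
        self.weights.update(tokenize(context))

    def score(self, context):
        return cosine(Counter(tokenize(context)), self.weights)


def rank_links(html, features):
    """Returns the hrefs in html, best-scoring link-context first."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    frontier = []  # min-heap on negated score acts as a priority queue
    for href, context in parser.links:
        heapq.heappush(frontier, (-features.score(context), href))
    return [heapq.heappop(frontier)[1] for _ in range(len(frontier))]


if __name__ == "__main__":
    page = (
        '<ul><li>Focused web crawler tutorial <a href="/crawl">read more</a></li>'
        '<li>Cheap holiday flights <a href="/travel">book now</a></li></ul>'
    )
    features = FeatureSet(["topical web crawler link context"])
    print(rank_links(page, features))  # ['/crawl', '/travel']
```

In a full crawler the ranked hrefs would feed a persistent frontier, and `FeatureSet.update` would be called whenever a fetched page is judged relevant, so that later link-contexts are scored against a feature set that tracks the topic as it is actually encountered.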

Author information

Authors and Affiliations

College of Computer Science and Technology, Jilin University, Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, Changchun, 130012, China
Tao Peng, Fengling He, Wanli Zuo & Changli Zhang

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, 07738, Mexico City, México
Alexander Gelbukh
Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Luis Enrique Erro No. 1, Sta. Ma. Tonanzintla, 72840, Puebla, México
Carlos Alberto Reyes-Garcia

About this paper

Cite this paper

Peng, T., He, F., Zuo, W., Zhang, C. (2006). Adaptive Topical Web Crawling for Domain-Specific Resource Discovery Guided by Link-Context. In: Gelbukh, A., Reyes-Garcia, C.A. (eds) MICAI 2006: Advances in Artificial Intelligence. MICAI 2006. Lecture Notes in Computer Science, vol 4293. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11925231_92

DOI: https://doi.org/10.1007/11925231_92
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49026-5
Online ISBN: 978-3-540-49058-6
eBook Packages: Computer Science, Computer Science (R0)
