Abstract
The shark-search algorithm is a classical content-based algorithm for theme crawling. However, its crawling scope is limited; in particular, it suffers from the viscousness phenomenon. To address this shortcoming, this paper proposes an improved shark-search algorithm that combines a URL-analysis algorithm with a host-control strategy, which takes into account how frequently each host has been accessed. Experimental results show that the proposed algorithm overcomes the shortcomings of the original shark-search algorithm and improves the efficiency of a theme crawler.
This research is supported in part by the Youth Fund Project of Humanities and Social Sciences Research from the Chinese Ministry of Education (No. 12YJCZH201) and the National Natural Science Fund (No. 61103101).
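The abstract does not give the scoring formula, so the following is only a minimal sketch of what a host-control strategy could look like, not the authors' method. It assumes a crawl frontier in which each URL's relevance score is damped by the number of pages already fetched from the same host. The class name `HostAwareFrontier`, the `host_penalty` parameter, and the damping form `relevance / (1 + host_penalty * visits)` are all hypothetical.

```python
import heapq
from collections import Counter
from urllib.parse import urlparse


class HostAwareFrontier:
    """Crawl frontier whose URL scores shrink as their host is visited more.

    Illustrative only: the damping formula and names are assumptions,
    not taken from the paper.
    """

    def __init__(self, host_penalty=0.1):
        # host_penalty is a hypothetical tuning knob.
        self.host_penalty = host_penalty
        self.host_visits = Counter()  # host -> pages already fetched
        self._heap = []               # entries: (-score, url, relevance)
        self._seen = set()

    def adjusted_score(self, url, relevance):
        # Assumed damping: relevance / (1 + penalty * visits to this host).
        host = urlparse(url).netloc
        return relevance / (1.0 + self.host_penalty * self.host_visits[host])

    def push(self, url, relevance):
        """Queue a URL once, with its content-based relevance score."""
        if url in self._seen:
            return
        self._seen.add(url)
        score = self.adjusted_score(url, relevance)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, url, relevance))

    def pop(self):
        """Return (url, score) for the best URL, or None if empty."""
        while self._heap:
            neg_stored, url, relevance = heapq.heappop(self._heap)
            current = self.adjusted_score(url, relevance)
            if current < -neg_stored:
                # The host was visited since this entry was queued:
                # re-queue it with its lower, up-to-date score.
                heapq.heappush(self._heap, (-current, url, relevance))
                continue
            self.host_visits[urlparse(url).netloc] += 1
            return url, current
        return None
```

With `host_penalty=0.1`, pushing `http://a.example/1` (relevance 0.9), `http://a.example/2` (0.8), and `http://b.example/1` (0.75) yields the pop order `a.example/1`, `b.example/1`, `a.example/2`: after one fetch from `a.example`, its second page is damped below the page on the unvisited host, which keeps the crawler from sticking to a single host. Scores only ever decrease as visits accumulate, which is why lazily re-scoring entries at pop time is sufficient here.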
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Luo, L., Wang, Rb., Huang, Xx., Chen, Zq. (2012). A Novel Shark-Search Algorithm for Theme Crawler. In: Wang, F.L., Lei, J., Gong, Z., Luo, X. (eds) Web Information Systems and Mining. WISM 2012. Lecture Notes in Computer Science, vol 7529. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33469-6_75
DOI: https://doi.org/10.1007/978-3-642-33469-6_75
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33468-9
Online ISBN: 978-3-642-33469-6
eBook Packages: Computer Science, Computer Science (R0)