Abstract
When users need to analyze webpages on specific topics, they generally use crawlers to acquire webpages and then filter the results for those that match their interests. In the data acquisition stage, however, users usually have customized requirements. Ordinary crawler systems are heavily resource-constrained and cannot traverse the entire internet. Search engines can satisfy these requirements, but they rely on many manual interactions. The traditional solution is to confine crawlers to a limited domain, which leads to low recall and inefficiency. To address these problems, this paper studies a focused crawler framework based on open search engines. The framework takes advantage of the information gathering and retrieval capabilities of open search engines and can automatically or semi-automatically generate a topic model that interprets and completes the user's search intent, given only a few seed keywords initially. It then uses open search engine interfaces to iteratively crawl topic-specific webpages. Compared with traditional approaches, the proposed focused crawler improves recall and efficiency while maintaining accuracy.
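The iterative loop described in the abstract (seed keywords, topic expansion, repeated queries through a search interface) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `fake_search` and `CORPUS` stand in for a real open search engine interface, and a simple co-occurrence count stands in for the paper's topic model, whose details the abstract does not give.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical in-memory "web": url -> page text (stand-in for real pages).
CORPUS: Dict[str, str] = {
    "u1": "solar panel efficiency photovoltaic",
    "u2": "solar photovoltaic cell",
    "u3": "photovoltaic inverter grid",
    "u4": "cooking pasta recipe",
}

def fake_search(query: str) -> List[Tuple[str, str]]:
    """Hypothetical search interface: return pages whose text contains the query term."""
    return [(url, text) for url, text in CORPUS.items() if query in text.split()]

def expand_topic(seeds: List[str], docs: Iterable[str], top_k: int) -> List[str]:
    """Toy topic model (an assumption): seeds plus the top_k terms that
    co-occur most often with a seed keyword in the collected pages."""
    counts: Counter = Counter()
    for text in docs:
        tokens = text.lower().split()
        if any(seed in tokens for seed in seeds):
            counts.update(t for t in tokens if t not in seeds)
    return list(seeds) + [term for term, _ in counts.most_common(top_k)]

def focused_crawl(
    seeds: List[str],
    search: Callable[[str], List[Tuple[str, str]]],
    rounds: int = 2,
    top_k: int = 1,
) -> Tuple[List[str], Dict[str, str]]:
    """Alternate between querying every not-yet-queried topic term and
    re-estimating the topic terms from the pages collected so far."""
    topic = list(seeds)
    pages: Dict[str, str] = {}
    queried = set()
    for _ in range(rounds):
        for term in topic:
            if term in queried:
                continue
            queried.add(term)
            for url, text in search(term):
                pages.setdefault(url, text)  # keep the first copy of each page
        topic = expand_topic(seeds, pages.values(), top_k)
    return topic, pages

topic, pages = focused_crawl(["solar"], fake_search)
```

In this toy run, the expanded term lets the second round reach a page that never mentions the seed keyword, which is the recall benefit the abstract attributes to topic expansion. A real system would replace `fake_search` with calls to an actual search engine interface and would need rate limiting and relevance filtering.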
This work is supported by the National Key Research and Development Program of China (No. 2016YFB0800402) and the National Natural Science Foundation of China (No. U1536201, U1705261).
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, J., Huang, Y. (2018). Focused Crawler Framework Based on Open Search Engine. In: Sun, X., Pan, Z., Bertino, E. (eds) Cloud Computing and Security. ICCCS 2018. Lecture Notes in Computer Science, vol 11065. Springer, Cham. https://doi.org/10.1007/978-3-030-00012-7_6
DOI: https://doi.org/10.1007/978-3-030-00012-7_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00011-0
Online ISBN: 978-3-030-00012-7
eBook Packages: Computer Science, Computer Science (R0)