Abstract
Web crawlers are widely used in Chinese information processing: by crawling data from the domains relevant to a given problem, they supply the basis for subsequent processing. The traditional multi-threaded model has clear limitations under high concurrency and large numbers of blocking I/O operations. To address these problems, this paper proposes a solution based on the coroutine model. We discuss the basic principles and implementation methods of coroutines in detail and then give a complete implementation of a coroutine-based web crawler. Experimental results show that our scheme effectively reduces system load and improves crawling efficiency.
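The abstract does not reproduce the paper's implementation, so the following is only a minimal sketch of the general idea, assuming Python's `asyncio` coroutines. The names `FAKE_WEB`, `fetch_links`, `worker`, and `crawl` are hypothetical, and the network request is simulated with `asyncio.sleep` over an in-memory link graph so the example runs offline. It illustrates how one thread can keep many blocking I/O operations in flight by suspending a coroutine at each `await`.

```python
import asyncio

# Hypothetical in-memory "web" (an assumption for illustration):
# maps each page URL to the links found on that page.
FAKE_WEB = {
    "http://example.test/": ["http://example.test/a", "http://example.test/b"],
    "http://example.test/a": ["http://example.test/c"],
    "http://example.test/b": ["http://example.test/c"],
    "http://example.test/c": [],
}


async def fetch_links(url, delay=0.05):
    """Stand-in for an HTTP request plus link extraction.

    Awaiting the sleep suspends this coroutine, so other coroutines
    can run on the same thread while the simulated I/O is pending.
    """
    await asyncio.sleep(delay)
    return FAKE_WEB.get(url, [])


async def worker(queue, seen):
    """Take URLs from the queue, fetch them, and enqueue unseen links."""
    while True:
        url = await queue.get()
        try:
            for link in await fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.put_nowait(link)
        finally:
            # Mark the URL as processed so queue.join() can return.
            queue.task_done()


async def crawl(seed, concurrency=4):
    """Crawl from the seed URL with a fixed number of worker coroutines."""
    queue = asyncio.Queue()
    seen = {seed}
    queue.put_nowait(seed)
    workers = [asyncio.create_task(worker(queue, seen)) for _ in range(concurrency)]
    await queue.join()  # wait until every queued URL has been processed
    for task in workers:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return seen


if __name__ == "__main__":
    print(sorted(asyncio.run(crawl("http://example.test/"))))
```

In a real crawler, `fetch_links` would be replaced by an asynchronous HTTP client call followed by HTML parsing. The `concurrency` parameter bounds how many requests are outstanding at once without creating one operating-system thread per request.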
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Ding, R., Wang, M. (2018). Design and Implementation of Web Crawler Based on Coroutine Model. In: Sun, X., Pan, Z., Bertino, E. (eds) Cloud Computing and Security. ICCCS 2018. Lecture Notes in Computer Science, vol 11063. Springer, Cham. https://doi.org/10.1007/978-3-030-00006-6_39
DOI: https://doi.org/10.1007/978-3-030-00006-6_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00005-9
Online ISBN: 978-3-030-00006-6
eBook Packages: Computer Science, Computer Science (R0)