Abstract
With the explosive growth of data on the Internet, a single-node vertical crawler can no longer meet the performance demands placed on it, and existing distributed vertical crawlers offer only weak customization. To address both problems, this paper proposes a distributed vertical crawler named ChainMR Crawler. We design each module of the crawler with the ChainMapper/ChainReducer model, use Redis to manage URLs, and store the key content of web pages in the distributed database HBase. Experimental results show that, for vertical crawling, ChainMR Crawler is 6% more efficient than Nutch, which meets the expected goal.
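The abstract names the ChainMapper/ChainReducer model and a URL store without showing either. As a rough illustration only, the sketch below imitates the chaining pattern (several map stages, one reduce, optional map stages after the reduce) in plain Python, and uses an in-memory class as a stand-in for a Redis-backed URL frontier. Every name here (`run_chain`, `normalize`, `drop_empty`, `keep_longest`, `extract_title`, `UrlFrontier`) is hypothetical and is not taken from the paper; it is not the authors' implementation and does not use Hadoop, Redis, or HBase.

```python
from collections import defaultdict, deque
from urllib.parse import urldefrag


def run_chain(records, mappers, reducer, post_mappers=()):
    """Run (key, value) records through map stages, one reduce, then more map stages."""
    # Map phase: each stage consumes the previous stage's output.
    for mapper in mappers:
        records = [out for key, value in records for out in mapper(key, value)]
    # Shuffle: group values by key, as happens between map and reduce.
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    reduced = [out for key in sorted(groups) for out in reducer(key, groups[key])]
    # Optional map stages chained after the reducer.
    for mapper in post_mappers:
        reduced = [out for key, value in reduced for out in mapper(key, value)]
    return reduced


# Hypothetical crawler stages; a real system would plug in its own.
def normalize(url, html):
    """Strip the #fragment so equivalent URLs share one key."""
    yield urldefrag(url).url, html


def drop_empty(url, html):
    """Filter out pages with no content."""
    if html.strip():
        yield url, html


def keep_longest(url, pages):
    """Reduce duplicate fetches of one URL to the longest copy."""
    yield url, max(pages, key=len)


def extract_title(url, html):
    """Pull the <title> text as the 'key content' of the page."""
    start, end = html.find("<title>"), html.find("</title>")
    yield url, html[start + len("<title>"):end] if 0 <= start < end else ""


class UrlFrontier:
    """In-memory stand-in for a URL store: a seen-set plus a FIFO queue."""

    def __init__(self):
        self._seen = set()
        self._queue = deque()

    def add(self, url):
        """Enqueue url unless it was seen before; return True if enqueued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def pop(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None
```

Calling `run_chain(pages, [normalize, drop_empty], keep_longest, [extract_title])` on a list of `(url, html)` pairs runs the stages in that order. Swapping, adding, or removing a stage changes the pipeline without touching the others, which is the kind of customization a chained design is meant to make easy.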
Acknowledgments
This work is supported by the NSFC (Grant Nos. 61300181, 61502044) and the Fundamental Research Funds for the Central Universities (Grant No. 2015RC23).
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Liu, X., Jin, Z. (2016). ChainMR Crawler: A Distributed Vertical Crawler Based on MapReduce. In: Wang, G., Ray, I., Alcaraz Calero, J., Thampi, S. (eds) Security, Privacy and Anonymity in Computation, Communication and Storage. SpaCCS 2016. Lecture Notes in Computer Science, vol 10067. Springer, Cham. https://doi.org/10.1007/978-3-319-49145-5_4
DOI: https://doi.org/10.1007/978-3-319-49145-5_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49144-8
Online ISBN: 978-3-319-49145-5
eBook Packages: Computer Science (R0)