ChainMR Crawler: A Distributed Vertical Crawler Based on MapReduce

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 10067)

Abstract

With the explosive growth of data on the Internet, a stand-alone vertical crawler can no longer meet high-performance requirements, and existing distributed vertical crawlers offer only weak customization. To address these problems, this paper proposes a distributed vertical crawler named ChainMR Crawler. We adopt the ChainMapper/ChainReducer model to design each module of the crawler, use Redis to manage URLs, and choose the distributed database HBase to store the key content of web pages. Experimental results demonstrate that ChainMR Crawler is 6% more efficient than Nutch for vertical crawling, which achieves the expected effect.
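As a rough illustration of the chaining idea only, the sketch below imitates in plain Python the pattern that Hadoop's ChainMapper/ChainReducer classes support: one or more mappers, then a single reducer, then zero or more mappers applied to the reducer output. It is not the paper's implementation. The stage functions (`fetch_page`, `extract_links`, `dedupe_links`, `make_unseen_filter`), the `FAKE_WEB` dictionary, and the example URLs are all hypothetical, and an in-memory `set` stands in for the Redis-managed URL store.

```python
from collections import defaultdict

# Hypothetical stand-in for the network: URL -> space-separated outgoing links.
FAKE_WEB = {
    "http://a.example/": "http://b.example/ http://c.example/",
    "http://b.example/": "http://c.example/ http://a.example/",
}


def run_chain(records, mappers):
    """Pipe every (key, value) record through the mappers, in order."""
    for mapper in mappers:
        records = [out for key, value in records for out in mapper(key, value)]
    return records


def run_job(records, pre_mappers, reducer, post_mappers):
    """Run the pattern: mappers -> group by key -> reducer -> mappers."""
    mapped = run_chain(records, pre_mappers)
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    reduced = [out for key in sorted(groups) for out in reducer(key, groups[key])]
    return run_chain(reduced, post_mappers)


def fetch_page(url, _value):
    """Mapper 1 (hypothetical): look up the page body for a URL."""
    body = FAKE_WEB.get(url)
    if body is not None:
        yield url, body


def extract_links(url, body):
    """Mapper 2 (hypothetical): emit (link, source_url) for each outgoing link."""
    for link in body.split():
        yield link, url


def dedupe_links(link, sources):
    """Reducer (hypothetical): collapse duplicate links, keep a source count."""
    yield link, str(len(sources))


def make_unseen_filter(seen):
    """Post-reduce mapper factory; `seen` is a set standing in for Redis."""
    def keep_unseen(link, count):
        if link not in seen:
            seen.add(link)
            yield link, count
    return keep_unseen


seeds = [("http://a.example/", ""), ("http://b.example/", "")]
seen = {url for url, _ in seeds}
result = run_job(seeds, [fetch_page, extract_links], dedupe_links,
                 [make_unseen_filter(seen)])
print(result)
```

Because every stage shares the same (key, value) in, records out shape, a stage can be swapped or added without touching the others, which is the kind of per-module customization the chained design is meant to allow.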

Acknowledgments

This work is supported by NSFC (Grant Nos. 61300181, 61502044) and the Fundamental Research Funds for the Central Universities (Grant No. 2015RC23).

Author information

Corresponding author

Correspondence to Xixia Liu.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Liu, X., Jin, Z. (2016). ChainMR Crawler: A Distributed Vertical Crawler Based on MapReduce. In: Wang, G., Ray, I., Alcaraz Calero, J., Thampi, S. (eds) Security, Privacy and Anonymity in Computation, Communication and Storage. SpaCCS 2016. Lecture Notes in Computer Science, vol 10067. Springer, Cham. https://doi.org/10.1007/978-3-319-49145-5_4

  • DOI: https://doi.org/10.1007/978-3-319-49145-5_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49144-8

  • Online ISBN: 978-3-319-49145-5

  • eBook Packages: Computer Science (R0)