Abstract
Extracting and mining social network information from massive Web data is of both theoretical and practical significance. However, a defining feature of this task is large-scale data processing, which remains a major challenge. MapReduce is a distributed programming model: by implementing just two functions, map and reduce, a developer can run a task in a distributed fashion. Nevertheless, the model does not directly support processing heterogeneous datasets, which are common on the Web. This article proposes a framework, called Map-Reduce-Merge, that extends the original MapReduce framework with a merge phase to handle heterogeneous data efficiently. In addition, several optimizations and improvements are made based on the features of Web data.
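The idea of adding a merge phase after map and reduce can be illustrated with a minimal single-process sketch. It is not the framework proposed in the paper: the function names (`map_reduce`, `merge`), the sample datasets, and the choice of an inner join on the key are all assumptions made for illustration. Two differently shaped datasets are each reduced to per-user counts, and the merge step then combines the two reduced outputs by key.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a map phase and a reduce phase over one dataset, in memory."""
    groups = defaultdict(list)
    for record in records:
        # The mapper emits zero or more (key, value) pairs per record.
        for key, value in mapper(record):
            groups[key].append(value)
    # The reducer folds all values collected for a key into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def merge(left, right, merger):
    """Hypothetical merge phase: combine two reduced outputs on shared keys."""
    shared_keys = left.keys() & right.keys()
    return {key: merger(key, left[key], right[key]) for key in shared_keys}


# Two heterogeneous datasets (illustrative data only).
follows = [("alice", "bob"), ("alice", "carol"), ("bob", "alice")]
posts = [
    {"author": "alice", "text": "hello"},
    {"author": "bob", "text": "hi"},
    {"author": "bob", "text": "again"},
]

# Each dataset gets its own map and reduce functions.
follow_counts = map_reduce(
    follows,
    mapper=lambda edge: [(edge[0], 1)],
    reducer=lambda user, ones: sum(ones),
)
post_counts = map_reduce(
    posts,
    mapper=lambda post: [(post["author"], 1)],
    reducer=lambda user, ones: sum(ones),
)

# The merge phase joins the two reduced outputs by user.
profiles = merge(
    follow_counts,
    post_counts,
    merger=lambda user, following, n_posts: {"following": following, "posts": n_posts},
)
```

In this sketch the merge behaves like an inner join, so a user who appears in only one dataset is dropped. A real distributed implementation would also need to partition and shuffle both reduced outputs so that matching keys meet on the same node, which this sketch does not model.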
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Yin, C., Zhang, S., Liu, S., Song, S., Gao, G., Zhou, X. (2015). The Optimization and Improvement of MapReduce in Web Data Mining. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9528. Springer, Cham. https://doi.org/10.1007/978-3-319-27119-4_53
DOI: https://doi.org/10.1007/978-3-319-27119-4_53
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27118-7
Online ISBN: 978-3-319-27119-4
eBook Packages: Computer Science, Computer Science (R0)