The Optimization and Improvement of MapReduce in Web Data Mining

Yin, Changqing; Zhang, Shichao; Liu, Shukun; Song, Shangwei; Gao, Guangyu; Zhou, Xiyuan

doi:10.1007/978-3-319-27119-4_53

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9528)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

Extracting and mining social network information from massive Web data is of both theoretical and practical significance. However, a defining feature of this task is large-scale data processing, which remains a major challenge. MapReduce is a distributed programming model: by implementing just two functions, map and reduce, a developer can run a distributed task. Nevertheless, the model does not directly support the processing of heterogeneous datasets, which are common on the Web. This article proposes a framework, called Map-Reduce-Merge, that extends the original MapReduce framework with a merge phase that can efficiently handle heterogeneous data processing. In addition, several optimizations and improvements are made based on the features of Web data.
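
The abstract does not show the framework's interfaces, so the following is only a minimal in-memory sketch of the general Map-Reduce-Merge idea, not the authors' implementation. It assumes two heterogeneous inputs that are each processed by their own map and reduce functions and then joined on a shared key in a merge step. The helper names (`map_reduce`, `merge`) and the sample data are hypothetical.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one in-memory map/reduce pass and return {key: reduced_value}."""
    groups = defaultdict(list)
    for record in records:
        # The mapper emits zero or more (key, value) pairs per record.
        for key, value in mapper(record):
            groups[key].append(value)
    # The reducer folds all values that share a key into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def merge(left, right, merger):
    """Join two reduced outputs on the keys present in both (assumed inner join)."""
    shared = sorted(left.keys() & right.keys())
    return {key: merger(key, left[key], right[key]) for key in shared}


# Hypothetical heterogeneous inputs: tuples of (user, page) and text lines of links.
page_visits = [("alice", "/home"), ("bob", "/home"), ("alice", "/news")]
friend_links = ["alice bob", "alice carol", "bob carol"]

# Each dataset gets its own mapper and reducer because the record shapes differ.
visits = map_reduce(page_visits,
                    lambda rec: [(rec[0], 1)],
                    lambda key, values: sum(values))
friends = map_reduce(friend_links,
                     lambda line: [(user, 1) for user in line.split()],
                     lambda key, values: sum(values))

# The merge phase combines the two reduced outputs per user.
profile = merge(visits, friends,
                lambda key, v, f: {"visits": v, "friends": f})
print(profile)
```

In this sketch the merge step is an inner join, so a user who appears in only one dataset is dropped; a real framework might offer other join policies.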

Author information

Authors and Affiliations

School of Software Engineering, Tongji University, Shanghai, 201800, China
Changqing Yin, Shichao Zhang, Shukun Liu, Guangyu Gao & Xiyuan Zhou
College of Design and Innovation, Tongji University, Shanghai, 201800, China
Shangwei Song

Corresponding author

Correspondence to Changqing Yin.

Editor information

Editors and Affiliations

Central South University, Changsha, China
Guojun Wang
The University of Sydney, Sydney, New South Wales, Australia
Albert Zomaya
University of Murcia, Murcia, Murcia, Spain
Gregorio Martinez
Hunan University, Changsha, China
Kenli Li

About this paper

Cite this paper

Yin, C., Zhang, S., Liu, S., Song, S., Gao, G., Zhou, X. (2015). The Optimization and Improvement of MapReduce in Web Data Mining. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9528. Springer, Cham. https://doi.org/10.1007/978-3-319-27119-4_53

DOI: https://doi.org/10.1007/978-3-319-27119-4_53
Published: 16 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27118-7
Online ISBN: 978-3-319-27119-4
eBook Packages: Computer Science, Computer Science (R0)
