Data Centric Text Processing Using MapReduce

Sandhya, N.; Samuel, Philip

doi:10.1007/978-3-319-28031-8_11

N. Sandhya & Philip Samuel

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 424)

Abstract

Processing huge volumes of data has opened new opportunities in e-commerce, engineering, business, and large computing applications. The MapReduce programming model is a parallel data processing approach for execution on computer clusters. It provides an abstraction for designing scalable algorithms for big data processing, and for batch-style workloads it provides faster computation. However, the key/value pair generation of a MapReduce program creates memory overhead and deserialization overhead due to data redundancy. Data redundancy is one of the most important factors that consume space and affect system performance when working with large data sets. This overhead can be largely avoided by using a novel approach that we developed, named the Data Triggered Multithreaded Programming (DTMP) model. In this paper, we demonstrate the use of the DTMP model on a large dataset of author details and their publications. DTMP can dynamically allocate resources and can identify data repetition occurring during computation. Applying the DTMP model to the MapReduce programming model improves system performance. The major contribution of this work is a simple, scalable, and powerful approach to processing text data that enables automatic parallelization and distribution of large-scale computations.
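To make the redundancy in key/value pair generation concrete, the following is a minimal Python sketch, not the DTMP model itself, whose details the abstract does not give. It uses in-mapper local aggregation, a standard MapReduce technique, purely as an illustration: a plain mapper emits one pair per record, so a repeated key produces repeated pairs, while an aggregating mapper emits one pair per distinct key. The records, author names, and function names are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict

# Hypothetical (author, publication) records; not data from the paper.
RECORDS = [
    ("Author A", "Title 1"),
    ("Author B", "Title 2"),
    ("Author A", "Title 3"),
    ("Author A", "Title 4"),
]


def map_plain(records):
    """Emit one (author, 1) pair per record, so repeated keys yield repeated pairs."""
    return [(author, 1) for author, _title in records]


def map_with_local_aggregation(records):
    """Aggregate counts inside the mapper and emit one pair per distinct author."""
    counts = Counter(author for author, _title in records)
    return list(counts.items())


def reduce_counts(pairs):
    """Sum the values for each key, as a reducer would after the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


plain = map_plain(RECORDS)
combined = map_with_local_aggregation(RECORDS)

# Number of intermediate pairs produced by each mapper.
print(len(plain), len(combined))
# Both mappers lead to the same reduced publication counts.
print(reduce_counts(plain) == reduce_counts(combined))
```

Both mappers yield the same publication count per author after the reduce step; the aggregating mapper simply produces fewer intermediate pairs to store and deserialize. How DTMP detects repeated data and allocates resources is described in the full chapter, not here.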

Author information

Authors and Affiliations

Information Technology, SOE, Cochin University of Science and Technology, Kochi, 682022, India
N. Sandhya & Philip Samuel

Corresponding author

Correspondence to N. Sandhya.

Editor information

Editors and Affiliations

Dep. of Computer Science, VŠB – Technical Univ. of Ostrava, Ostrava, Czech Republic
Václav Snášel
(MIR Labs), Scientific Net Innov & Res Excel, Auburn, Washington, USA
Ajit Abraham
Faculty of Elec. Eng. & Comp. Sci., VŠB - Technical University of Ostrava, Ostrava-Poruba, Czech Republic
Pavel Krömer
Department of Paper Technology, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, India
Millie Pant
Fac of Info & Comm, Comp Inte & Tech Lab, Universiti Teknikal Malaysia Melaka, Durian Tunggal, Malaysia
Azah Kamilah Muda

About this paper

Cite this paper

Sandhya, N., Samuel, P. (2016). Data Centric Text Processing Using MapReduce. In: Snášel, V., Abraham, A., Krömer, P., Pant, M., Muda, A. (eds) Innovations in Bio-Inspired Computing and Applications. Advances in Intelligent Systems and Computing, vol 424. Springer, Cham. https://doi.org/10.1007/978-3-319-28031-8_11

DOI: https://doi.org/10.1007/978-3-319-28031-8_11
Published: 15 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28030-1
Online ISBN: 978-3-319-28031-8
eBook Packages: Engineering, Engineering (R0)
