Data Centric Text Processing Using MapReduce

  • Conference paper
Innovations in Bio-Inspired Computing and Applications

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 424)

Abstract

Processing huge volumes of data has opened new opportunities in e-commerce, engineering, business, and large computing applications. The MapReduce programming model is a parallel data processing approach for execution on computer clusters. This model provides an abstraction for designing scalable computing algorithms for big data processing. For batch processing workloads, the MapReduce model provides faster computation. However, the key/value pair generation of a MapReduce program creates memory overhead and deserialization overhead due to data redundancy. Data redundancy is one of the most important factors that consume space and affect system performance when working with large data sets. This overhead can be largely avoided by using a novel approach that we developed, named the Data Triggered Multithreaded Programming (DTMP) model. In this paper, we demonstrate the use of the DTMP model on a large dataset of author details and their publications. The DTMP model can dynamically allocate resources and can identify data repetition occurring during computation. When applied to the MapReduce programming model, DTMP improves system performance. The major contribution of this work is simple, scalable, and powerful processing of text data that enables automatic parallelization and distribution of large-scale computations.
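The redundancy described in the abstract can be illustrated with a minimal, single-process sketch. It is not the authors' DTMP implementation: the sample records, the function names, and the set-based repeat check are all assumptions made for illustration. The plain mapper emits one key/value pair per input record, including repeated records, while the second mapper skips records it has already seen, so fewer pairs reach the reducer.

```python
from collections import defaultdict

# Hypothetical sample data: (author, publication title) records.
RECORDS = [
    ("A. Author", "Paper One"),
    ("B. Writer", "Paper Two"),
    ("A. Author", "Paper One"),    # repeated record
    ("A. Author", "Paper Three"),
]


def map_phase(records):
    """Plain mapper: emit one (author, 1) pair per record, repeats included."""
    return [(author, 1) for author, _title in records]


def map_phase_skip_repeats(records):
    """Assumed variant: detect repeated records and emit no pair for them."""
    seen = set()
    pairs = []
    for record in records:
        if record in seen:
            continue  # repeated data: skip the redundant key/value pair
        seen.add(record)
        pairs.append((record[0], 1))
    return pairs


def reduce_phase(pairs):
    """Reducer: sum the values emitted for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


if __name__ == "__main__":
    print(len(map_phase(RECORDS)), "pairs without repeat detection")
    print(len(map_phase_skip_repeats(RECORDS)), "pairs with repeat detection")
    print(reduce_phase(map_phase_skip_repeats(RECORDS)))
```

In this sketch the repeat-aware mapper counts distinct publications per author. A real cluster deployment would also need to detect repeats across mappers, which this single-process example does not attempt.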

Author information

Corresponding author

Correspondence to N. Sandhya.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Sandhya, N., Samuel, P. (2016). Data Centric Text Processing Using MapReduce. In: Snášel, V., Abraham, A., Krömer, P., Pant, M., Muda, A. (eds) Innovations in Bio-Inspired Computing and Applications. Advances in Intelligent Systems and Computing, vol 424. Springer, Cham. https://doi.org/10.1007/978-3-319-28031-8_11

  • DOI: https://doi.org/10.1007/978-3-319-28031-8_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-28030-1

  • Online ISBN: 978-3-319-28031-8

  • eBook Packages: Engineering, Engineering (R0)
