Programming Platforms for Big Data Analysis

Cao, Jiannong; Chawla, Shailey; Wang, Yuqi; Wu, Hanqing

doi:10.1007/978-3-319-49340-4_3

Abstract

Big data analysis imposes new challenges and requirements on programming support. Programming platforms need to provide new abstractions and runtime techniques with key features such as scalability, fault tolerance, efficient task distribution, usability, and processing speed. In this chapter, we first provide a comprehensive survey of these requirements, then give an overview of existing big data programming platforms and classify them along several dimensions. Next, we present the architecture, methodology, and features of major programming platforms such as MapReduce, Storm, Spark, Pregel, and GraphLab. Finally, we compare existing big data platforms, discuss the need for a unifying framework, present our proposed framework, MatrixMap, and outline a vision for future work.

Acknowledgements

This work was partially supported by the funding for Project of Strategic Importance provided by The Hong Kong Polytechnic University (1-ZE26) and HK RGC under GRF Grant (PolyU 5104/13E).

Author information

Authors and Affiliations

Department of Computing, Hong Kong Polytechnic University, King’s Park, Hong Kong
Jiannong Cao, Shailey Chawla, Yuqi Wang & Hanqing Wu

Corresponding author

Correspondence to Jiannong Cao.

Editor information

Editors and Affiliations

School of Information Technologies, The University of Sydney, Sydney, New South Wales, Australia
Albert Y. Zomaya
The School of Computer Science, The University of New South Wales, Eveleigh, New South Wales, Australia
Sherif Sakr

About this chapter

Cite this chapter

Cao, J., Chawla, S., Wang, Y., Wu, H. (2017). Programming Platforms for Big Data Analysis. In: Zomaya, A., Sakr, S. (eds) Handbook of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-49340-4_3

DOI: https://doi.org/10.1007/978-3-319-49340-4_3
Published: 26 February 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49339-8
Online ISBN: 978-3-319-49340-4
eBook Packages: Computer Science, Computer Science (R0)