
A Comparative Survey of Big Data Computing and HPC: From a Parallel Programming Model to a Cluster Architecture

International Journal of Parallel Programming

Abstract

With the rapid growth of artificial intelligence (AI), the Internet of Things (IoT), and big data, emerging applications that span stacks built on different techniques bring new challenges to parallel computing systems. Such cross-stack functionality requires a single system to possess multiple characteristics, such as the ability to process data with high throughput and low latency, the ability to carry out iterative and incremental computation, transparent fault tolerance, and the ability to perform heterogeneous tasks that evolve dynamically. However, neither high-performance computing (HPC) nor big data computing, the two main categories of parallel computing architecture, meets all of these requirements on its own. Therefore, by comparing HPC and big data computing at the parallel programming model layer, the middleware layer, and the infrastructure layer, we explore the design principles of the two architectures and discuss a converged architecture that addresses the above challenges.
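
As a rough illustration of the programming-model contrast at the heart of this comparison, the sketch below (not taken from the article) expresses simple aggregations in the two styles: an SPMD, message-passing style typical of HPC, here written against the mpi4py binding to MPI, and a dataflow map/reduce style typical of big data frameworks, here in plain Python. The word-count and partial-sum workloads and the choice of mpi4py are assumptions made purely for illustration.

```python
"""Illustrative sketch: big-data-style map/reduce vs. HPC-style SPMD.

Part 1 runs with any Python interpreter; Part 2 assumes mpi4py is installed
and is launched with e.g. `mpirun -np 4 python sketch.py`.
"""
from collections import Counter
from functools import reduce


def map_phase(record: str) -> Counter:
    # map: emit (word, 1) pairs for one input record
    return Counter(record.split())


def reduce_phase(partials) -> Counter:
    # reduce: merge per-record counts by key; the "shuffle" is implicit here,
    # whereas a real framework (MapReduce, Spark, Flink) performs it across nodes
    return reduce(lambda a, b: a + b, partials, Counter())


if __name__ == "__main__":
    # Part 1: dataflow map/reduce word count over a toy, pre-partitioned dataset
    docs = ["big data computing", "high performance computing", "big data"]
    print(reduce_phase(map_phase(d) for d in docs))

    # Part 2: SPMD partial sum with an explicit allreduce collective
    try:
        from mpi4py import MPI  # optional dependency; skipped if unavailable

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
        local = sum(range(rank, 1_000_000, size))  # each rank owns a strided slice
        total = comm.allreduce(local, op=MPI.SUM)  # explicit collective communication
        if rank == 0:
            print("allreduce total:", total)
    except ImportError:
        pass
```

The MPI version makes data placement and communication explicit and leaves fault tolerance largely to the application, while the map/reduce version delegates partitioning, shuffling, and recovery to the framework; this trade-off is what the layer-by-layer comparison examines.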

Author information

Corresponding author

Correspondence to Fei Yin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Yin, F., Shi, F. A Comparative Survey of Big Data Computing and HPC: From a Parallel Programming Model to a Cluster Architecture. Int J Parallel Prog 50, 27–64 (2022). https://doi.org/10.1007/s10766-021-00717-y
