
A Comparative Survey of Big Data Computing and HPC: From a Parallel Programming Model to a Cluster Architecture

International Journal of Parallel Programming

Abstract

With the rapid growth of artificial intelligence (AI), the Internet of Things (IoT), and big data, emerging applications that span stacks built on different techniques bring new challenges to parallel computing systems. Such cross-stack functionality requires a single system to possess multiple characteristics, such as the ability to process data with high throughput and low latency, the ability to carry out iterative and incremental computation, transparent fault tolerance, and the ability to perform heterogeneous tasks that evolve dynamically. However, neither high-performance computing (HPC) nor big data computing, the two main categories of parallel computing architecture, meets all of these requirements on its own. Therefore, by comparing HPC and big data computing at the parallel programming model layer, the middleware layer, and the infrastructure layer, we explore the design principles of the two architectures and discuss a converged architecture that addresses the above challenges.
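
As a rough illustration of the programming-model contrast at the heart of this comparison, the sketch below (not taken from the article) expresses simple aggregations in the two styles: an SPMD, message-passing style typical of HPC, here written against the mpi4py binding to MPI, and a dataflow map/reduce style typical of big data frameworks, here in plain Python. The word-count and partial-sum workloads and the choice of mpi4py are assumptions made purely for illustration.

```python
"""Illustrative sketch: big-data-style map/reduce vs. HPC-style SPMD.

Part 1 runs with any Python interpreter; Part 2 assumes mpi4py is installed
and is launched with e.g. `mpirun -np 4 python sketch.py`.
"""
from collections import Counter
from functools import reduce


def map_phase(record: str) -> Counter:
    # map: emit (word, 1) pairs for one input record
    return Counter(record.split())


def reduce_phase(partials) -> Counter:
    # reduce: merge per-record counts by key; the "shuffle" is implicit here,
    # whereas a real framework (MapReduce, Spark, Flink) performs it across nodes
    return reduce(lambda a, b: a + b, partials, Counter())


if __name__ == "__main__":
    # Part 1: dataflow map/reduce word count over a toy, pre-partitioned dataset
    docs = ["big data computing", "high performance computing", "big data"]
    print(reduce_phase(map_phase(d) for d in docs))

    # Part 2: SPMD partial sum with an explicit allreduce collective
    try:
        from mpi4py import MPI  # optional dependency; skipped if unavailable

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
        local = sum(range(rank, 1_000_000, size))  # each rank owns a strided slice
        total = comm.allreduce(local, op=MPI.SUM)  # explicit collective communication
        if rank == 0:
            print("allreduce total:", total)
    except ImportError:
        pass
```

The MPI version makes data placement and communication explicit and leaves fault tolerance largely to the application, while the map/reduce version delegates partitioning, shuffling, and recovery to the framework; this trade-off is what the layer-by-layer comparison examines.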

Author information

Corresponding author

Correspondence to Fei Yin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Yin, F., Shi, F. A Comparative Survey of Big Data Computing and HPC: From a Parallel Programming Model to a Cluster Architecture. Int J Parallel Prog 50, 27–64 (2022). https://doi.org/10.1007/s10766-021-00717-y
