Auto-scaling for real-time stream analytics on HPC cloud

Cheng, Yingchao; Hao, Zhifeng; Cai, Ruichu

doi:10.1007/s11761-019-00262-0

Original Research Paper
Published: 01 June 2019

Volume 13, pages 169–183, (2019)

Service Oriented Computing and Applications

Abstract

Today's cyber world generates streaming data at very high volume. As 5G technology spreads, streaming Big Data grows larger still, and it must be analyzed in real time. We propose a new strategy, HPC2-ARS, to enable streaming services on HPC platforms. The strategy comprises a three-tier high-performance cloud computing (HPC2) platform and a novel autonomous resource-scheduling (ARS) framework. The HPC2 platform is our de facto base platform for research on streaming services. It has three components: the Tianhe-2 high-performance computer, custom OpenStack cloud computing software, and the Apache Storm stream data analytics system. The ARS framework ensures real-time response on unpredictable and fluctuating streams, especially streaming Big Data in the 5G era. The strategy addresses an essential problem in the convergence of HPC cloud, Big Data, and streaming services. Specifically, the ARS framework provides theoretical and practical solutions for resource provisioning, placement, and scheduling optimization. Extensive experiments have validated the effectiveness of the proposed strategy.
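The abstract does not say how ARS sizes resources, so the following is only an illustrative sketch of what a latency-driven provisioning rule can look like, not the authors' method. It assumes each stream operator behaves like an M/M/k queue (Poisson arrivals, exponential service times, k identical executors), which is a textbook simplification. The function names `erlang_c`, `mean_sojourn_time`, and `executors_needed`, and all parameters, are hypothetical.

```python
import math
from typing import Optional


def erlang_c(k: int, offered_load: float) -> float:
    """Probability that an arriving tuple has to queue in an M/M/k system.

    offered_load is arrival_rate / service_rate (in Erlangs).
    """
    if offered_load >= k:
        return 1.0  # unstable queue: every arrival waits
    # Erlang B via the standard recurrence, which avoids large factorials.
    blocking = 1.0
    for n in range(1, k + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    # Convert Erlang B to Erlang C.
    return k * blocking / (k - offered_load * (1.0 - blocking))


def mean_sojourn_time(k: int, arrival_rate: float, service_rate: float) -> float:
    """Expected time a tuple spends queueing plus being processed."""
    offered_load = arrival_rate / service_rate
    if offered_load >= k:
        return math.inf
    waiting = erlang_c(k, offered_load) / (k * service_rate - arrival_rate)
    return waiting + 1.0 / service_rate


def executors_needed(
    arrival_rate: float,
    service_rate: float,
    target_latency: float,
    max_executors: int = 1024,
) -> Optional[int]:
    """Smallest executor count whose mean sojourn time meets the target.

    Returns None when no count up to max_executors can meet it, for
    example when the target is below the bare service time.
    """
    for k in range(1, max_executors + 1):
        if mean_sojourn_time(k, arrival_rate, service_rate) <= target_latency:
            return k
    return None


if __name__ == "__main__":
    # Hypothetical operator: 25 tuples/s arriving, 10 tuples/s per executor,
    # and a 200 ms mean latency target.
    print(executors_needed(arrival_rate=25.0, service_rate=10.0, target_latency=0.2))
```

An auto-scaler built on a rule like this would re-estimate the arrival and service rates from recent measurements and re-run the search whenever they drift. Placement and scheduling, which the abstract also lists, are outside the scope of this sketch.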

Acknowledgements

This work was supported in part by the Natural Science Foundation of China (NSFC)-Guangdong Joint Fund under Grant U1501254, in part by the NSFC under Grant 61876043 and Grant 61472089, in part by the China Scholarship Council under Grant 201608440336, in part by the Natural Science Foundation of Guangdong under Grant 2014A030306004 and Grant 2014A030308008, in part by the Guangdong High-level Personnel of Special Support Program under Grant 2015TQ01X140, in part by the Guangdong Provincial Key Laboratory of Cyber-Physical System under Grant 2016B030301008, and in part by the Pearl River S&T Nova Program of Guangzhou under Grant 201610010101.

Author information

Authors and Affiliations

School of Computers, Guangdong University of Technology, Guangzhou, 510006, China
Yingchao Cheng, Zhifeng Hao & Ruichu Cai
School of Mathematics and Big Data, Foshan University, Foshan, 528000, China
Zhifeng Hao
Department of Statistics, Texas A&M University, College Station, 77840, USA
Yingchao Cheng

Corresponding author

Correspondence to Zhifeng Hao.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Cheng, Y., Hao, Z. & Cai, R. Auto-scaling for real-time stream analytics on HPC cloud. SOCA 13, 169–183 (2019). https://doi.org/10.1007/s11761-019-00262-0

Received: 15 October 2018
Revised: 01 April 2019
Accepted: 22 May 2019
Published: 01 June 2019
Issue Date: 01 June 2019
DOI: https://doi.org/10.1007/s11761-019-00262-0
