ABSTRACT
The emerging interest in Massively Parallel Stream Processing Engines (MPSPEs), which process long-running computations over data streams of ever-growing velocity on large-scale clusters, calls for efficient dynamic resource management techniques that avoid both wasted resources and excessive processing latency. In this paper, we propose an approach that integrates dynamic resource management with passive fault-tolerance mechanisms in an MPSPE, so that the checkpoints prepared for failure recovery can also be harvested to make dynamic load migration more efficient. To maximize the opportunity to reuse checkpoints for fast load migration, we formally define a checkpoint allocation problem and provide a pragmatic algorithm to solve it. We implement all the proposed techniques on top of Apache Storm, an open-source MPSPE, and conduct extensive experiments on a real dataset to examine various aspects of our techniques. The results show that our techniques can greatly improve the efficiency of dynamic resource reconfiguration without imposing significant overhead or latency on normal job execution.
Index Terms
- Dynamic Resource Management In a Massively Parallel Stream Processing Engine