Abstract
Resource efficiency is essential for distributed stream processing engines (DSPEs), in which a streaming application is modeled as an operator graph and each operator is parallelized into a number of instances to meet low-latency and high-throughput requirements. Two major ways of optimizing resource efficiency in DSPEs are minimizing the communication cost by collocating tasks that exchange large volumes of data, and dynamically reconfiguring the system as the load varies at runtime. In the current literature, most proposals handle these two optimizations separately, and a shallow integration of the techniques, such as performing the two optimizations one after another, results in a suboptimal solution. In this paper, we present component-based parallelization (CBP), a new paradigm for optimizing the resource efficiency of DSPEs, which provides a framework for a deeper integration of the two optimizations. In the CBP paradigm, the operators are encapsulated into a set of non-overlapping components. Within each component, operators are parallelized consistently, i.e., using the same partitioning key, and hence intra-component communication is eliminated. As the workload changes, each component can be adaptively partitioned into multiple instances, each of which is deployed on a computing node. We build a cost model that captures both the communication cost and the adaptation cost of a CBP plan, and then propose several optimization algorithms. We implement the CBP scheme and the optimization algorithms on top of Apache Storm, and verify their efficiency through an extensive experimental study.
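The component idea can be illustrated with a small sketch. It is a toy, not the paper's algorithm or cost model: the operator names, partitioning keys, and data rates below are hypothetical, and the grouping rule (merge operators joined by an edge whose endpoints share a partitioning key) is an assumption chosen only to show why consistent partitioning removes communication inside a component.

```python
from collections import defaultdict


def build_components(keys, edges):
    """Group operators linked by edges whose two endpoints share a partitioning key.

    keys:  dict mapping operator name -> partitioning key (hypothetical data)
    edges: list of (source, destination, data_rate) tuples
    """
    parent = {op: op for op in keys}

    def find(op):
        # Union-find lookup with path halving.
        while parent[op] != op:
            parent[op] = parent[parent[op]]
            op = parent[op]
        return op

    for src, dst, _rate in edges:
        if keys[src] == keys[dst]:
            parent[find(src)] = find(dst)

    groups = defaultdict(set)
    for op in keys:
        groups[find(op)].add(op)
    return list(groups.values())


def inter_component_cost(components, edges):
    """Sum the data rates of edges that cross a component boundary."""
    owner = {op: i for i, comp in enumerate(components) for op in comp}
    return sum(rate for src, dst, rate in edges if owner[src] != owner[dst])


# Hypothetical four-operator pipeline.
keys = {"parse": "user_id", "enrich": "user_id", "count": "region", "alert": "region"}
edges = [("parse", "enrich", 100), ("enrich", "count", 40), ("count", "alert", 5)]

components = build_components(keys, edges)
cost = inter_component_cost(components, edges)
print(components, cost)
```

In this toy graph only the edge from `enrich` to `count` changes the partitioning key, so it is the only edge that still contributes to the communication cost once the operators are grouped. The adaptation cost that the abstract mentions is not modeled here.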
The author from North University of China is supported by the National Natural Science Foundation of China (61602427) and the Natural Science Foundation of Shanxi (201601D202037).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Guo, Q., Zhou, Y. (2017). CBP: A New Parallelization Paradigm for Massively Distributed Stream Processing. In: Candan, S., Chen, L., Pedersen, T., Chang, L., Hua, W. (eds) Database Systems for Advanced Applications. DASFAA 2017. Lecture Notes in Computer Science, vol 10178. Springer, Cham. https://doi.org/10.1007/978-3-319-55699-4_19
DOI: https://doi.org/10.1007/978-3-319-55699-4_19
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55698-7
Online ISBN: 978-3-319-55699-4
eBook Packages: Computer Science, Computer Science (R0)