CBP: A New Parallelization Paradigm for Massively Distributed Stream Processing

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10178)

Abstract

Resource efficiency is essential for distributed stream processing engines (DSPEs), in which a streaming application is modeled as an operator graph and each operator is parallelized into a number of instances to meet low-latency and high-throughput requirements. The major objectives in optimizing the resource efficiency of DSPEs are to minimize communication cost by collocating tasks that exchange large volumes of data, and to dynamically reconfigure the system as the load varies at runtime. Most existing proposals handle these two optimizations separately, and a shallow integration of the techniques, such as performing the two optimizations one after the other, yields a suboptimal solution. In this paper, we present component-based parallelization (CBP), a new paradigm for optimizing the resource efficiency of DSPEs that provides a framework for a deeper integration of the two optimizations. In the CBP paradigm, operators are encapsulated into a set of non-overlapping components. Within each component, operators are parallelized consistently, i.e., using the same partitioning key, so intra-component communication is eliminated. As the workload changes, each component can be adaptively partitioned into multiple instances, each of which is deployed on a computing node. We build a cost model that captures both the communication cost and the adaptation cost of a CBP plan, and then propose several optimization algorithms. We implement the CBP scheme and the optimization algorithms on top of Apache Storm and verify their efficiency through an extensive experimental study.
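
The component idea in the abstract can be illustrated with a toy sketch. It is not the paper's cost model or its Storm implementation: the operator names, partitioning keys, data rates, and the helper functions `is_consistent`, `communication_cost`, and `instance_for` are all hypothetical, and the cost counts only the data rate on edges that cross component boundaries (the adaptation cost is left out).

```python
import zlib

# Hypothetical operator graph: operator name -> partitioning key.
PARTITION_KEY = {
    "parse": "user_id",
    "sessionize": "user_id",
    "count": "region",
    "alert": "region",
}

# Hypothetical edges: (upstream, downstream) -> data rate in tuples/second.
EDGE_RATE = {
    ("parse", "sessionize"): 1000,
    ("sessionize", "count"): 200,
    ("count", "alert"): 50,
}


def is_consistent(component):
    """A component is valid only if all of its operators share one key."""
    return len({PARTITION_KEY[op] for op in component}) == 1


def communication_cost(components):
    """Sum the rates of edges whose endpoints lie in different components.

    Edges inside a component are treated as free, on the assumption that
    consistently partitioned operators exchange tuples locally.
    """
    owner = {op: idx for idx, comp in enumerate(components) for op in comp}
    return sum(
        rate for (src, dst), rate in EDGE_RATE.items() if owner[src] != owner[dst]
    )


def instance_for(key_value, parallelism):
    """Map a key value to an instance index with a deterministic hash."""
    return zlib.crc32(key_value.encode("utf-8")) % parallelism


# Two plans over the same graph: grouped by key, and one operator per component.
cbp_plan = [{"parse", "sessionize"}, {"count", "alert"}]
per_operator_plan = [{"parse"}, {"sessionize"}, {"count"}, {"alert"}]
```

In this toy graph, grouping operators by key leaves only the `sessionize` to `count` edge crossing a component boundary, whereas the one-operator-per-component plan pays for every edge. Because `instance_for` depends only on the key value, every operator in a component sends a given key to the same instance index, which is the sense in which consistent partitioning keeps intra-component traffic local here.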

The author from North University of China is supported by the National Natural Science Foundation of China (61602427) and the Natural Science Foundation of Shanxi (201601D202037).

Author information

Corresponding author

Correspondence to Qingsong Guo.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Guo, Q., Zhou, Y. (2017). CBP: A New Parallelization Paradigm for Massively Distributed Stream Processing. In: Candan, S., Chen, L., Pedersen, T., Chang, L., Hua, W. (eds) Database Systems for Advanced Applications. DASFAA 2017. Lecture Notes in Computer Science, vol 10178. Springer, Cham. https://doi.org/10.1007/978-3-319-55699-4_19

  • DOI: https://doi.org/10.1007/978-3-319-55699-4_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-55698-7

  • Online ISBN: 978-3-319-55699-4

  • eBook Packages: Computer Science, Computer Science (R0)
