Skip to main content
Log in

The Stratosphere platform for big data analytics

  • Regular Paper
  • Published:
The VLDB Journal Aims and scope Submit manuscript

Abstract

We present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of analytical applications at very large scale. Stratosphere’s features include “in situ” data processing, a declarative query language, treatment of user-defined functions as first-class citizens, automatic program parallelization and optimization, support for iterative programs, and a scalable and efficient execution engine. Stratosphere covers a variety of “Big Data” use cases, such as data warehousing, information extraction and integration, data cleansing, graph analysis, and statistical analysis applications. In this paper, we present the overall system architecture design decisions, introduce Stratosphere through example queries, and then dive into the internal workings of the system’s components that relate to extensibility, programming model, optimization, and query execution. We experimentally compare Stratosphere against popular open-source alternatives, and we conclude with a research outlook for the next years.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11

Similar content being viewed by others

Notes

  1. PACT is a portmanteau for “parallelization contract.”

  2. We follow the definitions from the original MapReduce paper [22] but exclude execution-specific assumptions (such as the presence of sorted reduce inputs).

    Fig. 5
    figure 5

    The five second-order functions (PACTs) currently implemented in Stratosphere. The parallelization units implied by the PACTs are enclosed in dotted boxes. a Map b Reduce c Cross d Match e CoGroup

  3. Nephele was a cloud nymph in ancient Greek mythology. The name comes from Greek “\(\nu \epsilon \phi o \varsigma \),” meaning “cloud.” The name tips a hat to Dryad [44] (a tree nymph) that influenced Nephele’s design.

  4. When referring to Java, we refer also to other languages built on top of Java and the JVM, for example, Scala or Groovy.

  5. Some language compilers can transform functions that return a sequence of values automatically into an iterator. Java, however, offers no such mechanism.

  6. At the time of writing, Scope is not offered as a product or service by Microsoft.

References

  1. Ackermann, S., Jovanovic, V., Rompf, T., Odersky, M.: Jet: an embedded dsl for high performance big data processing. In: BigData Workshop at VLDB (2012)

  2. Alexandrov, A., Ewen, S., Heimel, M., Hueske, F., Kao, O., Markl, V., Nijkamp, E., Warneke, D.: Mapreduce and pact - comparing data parallel programming models. In: BTW, pp. 25–44 (2011)

  3. Alexandrov, A., Battré, D., Ewen, S., Heimel, M., Hueske, F., Kao, O., Markl, V., Nijkamp, E., Warneke, D.: Massively parallel data analysis with pacts on nephele. PVLDB 3(2), 1625–1628 (2010)

    Google Scholar 

  4. Apache Giraph. http://incubator.apache.org/giraph/

  5. Apache Hadoop. http://hadoop.apache.org/

  6. Apache Hive. http://sortbenchmark.org/

  7. Aster Data. http://www.asterdata.com/

  8. Battré, D., Ewen, S., Hueske, F., Kao, O., Markl, V., Warneke, D.: Nephele/pacts: a programming model and execution framework for web-scale analytical processing. In: SoCC, pp. 119–130 (2010)

  9. Battré, D., Frejnik, N., Goel, S., Kao, O., Warneke, D.: Evaluation of network topology inference in opaque compute clouds through end-to-end measurements. In: IEEE CLOUD, pp. 17–24 (2011)

  10. Battré, D., Frejnik, N., Goel, S., Kao, O., Warneke, D.: Inferring network topologies in infrastructure as a service cloud. In: CCGRID, pp. 604–605 (2011)

  11. Battré, D., Hovestadt, M., Lohrmann, B., Stanik, A., Warneke, D.: Detecting bottlenecks in parallel dag-based data flow programs. In: MTAGS (2010)

  12. Behm, A., Borkar, V.R., Carey, M.J., Grover, R., Li, C., Onose, N., Vernica, R., Deutsch, A., Papakonstantinou, Y., Tsotras, V.J.: Asterix: towards a scalable, semistructured data platform for evolving-world models. Distrib. Parallel Databases 29(3), 185–216 (2011)

    Article  Google Scholar 

  13. Beyer, K.S., Ercegovac, V., Gemulla, R., Balmin, A., Eltabakh, M.Y., Kanne, C.C., Özcan, F., Shekita, E.J.: Jaql: a scripting language for large scale semistructured data analysis. PVLDB 4(12), 1272–1283 (2011)

    Google Scholar 

  14. Boden, C., Karnstedt, M., Fernandez, M., Markl, V.: Large-scale social media analytics on stratosphere. In: WWW (2013)

  15. Borkar, V.R., Carey, M.J., Grover, R., Onose, N., Vernica, R.: Hyracks: a flexible and extensible foundation for data-intensive computing. In: ICDE, pp. 1151–1162 (2011)

  16. Bruno, N., Agarwal, S., Kandula, S., Shi, B., Wu, M.C., Zhou, J.: Recurring job optimization in scope. In: SIGMOD Conference, pp. 805–806 (2012)

  17. Cha, M., Haddadi, H., Benevenuto, F., Gummadi, P.K.: Measuring user influence in twitter: the million follower fallacy. In: ICWSM (2010)

  18. Chafi, H., DeVito, Z., Moors, A., Rompf, T., Sujeeth, A.K., Hanrahan, P., Odersky, M., Olukotun, K.: Language virtualization for heterogeneous parallel computing. In: OOPSLA, pp. 835–847 (2010)

  19. Chattopadhyay, B., Lin, L., Liu, W., Mittal, S., Aragonda, P., Lychagina, V., Kwon, Y., Wong, M.: Tenzing a sql implementation on the mapreduce framework. PVLDB 4(12), 1318–1327 (2011)

    Google Scholar 

  20. Chaudhuri, S., Shim, K.: Including group-by in query optimization. In: VLDB, pp. 354–366 (1994)

  21. Cohen, J.: Graph twiddling in a mapreduce world. Comput. Sci. Eng. 11(4), 29–41 (2009)

    Article  Google Scholar 

  22. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. In: OSDI, pp. 137–150 (2004)

  23. DeWitt, D.J., Gerber, R.H., Graefe, G., Heytens, M.L., Kumar, K.B., Muralikrishna, M.: Gamma—a high performance dataflow database machine. In: VLDB, pp. 228–237 (1986)

  24. Elnozahy, E.N.M., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3), 375–408 (2002)

    Article  Google Scholar 

  25. Ewen, S., Schelter, S., Tzoumas, K., Warneke, D., Markl, V.: Iterative parallel data processing with stratosphere: an inside look. In: SIGMOD (2013)

  26. Ewen, S., Tzoumas, K., Kaufmann, M., Markl, V.: Spinning fast iterative data flows. PVLDB 5(11), 1268–1279 (2012)

    Google Scholar 

  27. Fegaras, L., Li, C., Gupta, U.: An optimization framework for map-reduce queries. In: EDBT, pp. 26–37 (2012)

  28. Fushimi, S., Kitsuregawa, M., Tanaka, H.: An overview of the system software of a parallel relational database machine grace. In: VLDB, pp. 209–219 (1986)

  29. Ghemawat, S., Gobioff, H., Leung, S.T.: The google file system. In: SOSP, pp. 29–43 (2003)

  30. Graefe, G., Bunker, R., Cooper, S.: Hash joins and hash teams in microsoft sql server. In: VLDB, pp. 86–97 (1998)

  31. Graefe, G.: Implementing sorting in database systems. ACM Comput. Surv. 38(3), Article ID 10 (2006)

  32. Graefe, G.: Parallel query execution algorithms. In: Encyclopedia of Database Systems, pp. 2030–2035 (2009)

  33. Graefe, G.: Volcano—an extensible and parallel query evaluation system. IEEE Trans. Knowl. Data Eng. 6(1), 120–135 (1994)

    Article  Google Scholar 

  34. Greenplum. http://www.greenplum.com/

  35. Guo, Z., Fan, X., Chen, R., Zhang, J., Zhou, H., McDirmid, S., Liu, C., Lin, W., Zhou, J., Zhou, L.: Spotting code optimizations in data-parallel pipelines through periscope. In: OSDI, pp. 121–133 (2012)

  36. Harjung, J.J.: Reducing formal noise in pact programs. Master’s thesis, Technische Universität Berlin, Faculty of EECS (2013)

  37. Heise, A., Rheinländer, A., Leich, M., Leser, U., Naumann, F.: Meteor/sopremo: an extensible query language and operator model. In: BigData Workshop at VLDB (2012)

  38. Heise, A., Naumann, F.: Integrating open government data with stratosphere for more transparency. Web Semant.: Sci. Serv. Agents World Wide Web 14, 45–56 (2012)

    Article  Google Scholar 

  39. Höger, M., Kao, O., Richter, P., Warneke, D.: Ephemeral materialization points in stratosphere data management on the cloud. Adv. Parallel Comput. 23, 163–181 (2013)

    Google Scholar 

  40. Hovestadt, M., Kao, O., Kliem, A., Warneke, D.: Evaluating adaptive compression to mitigate the effects of shared i/o in clouds. In: IPDPS Workshops, pp. 1042–1051 (2011)

  41. Hueske, F., Krettek, A., Tzoumas, K.: Enabling operator reordering in data flow programs through static code analysis. CoRR abs/1301.4200 (2013)

  42. Hueske, F., Peters, M., Krettek, A., Ringwald, M., Tzoumas, K., Markl, V., Freytag, J.C.: Peeking into the optimization of data flow programs with mapreduce-style udfs. In: ICDE (2013)

  43. Hueske, F., Peters, M., Sax, M., Rheinländer, A., Bergmann, R., Krettek, A., Tzoumas, K.: Opening the black boxes in data flow optimization. PVLDB 5(11), 1256–1267 (2012)

    Google Scholar 

  44. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: EuroSys, pp. 59–72 (2007)

  45. Jahani, E., Cafarella, M.J., Ré, C.: Automatic optimization for mapreduce programs. PVLDB 4(6), 385–396 (2011)

    Google Scholar 

  46. Java HotSpot VM Whitepaper. http://www.oracle.com/technetwork/java/whitepaper-135217.html

  47. JavaScript Object Notation. http://json.org/

  48. Kalavri, V.: Integrating pig and stratosphere. Master’s thesis, KTH, School of Information and Communication Technology (ICT) (2012)

  49. Kang, U., Tsourakakis, C.E., Faloutsos, C.: Pegasus: a peta-scale graph mining system. In: ICDM, pp. 229–238 (2009)

  50. Kung, H.T., Robinson, J.T.: On optimistic methods for concurrency control. ACM Trans. Database Syst. 6(2), 213–226 (1981)

    Article  Google Scholar 

  51. Leich, M., Adamek, J., Schubotz, M., Heise, A., Rheinländer, A., Markl, V.: Applying stratosphere for big data analytics. In: BTW, pp. 507–510 (2013)

  52. Lim, H., Herodotou, H., Babu, S.: Stubby: a transformation-based optimizer for mapreduce workflows. PVLDB 5(11), 1196–1207 (2012)

    Google Scholar 

  53. Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: Distributed graphlab: a framework for machine learning in the cloud. PVLDB 5(8), 716–727 (2012)

    Google Scholar 

  54. Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: SIGMOD Conference, pp. 135–146 (2010)

  55. McSherry, F., Murray, D., Isaacs, R., Isard, M.: Differential dataflow. In: CIDR (2013)

  56. Mihaylov, S.R., Ives, Z.G., Guha, S.: Rex: recursive, delta-based data-centric computation. PVLDB 5(11), 1280–1291 (2012)

  57. Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: SIGMOD Conference, pp. 1099–1110 (2008)

  58. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: parallel analysis with sawzall. Sci. Program. 13(4), 277–298 (2005)

  59. Project Gutenberg. http://www.gutenberg.org/

  60. Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In: SIGMOD Conference, pp. 23–34 (1979)

  61. Silva, Y.N., Larson, P.A., Zhou, J.: Exploiting common subexpressions for cloud query processing. In: ICDE, pp. 1337–1348 (2012)

  62. Stanford Network Analysis Project. http://snap.stanford.edu/

  63. Teradata. http://www.teradata.com/

  64. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive—a warehousing solution over a map-reduce framework. PVLDB 2(2), 1626–1629 (2009)

    Google Scholar 

  65. Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)

    Article  Google Scholar 

  66. Wang, Y.M., Fuchs, W.K.: Lazy checkpoint coordination for bounding rollback propagation. In: Reliable Distributed Systems, 1993. Proceedings., 12th Symposium on, pp. 78–85 (1993)

  67. Warneke, D., Kao, O.: Nephele: efficient parallel data processing in the cloud. In: SC-MTAGS (2009)

  68. Warneke, D., Kao, O.: Exploiting dynamic resource allocation for efficient parallel data processing in the cloud. IEEE Trans. Parallel Distrib. Syst. 22(6), 985–997 (2011)

    Article  Google Scholar 

  69. Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, Ú., Gunda, P.K., Currey, J.: Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, pp. 1–14 (2008)

  70. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: NSDI (2012)

  71. Zhang, J., Zhou, H., Chen, R., Fan, X., Guo, Z., Lin, H., Li, J.Y., Lin, W., Zhou, J., Zhou, L.: Optimizing data shuffling in data-parallel computation by understanding user-defined functions. In: NSDI (2012)

  72. Zhou, J., Bruno, N., Lin, W.: Advanced partitioning techniques for massively distributed computation. In: SIGMOD Conference, pp. 13–24 (2012)

  73. Zhou, J., Larson, P.Å., Chaiken, R.: Incorporating partitioning and parallel plans into the scope optimizer. In: ICDE, pp. 1060–1071 (2010)

  74. Zhou, J., Bruno, N., Wu, M.C., Larson, P.Å., Chaiken, R., Shakib, D.: Scope: parallel databases meet mapreduce. VLDB J. 21(5), 611–636 (2012)

    Article  Google Scholar 

Download references

Acknowledgments

We would like to thank the Master students that worked on the Stratosphere project and implemented many components of the system: Thomas Bodner, Christoph Brücke, Erik Nijkamp, Max Heimel, Moritz Kaufmann, Aljoscha Krettek, Matthias Ringwald, Tommy Neubert, Fabian Tschirschnitz, Tobias Heintz, Erik Diessler, Thomas Stolltmann.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Kostas Tzoumas.

Additional information

Stratosphere is funded by the German Research Foundation (DFG) under grant FOR 1306.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Alexandrov, A., Bergmann, R., Ewen, S. et al. The Stratosphere platform for big data analytics. The VLDB Journal 23, 939–964 (2014). https://doi.org/10.1007/s00778-014-0357-y

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00778-014-0357-y

Keywords

Navigation