ABSTRACT
At one level, this paper is about River, a virtual execution environment for stream processing. Stream processing is a paradigm well-suited for many modern data processing systems that ingest high-volume data streams from the real world, such as audio/video streaming, high-frequency trading, and security monitoring. One attractive property of stream processing is that it lends itself to parallelization on multicores, and even to distribution on clusters when extreme scale is required. Stream processing has been co-evolved by several communities, leading to diverse languages with similar core concepts. Providing a common execution environment reduces language development effort and increases portability. We designed River as a practical realization of Brooklet, a calculus for stream processing. So at another level, this paper is about a journey from theory (the calculus) to practice (the execution environment). The challenge is that, by definition, a calculus abstracts away all but the most central concepts. Hence, there are several research questions in concretizing the missing parts, not to mention a significant engineering effort in implementing them. But the effort is well worth it, because using a calculus as a foundation yields clear semantics and proven correctness results.
Index Terms
- From a calculus to an execution environment for stream processing