ABSTRACT
TimeStream is a distributed system designed specifically for low-latency continuous processing of big streaming data on a large cluster of commodity machines. The unique characteristics of this emerging application domain have led to a design significantly different from the popular MapReduce-style batch data processing. In particular, we advocate a powerful new abstraction called resilient substitution, which caters to the specific needs of this new computation model in handling failure recovery and dynamic reconfiguration in response to load changes. Several real-world applications running on our prototype have been shown to scale robustly with low latency while retaining the simple and concise declarative programming model. TimeStream handles an on-line advertising aggregation pipeline at a rate of 700,000 URLs per second with a 2-second delay, while performing sentiment analysis of Twitter data at a peak rate close to 10,000 tweets per second, with a delay of approximately 2 seconds.
Index Terms
- TimeStream: reliable stream computation in the cloud