ABSTRACT
In recent years, our customers have expressed frustration with the traditional approach of combining disparate products to handle their streaming, transactional and analytical needs. The common practice of stitching together heterogeneous environments in custom ways has caused enormous production woes by increasing development complexity and total cost of ownership. With SnappyData, an open source platform, we propose a unified engine for real-time operational analytics, delivering stream analytics, OLTP and OLAP in a single integrated solution. We realize this platform through a seamless integration of Apache Spark (as a big data computational engine) with GemFire (as an in-memory transactional store with scale-out SQL semantics). In this demonstration, after presenting a few use case scenarios, we exhibit SnappyData as our in-memory solution for delivering truly interactive analytics (i.e., responses within a couple of seconds), even when faced with large data volumes or high-velocity streams. We show that SnappyData can exploit state-of-the-art approximate query processing techniques and a variety of data synopses. Finally, we allow the audience to define various high-level accuracy contracts (HAC) to communicate their accuracy requirements to SnappyData in an intuitive fashion.
Index Terms
- SnappyData: A Hybrid Transactional Analytical Store Built On Spark