An analysis of technological frameworks for data streams

Puentes, Fernando; Pérez-Godoy, María Dolores; González, Pedro; Del Jesus, María José

doi:10.1007/s13748-020-00210-6

Regular Paper
Published: 28 June 2020

Volume 9, pages 239–261, (2020)

Progress in Artificial Intelligence

Fernando Puentes ORCID: orcid.org/0000-0001-8746-1077¹,
María Dolores Pérez-Godoy¹,
Pedro González¹ &
María José Del Jesus¹

Abstract

Real-time data analysis is becoming increasingly important in Big Data environments that must process data streams. To this end, several technological frameworks, both open-source and proprietary, have been developed for analyzing streaming data. This paper analyzes some of the open-source frameworks available for data streams and details their main characteristics. The objective is to help decide which framework to use to meet the needs of data mining methods for data streams. Several factors affect which framework is most suitable for this purpose, among them the existence of data mining libraries, the available documentation, the maturity of the platform, fault tolerance and processing guarantees. Another decisive factor when choosing a data stream framework is its performance. For this reason, two comparisons have been made: a performance and latency comparison between Spark Streaming, Spark Structured Streaming, Storm, Flink and Samza following the Yahoo Streaming Benchmark methodology, and a comparison between Spark Streaming and Flink using a clustering algorithm for data streams called streaming K-means.

Author information

Authors and Affiliations

Universidad de Jaén, Jaén, Spain
Fernando Puentes, María Dolores Pérez-Godoy, Pedro González & María José Del Jesus

Corresponding author

Correspondence to Fernando Puentes.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work has been supported by the Ministry of Economy and Competitiveness under projects TIN2015-68454-R and PID2019-107793GB-I00.

About this article

Cite this article

Puentes, F., Pérez-Godoy, M.D., González, P. et al. An analysis of technological frameworks for data streams. Prog Artif Intell 9, 239–261 (2020). https://doi.org/10.1007/s13748-020-00210-6

Received: 28 May 2019
Accepted: 30 May 2020
Published: 28 June 2020
Issue Date: September 2020
DOI: https://doi.org/10.1007/s13748-020-00210-6
