An analysis of technological frameworks for data streams

  • Regular Paper
  • Progress in Artificial Intelligence

Abstract

Real-time data analysis is becoming increasingly important in Big Data environments for addressing the problems posed by data streams. To this end, several technological frameworks, both open-source and proprietary, have been developed for the analysis of streaming data. This paper analyzes some of the open-source frameworks available for data streams and details their main characteristics. The objective is to make it easier to decide which framework to use to meet the needs of data mining methods for data streams. Several factors affect which framework is most suitable for this purpose, among them the existence of data mining libraries, the available documentation, the maturity of the platform, fault tolerance and processing guarantees. Another decisive factor when choosing a data stream framework is its performance. For this reason, two comparisons have been made: a performance and latency comparison between Spark Streaming, Spark Structured Streaming, Storm, Flink and Samza following the Yahoo Streaming Benchmark methodology, and a comparison between Spark Streaming and Flink using a clustering algorithm for data streams called streaming K-means.

Notes

  1. https://kafka.apache.org/.

  2. https://flume.apache.org/.

  3. https://nifi.apache.org/.

  4. https://Spark.apache.org/streaming/.

  5. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html.

  6. http://storm.apache.org/.

  7. https://samoa.incubator.apache.org.

  8. https://flink.apache.org/.

  9. http://samza.apache.org/.

  10. https://hadoop.apache.org/.

  11. https://apex.apache.org/.

  12. https://beam.apache.org/.

  13. https://github.com/spotify/scio.

Author information

Correspondence to Fernando Puentes.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work has been supported by the Ministry of Economy and Competitiveness under projects TIN2015-68454-R and PID2019-107793GB-I00.

Cite this article

Puentes, F., Pérez-Godoy, M.D., González, P. et al. An analysis of technological frameworks for data streams. Prog Artif Intell 9, 239–261 (2020). https://doi.org/10.1007/s13748-020-00210-6

  • DOI: https://doi.org/10.1007/s13748-020-00210-6
