Scalable Shapeoid Recognition on Multivariate Data Streams with Apache Beam

  • Conference paper
Intelligent Computing (SAI 2022)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 506)

Abstract

Time series representation and discretisation methods struggle to scale over massive data streams. A recent approach that transfers time series data into the realm of symbols by means of primitives, named shapeoids, has emerged in the area of data mining and pattern recognition. A shapeoid characterises a subset of the time series curve in words, based on its morphology. Data processing frameworks are typical candidates for running operations on top of fast, unbounded data, and they have innate traits that enable methods otherwise restricted to bounded data. Apache Beam is emerging with a unified programming model for streaming applications that can be translated to and run on multiple execution engines, saving development time that can go to other design decisions. We develop an application on Apache Beam that transfers the concept of shapeoids to a scenario in a large-scale network flow monitoring infrastructure and evaluate it on two stream computing engines.
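The idea of describing windows of an unbounded stream in words can be illustrated with a minimal, framework-free sketch. It assumes fixed-size tumbling windows and a hypothetical three-word vocabulary (`rising`, `falling`, `flat`) chosen purely for illustration; it is neither the shapeoid definition from the paper nor an Apache Beam pipeline, and the names `tumbling_windows`, `label_window`, `symbolise` and the `tolerance` parameter are assumptions.

```python
from typing import Iterable, Iterator, List


def tumbling_windows(stream: Iterable[float], size: int) -> Iterator[List[float]]:
    """Cut a (possibly unbounded) stream into consecutive bounded windows."""
    window: List[float] = []
    for value in stream:
        window.append(value)
        if len(window) == size:
            yield window
            window = []
    # An incomplete trailing window is dropped in this sketch.


def label_window(window: List[float], tolerance: float = 0.5) -> str:
    """Describe the window's overall shape with one word.

    The vocabulary is hypothetical and only compares the first and
    last value of the window against a tolerance.
    """
    change = window[-1] - window[0]
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "falling"
    return "flat"


def symbolise(stream: Iterable[float], size: int, tolerance: float = 0.5) -> List[str]:
    """Turn a numeric stream into a sequence of shape words, one per window."""
    return [label_window(w, tolerance) for w in tumbling_windows(stream, size)]


if __name__ == "__main__":
    values = [1, 2, 3, 4, 4, 4, 4, 4, 4, 3, 2, 1, 9]
    print(symbolise(values, size=4))
```

Windowing is what makes a per-window method usable on unbounded data: each window is a small bounded series, so a stream engine can label windows independently and in parallel.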

Notes

  1. An event is anything that happens or is contemplated as happening [33].

  2. The standalone version should match one worker VM of the clustered version so that the speedup can also be observed.

  3. https://moodle.com/.

  4. https://opencast.org/.

  5. https://github.com/thantsi/beam-scotty-for-netflows.

  6. A sensitivity analysis may be beneficial in this step.

  7. http://rocksdb.org/.

References

  1. Akidau, T., et al.: MillWheel: fault-tolerant stream processing at internet scale. Proc. VLDB Endow. 6(11), 1033–1044 (2013)

  2. Akidau, T., et al.: The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc. VLDB Endow. 8, 1792–1803 (2015)

  3. Apache: Apache Beam Capability Matrix. https://beam.apache.org/documentation/runners/capability-matrix/. Accessed Oct 2021

  4. Apache: Apache Flink - Stateful Computations over Data Streams. https://flink.apache.org/. Accessed Oct 2021

  5. Apache: Apache Kafka. https://kafka.apache.org/. Accessed Oct 2021

  6. Apache: Apache Nemo. https://nemo.apache.org/. Accessed Oct 2021

  7. Apache: Apache Samza. http://samza.apache.org/. Accessed Oct 2021

  8. Apache: Apache Spark - Unified Analytics Engine for Big Data. https://spark.apache.org/. Accessed Oct 2021

  9. Botan, I., Derakhshan, R., Dindar, N., Haas, L., Miller, R.J., Tatbul, N.: SECRET: a model for analysis of the execution semantics of stream processing systems. Proc. VLDB Endow. 3(1–2), 232–243 (2010)

  10. Bulkowski, T.N.: Encyclopedia of Chart Patterns. Wiley, Hoboken (2021)

  11. Buono, P., Aris, A., Plaisant, C., Khella, A., Shneiderman, B.: Interactive pattern search in time series. In: Visualization and Data Analysis 2005, vol. 5669, pp. 175–186. International Society for Optics and Photonics (2005)

  12. bwNet Consortium: bwNet100G+ - Research and innovative services for a flexible 100G-network in Baden-Wuerttemberg. https://bwnet100g.de/. Accessed Oct 2021

  13. Camerra, A., Palpanas, T., Shieh, J., Keogh, E.: iSAX 2.0: indexing and mining one billion time series. In: 2010 IEEE International Conference on Data Mining, pp. 58–67. IEEE (2010)

  14. Castro, N., Azevedo, P.: Multiresolution motif discovery in time series. In: Proceedings of the 2010 SIAM International Conference on Data Mining, pp. 665–676. SIAM (2010)

  15. Claise, B.: Cisco systems netflow services export version 9. RFC 3954, pp. 1–33 (2004)

  16. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

  17. Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., Keogh, E.: Querying and mining of time series data: experimental comparison of representations and distance measures. Proc. VLDB Endow. 1(2), 1542–1552 (2008)

  18. Esbensen, K.H., Guyot, D., Westad, F., Houmoller, L.P.: Multivariate Data Analysis: In Practice: An Introduction to Multivariate Data Analysis and Experimental Design (2002)

  19. Esling, P., Agon, C.: Time-series data mining. ACM Comput. Surv. (CSUR) 45(1), 1–34 (2012)

  20. Fu, T.-C.: A review on time series data mining. Eng. Appl. Artif. Intell. 24(1), 164–181 (2011)

  21. Gama, J., Rodrigues, P.P.: Data stream processing. In: Gama, J., Gaber, M.M. (eds.) Learning from Data Streams, pp. 25–39. Springer, Heidelberg (2007). https://doi.org/10.1007/3-540-73679-4_3

  22. Ge, X.: Pattern matching in financial time series data. Final project report for ICS 278 (1998)

  23. Google: Google Dataflow. https://cloud.google.com/dataflow/. Accessed Oct 2021

  24. Hazelcast: Hazelcast Jet - The ultra-fast stream and batch processing framework. https://hazelcast.com/products/stream-processing/. Accessed Oct 2021

  25. Johnson, T., Muthukrishnan, S., Shkapenyuk, V., Spatscheck, O.: A heartbeat mechanism and its application in gigascope. In: Proceedings of the 31st International Conference on Very Large Data Bases, pp. 1079–1088 (2005)

  26. Kasetty, S., Stafford, C., Walker, G.P., Wang, X., Keogh, E.: Real-time classification of streaming sensor data. In: 2008 20th IEEE International Conference on Tools with Artificial Intelligence, vol. 1, pp. 149–156. IEEE (2008)

  27. Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. Knowl. Inf. Syst. 3(3), 263–286 (2001). https://doi.org/10.1007/PL00011669

  28. Le Nguyen, T., Gsponer, S., Ilie, I., O’Reilly, M., Ifrim, G.: Interpretable time series classification using linear models and multi-resolution multi-domain symbolic representations. Data Min. Knowl. Discov. 33(4), 1183–1222 (2019). https://doi.org/10.1007/s10618-019-00633-3

  29. Li, J., Tufte, K., Shkapenyuk, V., Papadimos, V., Johnson, T., Maier, D.: Out-of-order processing: a new architecture for high-performance stream systems. Proc. VLDB Endow. 1(1), 274–288 (2008)

  30. Li, S., Gerver, P., MacMillan, J., Debrunner, D., Marshall, W., Wu, K.-L.: Challenges and experiences in building an efficient apache beam runner for IBM streams. Proc. VLDB Endow. 11(12), 1742–1754 (2018)

  31. Lin, J., Keogh, E., Wei, L., Lonardi, S.: Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov. 15(2), 107–144 (2007). https://doi.org/10.1007/s10618-007-0064-z

  32. Lin, J., Khade, R., Li, Y.: Rotation-invariant similarity in time series using bag-of-patterns representation. J. Intell. Inf. Syst. 39(2), 287–315 (2012). https://doi.org/10.1007/s10844-012-0196-5

  33. Luckham, D., Schulte, W.R.: Glossary of terminology: the event processing technical society: (EPTS) glossary of terms-version 2.0. In: Event Processing for Business, pp. 237–258. Wiley, Hoboken (2012)

  34. Miller, C., Nagy, Z., Schlueter, A.: Automated daily pattern filtering of measured building performance data. Autom. Constr. 49, 1–17 (2015)

  35. Mohebbi, M., Vanderkam, D., Kodysh, J., Schonberger, R., Choi, H., Kumar, S.: Google correlate whitepaper. Technical report, Google (2011)

  36. Nägele, D., Hauser, C.B., Bradatsch, L., Wesner, S.: bwNetFlow: a customizable multi-tenant flow processing platform for transit providers. In: 2019 IEEE/ACM Innovating the Network for Data-Intensive Science (INDIS), pp. 9–16. IEEE (2019)

  37. Ruta, N., Sawada, N., McKeough, K., Behrisch, M., Beyer, J.: SAX navigator: time series exploration through hierarchical clustering. In: 2019 IEEE Visualization Conference (VIS), pp. 236–240. IEEE (2019)

  38. Schäfer, P.: The BOSS is concerned with time series classification in the presence of noise. Data Min. Knowl. Discov. 29(6), 1505–1530 (2015). https://doi.org/10.1007/s10618-014-0377-7

  39. Senin, P., et al.: GrammarViz 3.0: interactive discovery of variable-length time series patterns. ACM Trans. Knowl. Discov. Data (TKDD) 12(1), 1–28 (2018)

  40. Siddiqui, T., Kim, A., Lee, J., Karahalios, K., Parameswaran, A.: Effortless data exploration with zenvisage: an expressive and interactive visual analytics system. arXiv preprint arXiv:1604.03583 (2016)

  41. Siddiqui, T., Luh, P., Wang, Z., Karahalios, K., Parameswaran, A.: ShapeSearch: a flexible and efficient system for shape-based exploration of trendlines. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 51–65 (2020)

  42. Traub, J., et al.: Efficient window aggregation with general stream slicing. In: EDBT, pp. 97–108 (2019)

  43. Tsitsipas, A., Schiessle, P., Schubert, L.: Scotty: fast a priori structure-based extraction from time series. In: 2021 IEEE International Conference on Big Data (IEEE Big Data 2021). IEEE Computer Society (2021)

  44. Tsitsipas, A., Schubert, L.: Modelling and reasoning for indirect sensing over discrete-time via Markov logic networks. In: Cassens, J., Wegener, R., Kofod-Petersen, A. (eds.) Proceedings of the Twelfth International Workshop Modelling and Reasoning in Context (MRC 2021), vol. 2995, pp. 9–18. CEUR-WS.org (2021)

  45. Tsitsipas, A., Schubert, L.: On group theory and interpretable time series primitives. In: Li, B., et al. (eds.) ADMA 2022. LNCS (LNAI), vol. 13088, pp. 263–275. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-95408-6_20

  46. Tucker, P.A., Maier, D., Sheard, T., Fegaras, L.: Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowl. Data Eng. 15(3), 555–568 (2003)

  47. Twister2 - High performance Data Analytics. Indiana University. https://twister2.org/. Accessed Oct 2021

  48. VMWare: RabbitMQ - messaging that just works. https://www.rabbitmq.com/. Accessed Oct 2021

  49. Ye, L., Keogh, E.: Time series shapelets: a new primitive for data mining. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 947–956 (2009)

  50. Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S., Stoica, I.: Discretized streams: fault-tolerant streaming computation at scale. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 423–438 (2013)

  51. Zhu, S., Fiameni, G., Simonini, G., Bergamaschi, S.: SOPJ: a scalable online provenance join for data integration. In: 2017 International Conference on High Performance Computing & Simulation (HPCS), pp. 79–85. IEEE (2017)

Acknowledgment

The research leading to these results has received partial funding from Germany’s Federal Ministry of Education and Research (BMBF) under HorME (01IS18072) and the federal state of Baden-Württemberg (Germany), under the Project bwNet2020+.

Author information

Corresponding author

Correspondence to Athanasios Tsitsipas.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Tsitsipas, A., Eisenhart, G., Seybold, D., Wesner, S. (2022). Scalable Shapeoid Recognition on Multivariate Data Streams with Apache Beam. In: Arai, K. (eds) Intelligent Computing. SAI 2022. Lecture Notes in Networks and Systems, vol 506. Springer, Cham. https://doi.org/10.1007/978-3-031-10461-9_48
