Abstract
Time series representation and discretisation methods struggle to scale over massive data streams. A recent approach that transfers time series data into the realm of symbols, using primitives named shapeoids, has emerged in the area of data mining and pattern recognition. A shapeoid characterises a segment of the time series curve in words derived from its morphology. Data processing frameworks are typical platforms for running operations on fast, unbounded data, and their built-in capabilities can enable methods that are otherwise restricted to bounded data. Apache Beam offers a unified programming model for streaming applications that can be translated to and run on multiple execution engines, freeing development time for other design decisions. We develop an application on Apache Beam that transfers the concept of shapeoids to a large-scale network flow monitoring scenario and evaluate it on two stream computing engines.
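As a rough illustration of describing pieces of a curve in words, the sketch below cuts a stream into fixed-size windows and labels each by the direction of its average slope. It is not the chapter's shapeoid definition, which this page does not give: the three-word vocabulary, the `FLAT_TOLERANCE` threshold, and all function names are hypothetical, and plain Python stands in for an Apache Beam pipeline.

```python
from typing import Iterable, Iterator, List

# Hypothetical threshold, chosen only for illustration.
FLAT_TOLERANCE = 0.1


def label_window(values: List[float], tolerance: float = FLAT_TOLERANCE) -> str:
    """Describe a window by the sign of its average slope per step."""
    if len(values) < 2:
        return "flat"
    slope = (values[-1] - values[0]) / (len(values) - 1)
    if slope > tolerance:
        return "rising"
    if slope < -tolerance:
        return "falling"
    return "flat"


def tumbling_windows(stream: Iterable[float], size: int) -> Iterator[List[float]]:
    """Cut a possibly unbounded iterable into consecutive, non-overlapping windows."""
    window: List[float] = []
    for value in stream:
        window.append(value)
        if len(window) == size:
            yield window
            window = []  # an incomplete trailing window is never emitted


def symbolise(stream: Iterable[float], size: int) -> Iterator[str]:
    """Emit one word per completed window."""
    for window in tumbling_windows(stream, size):
        yield label_window(window)


if __name__ == "__main__":
    samples = [0, 1, 2, 3, 3, 3, 3, 3, 3, 2, 1, 0]
    print(list(symbolise(samples, size=4)))
```

Because each window is labelled independently of the others, a step of this kind can in principle be distributed across workers, which is the property a stream processing engine relies on.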
Notes
- 1. An event is anything that happens or is contemplated as happening [33].
- 2. The standalone version should be equal to one worker VM in the clustered version to observe also the speedup.
- 6. A sensitivity analysis may be beneficial in this step.
References
Akidau, T., et al.: MillWheel: fault-tolerant stream processing at internet scale. Proc. VLDB Endow. 6(11), 1033–1044 (2013)
Akidau, T., et al.: The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc. VLDB Endow. 8, 1792–1803 (2015)
Apache: Apache Beam Capability Matrix. https://beam.apache.org/documentation/runners/capability-matrix/. Accessed Oct 2021
Apache: Apache Flink - Stateful Computations over Data Streams. https://flink.apache.org/. Accessed Oct 2021
Apache: Apache Kafka. https://kafka.apache.org/. Accessed Oct 2021
Apache: Apache Nemo. https://nemo.apache.org/. Accessed Oct 2021
Apache: Apache Samza. http://samza.apache.org/. Accessed Oct 2021
Apache: Apache Spark - Unified Analytics Engine for Big Data. https://spark.apache.org/. Accessed Oct 2021
Botan, I., Derakhshan, R., Dindar, N., Haas, L., Miller, R.J., Tatbul, N.: SECRET: a model for analysis of the execution semantics of stream processing systems. Proc. VLDB Endow. 3(1–2), 232–243 (2010)
Bulkowski, T.N.: Encyclopedia of Chart Patterns. Wiley, Hoboken (2021)
Buono, P., Aris, A., Plaisant, C., Khella, A., Shneiderman, B.: Interactive pattern search in time series. In: Visualization and Data Analysis 2005, vol. 5669, pp. 175–186. International Society for Optics and Photonics (2005)
bwNet Consortium: bwNet100G+ - Research and innovative services for a flexible 100G-network in Baden-Wuerttemberg. https://bwnet100g.de/. Accessed Oct 2021
Camerra, A., Palpanas, T., Shieh, J., Keogh, E.: iSAX 2.0: indexing and mining one billion time series. In: 2010 IEEE International Conference on Data Mining, pp. 58–67. IEEE (2010)
Castro, N., Azevedo, P.: Multiresolution motif discovery in time series. In: Proceedings of the 2010 SIAM International Conference on Data Mining, pp. 665–676. SIAM (2010)
Claise, B.: Cisco Systems NetFlow Services Export Version 9. RFC 3954, pp. 1–33 (2004)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., Keogh, E.: Querying and mining of time series data: experimental comparison of representations and distance measures. Proc. VLDB Endow. 1(2), 1542–1552 (2008)
Esbensen, K.H., Guyot, D., Westad, F., Houmoller, L.P.: Multivariate Data Analysis: In Practice: An Introduction to Multivariate Data Analysis and Experimental Design (2002)
Esling, P., Agon, C.: Time-series data mining. ACM Comput. Surv. (CSUR) 45(1), 1–34 (2012)
Fu, T.-C.: A review on time series data mining. Eng. Appl. Artif. Intell. 24(1), 164–181 (2011)
Gama, J., Rodrigues, P.P.: Data stream processing. In: Gama, J., Gaber, M.M. (eds.) Learning from Data Streams, pp. 25–39. Springer, Heidelberg (2007). https://doi.org/10.1007/3-540-73679-4_3
Ge, X.: Pattern matching in financial time series data. Final project report for ICS 278 (1998)
Google: Google Dataflow. https://cloud.google.com/dataflow/. Accessed Oct 2021
Hazelcast: Hazelcast Jet - The ultra-fast stream and batch processing framework. https://hazelcast.com/products/stream-processing/. Accessed Oct 2021
Johnson, T., Muthukrishnan, S., Shkapenyuk, V., Spatscheck, O.: A heartbeat mechanism and its application in gigascope. In: Proceedings of the 31st International Conference on Very Large Data Bases, pp. 1079–1088 (2005)
Kasetty, S., Stafford, C., Walker, G.P., Wang, X., Keogh, E.: Real-time classification of streaming sensor data. In: 2008 20th IEEE International Conference on Tools with Artificial Intelligence, vol. 1, pp. 149–156. IEEE (2008)
Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. Knowl. Inf. Syst. 3(3), 263–286 (2001). https://doi.org/10.1007/PL00011669
Le Nguyen, T., Gsponer, S., Ilie, I., O’Reilly, M., Ifrim, G.: Interpretable time series classification using linear models and multi-resolution multi-domain symbolic representations. Data Min. Knowl. Discov. 33(4), 1183–1222 (2019). https://doi.org/10.1007/s10618-019-00633-3
Li, J., Tufte, K., Shkapenyuk, V., Papadimos, V., Johnson, T., Maier, D.: Out-of-order processing: a new architecture for high-performance stream systems. Proc. VLDB Endow. 1(1), 274–288 (2008)
Li, S., Gerver, P., MacMillan, J., Debrunner, D., Marshall, W., Wu, K.-L.: Challenges and experiences in building an efficient Apache Beam runner for IBM Streams. Proc. VLDB Endow. 11(12), 1742–1754 (2018)
Lin, J., Keogh, E., Wei, L., Lonardi, S.: Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov. 15(2), 107–144 (2007). https://doi.org/10.1007/s10618-007-0064-z
Lin, J., Khade, R., Li, Y.: Rotation-invariant similarity in time series using bag-of-patterns representation. J. Intell. Inf. Syst. 39(2), 287–315 (2012). https://doi.org/10.1007/s10844-012-0196-5
Luckham, D., Schulte, W.R.: Glossary of terminology: the event processing technical society: (EPTS) glossary of terms-version 2.0. In: Event Processing for Business, pp. 237–258. Wiley, Hoboken (2012)
Miller, C., Nagy, Z., Schlueter, A.: Automated daily pattern filtering of measured building performance data. Autom. Constr. 49, 1–17 (2015)
Mohebbi, M., Vanderkam, D., Kodysh, J., Schonberger, R., Choi, H., Kumar, S.: Google correlate whitepaper. Technical report, Google (2011)
Nägele, D., Hauser, C.B., Bradatsch, L., Wesner, S.: bwNetFlow: a customizable multi-tenant flow processing platform for transit providers. In: 2019 IEEE/ACM Innovating the Network for Data-Intensive Science (INDIS), pp. 9–16. IEEE (2019)
Ruta, N., Sawada, N., McKeough, K., Behrisch, M., Beyer, J.: SAX navigator: time series exploration through hierarchical clustering. In: 2019 IEEE Visualization Conference (VIS), pp. 236–240. IEEE (2019)
Schäfer, P.: The BOSS is concerned with time series classification in the presence of noise. Data Min. Knowl. Discov. 29(6), 1505–1530 (2015). https://doi.org/10.1007/s10618-014-0377-7
Senin, P., et al.: GrammarViz 3.0: interactive discovery of variable-length time series patterns. ACM Trans. Knowl. Discov. Data (TKDD) 12(1), 1–28 (2018)
Siddiqui, T., Kim, A., Lee, J., Karahalios, K., Parameswaran, A.: Effortless data exploration with zenvisage: an expressive and interactive visual analytics system. arXiv preprint arXiv:1604.03583 (2016)
Siddiqui, T., Luh, P., Wang, Z., Karahalios, K., Parameswaran, A.: ShapeSearch: a flexible and efficient system for shape-based exploration of trendlines. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 51–65 (2020)
Traub, J., et al.: Efficient window aggregation with general stream slicing. In: EDBT, pp. 97–108 (2019)
Tsitsipas, A., Schiessle, P., Schubert, L.: Scotty: fast a priori structure-based extraction from time series. In: 2021 IEEE International Conference on Big Data (IEEE Big Data 2021). IEEE Computer Society (2021)
Tsitsipas, A., Schubert, L.: Modelling and reasoning for indirect sensing over discrete-time via Markov logic networks. In: Cassens, J., Wegener, R., Kofod-Petersen, A. (eds.) Proceedings of the Twelfth International Workshop Modelling and Reasoning in Context (MRC 2021), vol. 2995, pp. 9–18. CEUR-WS.org (2021)
Tsitsipas, A., Schubert, L.: On group theory and interpretable time series primitives. In: Li, B., et al. (eds.) ADMA 2022. LNCS (LNAI), vol. 13088, pp. 263–275. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-95408-6_20
Tucker, P.A., Maier, D., Sheard, T., Fegaras, L.: Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowl. Data Eng. 15(3), 555–568 (2003)
Twister2 - High performance Data Analytics. Indiana University. https://twister2.org/. Accessed Oct 2021
VMWare: RabbitMQ - messaging that just works. https://www.rabbitmq.com/. Accessed Oct 2021
Ye, L., Keogh, E.: Time series shapelets: a new primitive for data mining. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 947–956 (2009)
Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S., Stoica, I.: Discretized streams: fault-tolerant streaming computation at scale. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 423–438 (2013)
Zhu, S., Fiameni, G., Simonini, G., Bergamaschi, S.: SOPJ: a scalable online provenance join for data integration. In: 2017 International Conference on High Performance Computing & Simulation (HPCS), pp. 79–85. IEEE (2017)
Acknowledgment
The research leading to these results has received partial funding from Germany’s Federal Ministry of Education and Research (BMBF) under HorME (01IS18072) and the federal state of Baden-Württemberg (Germany), under the Project bwNet2020+.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tsitsipas, A., Eisenhart, G., Seybold, D., Wesner, S. (2022). Scalable Shapeoid Recognition on Multivariate Data Streams with Apache Beam. In: Arai, K. (eds) Intelligent Computing. SAI 2022. Lecture Notes in Networks and Systems, vol 506. Springer, Cham. https://doi.org/10.1007/978-3-031-10461-9_48
DOI: https://doi.org/10.1007/978-3-031-10461-9_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-10460-2
Online ISBN: 978-3-031-10461-9
eBook Packages: Intelligent Technologies and Robotics (R0)