DOI: 10.1145/3448016.3457245

Index-Accelerated Pattern Matching in Event Stores

Published: 18 June 2021

Abstract

IoT applications require a new type of database system, termed an event store, for ingesting fast-arriving event streams and efficiently supporting analytical ad-hoc queries over time. One of the most important operations in this regard is sequential pattern matching, also known as Match_Recognize, which matches user-defined predicates to subsequences of events. While Match_Recognize is well known in the field of event processing, it has only recently become part of the SQL standard. Despite that, Match_Recognize has received little attention in the database area so far. We present a novel approach to speed up an important class of Match_Recognize queries on event stores by utilizing off-the-shelf secondary indexes on non-temporal attributes (e.g., B+-trees, LSM-trees) and a cost model for selecting the most appropriate indexes. Our approach keeps temporal and sequential information in secondary indexes to prune large parts of the stream from further processing. However, simply using as many secondary indexes as available is not the right choice, because the access cost for the index scans can exceed the processing time of the naïve approach that scans the entire stream and replays it into an event processing system. To address this problem, we present a first cost model to estimate the total execution cost of a Match_Recognize query for a set of available indexes. Based on this cost model, we devise an efficient index selection strategy that avoids a full enumeration of index configurations. Prototypical implementations of our approach are available in our open-source research prototype, a commercial database system, and Apache Flink. In experiments with synthetic and real-world data sets, all our index-based implementations clearly outperform the naïve replay strategy that is currently offered in commercial database systems and Flink.
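The trade-off between index access cost and replay cost can be illustrated with a small sketch. This is not the paper's cost model or selection strategy: it assumes each index has a hypothetical scan cost and selectivity, that selectivities combine independently, and that replaying one event costs a fixed unit. The names `IndexStats`, `total_cost`, and `greedy_select`, and the greedy heuristic itself, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexStats:
    name: str
    scan_cost: float    # assumed cost of scanning this index for its predicate
    selectivity: float  # assumed fraction of the stream that survives pruning


def total_cost(chosen, stream_size, replay_cost_per_event=1.0):
    """Toy estimate: all index scans plus replay of the surviving events."""
    surviving = float(stream_size)
    for idx in chosen:
        surviving *= idx.selectivity  # independence assumption
    return sum(i.scan_cost for i in chosen) + surviving * replay_cost_per_event


def greedy_select(indexes, stream_size):
    """Add the index that lowers the estimate most; stop when none helps.

    An empty result means the full replay is estimated to be cheapest.
    """
    chosen = []
    best = total_cost(chosen, stream_size)  # baseline: replay everything
    remaining = list(indexes)
    while remaining:
        candidate = min(
            remaining, key=lambda i: total_cost(chosen + [i], stream_size)
        )
        cost = total_cost(chosen + [candidate], stream_size)
        if cost >= best:
            break
        chosen.append(candidate)
        remaining.remove(candidate)
        best = cost
    return chosen, best
```

With such a model, an index whose scan cost exceeds the replay work it saves is never selected, which mirrors the observation that using every available index is not always the right choice.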

Supplementary Material

MP4 File (3448016.3457245.mp4)
High-volume event streams are ubiquitous in today's data processing landscape. One of the most important operations in online stream processing is sequential pattern matching, which matches user-defined predicates to subsequences of events. However, online stream processing systems are designed neither for persistent storage of very large streams nor for processing pattern queries on those persistent streams. Therefore, the common execution strategy for pattern queries on persistent streams is to replay the entire stream into an online stream processing engine. Besides the overhead of transferring large event streams between two systems, this solution does not use the powerful processing capabilities of state-of-the-art database systems.

In this paper, we present a novel approach to speed up pattern matching queries on large persistent streams by utilizing both off-the-shelf secondary indexes on non-temporal attributes (e.g., B+-trees, LSM-trees) and a cost model for selecting the indexes for efficient query processing. One core idea is to keep temporal and sequential information in secondary indexes to prune large parts of the stream from further processing. However, when simply using as many secondary indexes as available, the additional index access cost can easily lead to an overall worse runtime than a replay. Thus, we present a first cost model to estimate the total execution cost of a pattern query for a set of available indexes. Based on this cost model, we devise an efficient index selection strategy that avoids a full enumeration of index configurations.

In experiments with synthetic and real-world data sets, we show how our approach greatly improves response times for pattern queries on persistent streams in comparison to replay and another recently proposed strategy. Furthermore, we demonstrate the accuracy of our cost model and the practical importance of a cost-based index selection strategy.
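The pruning idea can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it assumes a two-step pattern "A followed by B within a time window", and that secondary index lookups have already produced the sorted timestamps of events satisfying each predicate. The function name `candidate_ranges` and its parameters are hypothetical.

```python
from bisect import bisect_right


def candidate_ranges(a_times, b_times, window):
    """Return merged [start, end] time ranges that may contain an A-then-B match.

    a_times, b_times: sorted timestamps from (assumed) secondary index lookups
    for the predicates on A and B. Only these ranges would need to be replayed
    into the pattern matcher; the rest of the stream is pruned.
    """
    ranges = []
    for t in a_times:
        # Position of the first B event strictly after this A event.
        i = bisect_right(b_times, t)
        if i < len(b_times) and b_times[i] <= t + window:
            start, end = t, t + window
            if ranges and start <= ranges[-1][1]:
                # Overlaps the previous range: extend it instead of adding one.
                ranges[-1][1] = max(ranges[-1][1], end)
            else:
                ranges.append([start, end])
    return ranges
```

For example, with A events at times 1, 10, and 50, a B event at time 3, and a window of 5, only the range from 1 to 6 would be replayed.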

Published In

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN:9781450383431
DOI:10.1145/3448016

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cep
  2. event processing
  3. event store
  4. indexing
  5. match recognize
  6. pattern matching
  7. stream processing

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '21
Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
