Elsevier

Information Systems

Volume 104, February 2022, 101679

D2IA: User-defined interval analytics on distributed streams

https://doi.org/10.1016/j.is.2020.101679

Highlights

  • A family of operators to generate interval events from instantaneous events.

  • Operators are defined as a DSL to expressively generate data-driven intervals.

  • Operators fill a gap in the expressiveness of large-scale stream processing engines.

  • Two alternative implementations based on Apache Flink and Esper.

  • A systematic performance evaluation using the Linear Road benchmark.

Abstract

Nowadays, modern Big Stream Processing Solutions (e.g., Spark, Flink) are working towards being the ultimate framework for streaming analytics. In order to achieve this goal, they started to offer extensions of SQL that incorporate stream-oriented primitives such as windowing and Complex Event Processing (CEP). The former enables stateful computation on infinite sequences of data items, while the latter focuses on the detection of event patterns. In most cases, data items and events are considered instantaneous, i.e., they are single time points in a discrete temporal domain. Nevertheless, a point-based time semantics does not satisfy the requirements of a number of use-cases. For instance, it is not possible to detect the interval during which the temperature increases until the temperature begins to decrease, nor all the relations this interval subsumes. To tackle this challenge, we present D2IA, a set of novel abstract operators to define analytics on user-defined event intervals based on raw events and to efficiently reason about temporal relationships between intervals and/or point events. We realize the implementation of the concepts of D2IA on top of Flink, a distributed stream processing engine for big data.

Introduction

Streaming data analytics has become more relevant than ever. Together with data Volume and Variety, the rise of data Velocity is forcing many organizations to embrace the real-time paradigm shift. A data stream, which is an unbounded sequence of partially-ordered data, is a convenient abstraction when data naturally arrives over time, e.g., data from a sensor network. Intuitively, the unbounded nature of streams impacts the way data systems handle query answering. Information needs become continuous, i.e., from an unbounded input an unbounded output is expected. To this end, a new generation of Stream Processing Engines (SPEs) for Big Data (BigSPEs) is emerging to process vast, heterogeneous, and noisy data streams [1].

SPEs are commonly classified into Data Stream Management Systems (DSMSs) and Complex Event Processing (CEP) [2] systems. The state-of-the-art on DSMSs and CEP systems is vast and includes a variety of Domain Specific Languages (DSLs) to analyze data streams. Most of these DSLs are declarative and expose special operators to deal with streams’ unboundedness. In particular, most DSMSs adopt time-based windows to slice the input streams into finite portions, upon which they can perform stateful aggregations [3]. On the other hand, CEP engines employ regular languages to detect event patterns over streams [4] using Non-deterministic Finite State Automata (NFSA).
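To make the windowing idea concrete, the following is a minimal sketch in plain Python of a time-based sliding window that maintains a running average. It is not taken from the paper or from any DSMS; the `(timestamp, value)` event layout, the function name `sliding_average`, and the assumption that events arrive in timestamp order are all illustrative choices.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

# Hypothetical event layout: (timestamp in seconds, temperature).
Event = Tuple[float, float]


def sliding_average(events: Iterable[Event], width: float) -> Iterator[Tuple[float, float]]:
    """For each event, yield (timestamp, average over the last `width` seconds).

    Assumes events arrive in non-decreasing timestamp order.
    """
    window: Deque[Event] = deque()
    total = 0.0  # state kept across events: sum of the values in the window
    for ts, value in events:
        window.append((ts, value))
        total += value
        # Evict events that fell out of the half-open window (ts - width, ts].
        while window and window[0][0] <= ts - width:
            _, old = window.popleft()
            total -= old
        yield ts, total / len(window)


readings = [(0, 20.0), (120, 22.0), (240, 24.0), (360, 26.0)]
averages = list(sliding_average(readings, width=300))
```

The window turns an unbounded input into a sequence of finite portions, so the aggregate only needs state proportional to the window content rather than to the whole stream.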

Listings 1.1 and 1.2 show a DSMS and a CEP query, respectively. The former calculates the average temperature over the last 5 min, while the latter emits a fire event whenever it detects a smoke event followed by a temperature event that reports a value higher than 40. Both listings make use of an industrial DSL called Event Processing Language (EPL). EPL combines DSMS and CEP features into a hybrid solution that is very expressive. Interestingly, existing EPL implementations like Esper and OracleCEP can only scale up. While vertical scalability is sufficient for a variety of use-cases, Big Data applications often call for fault-tolerant and horizontally-scalable BigSPEs. Nonetheless, the need for democratizing Big Data brought many BigSPEs to adopt SQL-like DSLs for stream processing.
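The CEP pattern described above (a smoke event followed by a temperature reading above 40) can be approximated by a two-state machine. The sketch below is plain Python, not EPL, and it does not reproduce EPL's exact followed-by semantics: it assumes a hypothetical `(timestamp, kind, value)` event layout, matches each smoke event with the first qualifying temperature event after it, and then resets.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical event layout: (timestamp, kind, value).
Event = Tuple[int, str, float]


def detect_fire(events: Iterable[Event], threshold: float = 40.0) -> Iterator[Tuple[int, int]]:
    """Yield (smoke_ts, temperature_ts) for smoke followed by a high temperature."""
    smoke_ts: Optional[int] = None  # None means no pending smoke event
    for ts, kind, value in events:
        if kind == "smoke":
            if smoke_ts is None:
                smoke_ts = ts  # remember the earliest unmatched smoke event
        elif kind == "temperature" and smoke_ts is not None and value > threshold:
            yield smoke_ts, ts
            smoke_ts = None  # reset and wait for the next smoke event


stream = [
    (1, "temperature", 45.0),  # ignored: no smoke seen yet
    (2, "smoke", 1.0),
    (3, "temperature", 35.0),  # ignored: below the threshold
    (4, "temperature", 42.0),  # completes the pattern
    (5, "temperature", 50.0),  # ignored: the pattern was reset
]
fires = list(detect_fire(stream))
```

Note that the match is reported as a pair of time points; the result itself carries no duration, which is the limitation the rest of the paper addresses.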

In this paper, we advocate that the trade-off between expressiveness and scalability led BigSPEs to design APIs and DSLs that do not meet the expectations raised by the centralized solutions [1].

Nevertheless, such expressiveness is crucial in several applications like the following air traffic scenario inspired by Bombardier’s C Series jetliner. This plane, designed in 2015, is fitted with 5000 sensors that generate up to 10 GB of data per second. Many events are continuously produced during flights, e.g., changes in altitude, speed, and heading of an aircraft. In such a scenario, we can be interested in detecting those events during which a plane is in cruising mode and performs a change in altitude of more than 10%. We can use EPL to design a solution for this scenario (cf. Listing 1.3). However, to the best of our knowledge, solutions like Flink or Spark Streaming do not provide such a feature out-of-the-box and require some customization to be used. Indeed, while the query above requires processing events that have a duration, existing BigSPEs adopt a point-based time semantics.
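As an illustration of what deriving events with a duration from point events could look like, the sketch below groups consecutive cruising readings into one interval and keeps the interval only if the altitude deviates by more than 10% from the altitude at its start. This is one possible reading of the scenario, not the paper's operator definition or its EPL listing; the `(timestamp, mode, altitude)` layout, the `Interval` class, and the deviation measure are assumptions, and altitudes are assumed to be positive.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple

# Hypothetical reading layout: (timestamp, flight mode, altitude).
Reading = Tuple[int, str, float]


@dataclass(frozen=True)
class Interval:
    start: int
    end: int
    max_change: float  # largest relative deviation from the first altitude


def _close(run: List[Reading]) -> Interval:
    """Turn a non-empty run of point events into a single interval event."""
    first_alt = run[0][2]
    change = max(abs(alt - first_alt) / first_alt for _, _, alt in run)
    return Interval(run[0][0], run[-1][0], change)


def cruise_intervals(readings: Iterable[Reading], min_change: float = 0.10) -> Iterator[Interval]:
    """Yield one interval per maximal run of cruising readings that changes enough."""
    run: List[Reading] = []
    for reading in readings:
        if reading[1] == "cruise":
            run.append(reading)
            continue
        if run:  # a non-cruise reading closes the open run
            interval = _close(run)
            if interval.max_change > min_change:
                yield interval
            run = []
    if run:  # flush the last open run when a finite input ends
        interval = _close(run)
        if interval.max_change > min_change:
            yield interval


flight = [
    (0, "climb", 9000.0),
    (1, "cruise", 10000.0),
    (2, "cruise", 10500.0),
    (3, "cruise", 11200.0),  # 12% above the altitude at the start of the run
    (4, "descent", 9000.0),
    (5, "cruise", 10000.0),
    (6, "cruise", 10200.0),  # only 2% above: this run is discarded
]
intervals = list(cruise_intervals(flight))
```

The interval boundaries here are determined by the data (the flight mode) rather than by a fixed window size, which is what distinguishes this kind of interval from a time-based window.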

The literature on Stream Processing contains many examples that acknowledge the limitations of a point-based time model vs an interval-based one. The latter has a richer semantics than the former and can still represent point events without loss of generality [5].
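The claim that intervals subsume points can be illustrated with a small classifier for Allen's interval relations, in which a point event is encoded as an interval whose start equals its end. This is a sketch under stated assumptions rather than the paper's implementation: the `Interval` class, the `point` helper, and the relation names are hypothetical, and the way zero-length intervals are classified at shared endpoints (preferring starts and finishes over meets) is a convention chosen here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: int
    end: int

    def __post_init__(self) -> None:
        if self.start > self.end:
            raise ValueError("start must not exceed end")


def point(ts: int) -> Interval:
    """Encode a point event as a zero-length interval."""
    return Interval(ts, ts)


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the name of the Allen relation that holds from a to b."""
    if a.end < b.start:
        return "before"
    if b.end < a.start:
        return "after"
    if a.start == b.start and a.end == b.end:
        return "equals"
    if a.start == b.start:
        return "starts" if a.end < b.end else "started-by"
    if a.end == b.end:
        return "finishes" if a.start > b.start else "finished-by"
    if a.end == b.start:
        return "meets"
    if b.end == a.start:
        return "met-by"
    if b.start < a.start and a.end < b.end:
        return "during"
    if a.start < b.start and b.end < a.end:
        return "contains"
    # Only partial overlap remains at this point.
    return "overlaps" if a.start < b.start else "overlapped-by"


relation = allen_relation(point(2), Interval(1, 5))
```

For proper intervals (start strictly before end) the thirteen cases are mutually exclusive, so the order of the checks only matters for the zero-length encoding of points.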

In the remainder of the paper, we address the problem of enabling expressive yet horizontally scalable stream processing. To this end, we designed and implemented D2IA (Data-driven Interval Analytics), a novel family of operators that enables interval event generation and reasoning. This paper extends our previous work [6] with the following contributions:

  • The paper provides two implementations of the D2IA operator family, based on alternative design decisions, on a large-scale stream processing system, i.e., Apache Flink;

  • It presents a systematic and comparative evaluation of the alternative implementations using a well-known reference benchmark for stream processing, i.e., Linear Road Benchmark;

  • It discusses the portability of the approach to other BigSPEs and how the trend towards a StreamingSQL can foster expressive yet horizontally scalable stream processing;

  • It improves the general presentation of the paper and includes a more detailed background section.

The remainder of the paper is organized as follows. Necessary background is introduced in Section 2. Concepts behind D2IA are presented in Section 3. Section 4 describes the implementation details, and Section 5 presents the evaluation. Related work is discussed in Section 6, and a discussion on the relation between D2IA and competing alternatives is presented in Section 7. We finally conclude the paper in Section 8.

Section snippets

Background

In this section, we summarize the state-of-the-art on DSMS and CEP presenting the main concepts that are required to understand the content of the paper.

Operators for user-defined intervals analytics

In this section, we present D2IA, a family of operators for analytics and contextual reasoning about events. D2IA allows generating data-driven Interval Events from Raw Events, and reasoning about time intervals using Allen’s Algebra. In particular, D2IA is designed to offer:

  • R.1

    Event Generation, i.e., the operators must enable interval generation via event detection and vice versa.

  • R.2

    Analytical Features, i.e., the operators must enable (a) stateful aggregations, for example employing

Implementation

In this section, we present how we implement D2IA on top of a scalable infrastructure, i.e., Apache Flink, which is a fault-tolerant and scalable distributed stream processing engine. Flink supports stateless and stateful stream processing; it supports several types of windows, queryable state, and, recently, also complex event processing.

Evaluation

In this section, we present the evaluation of the two alternative D2IA implementations, i.e., CEP-based and window-based, against a baseline implementation on top of Esper. We have chosen Esper as the state-of-the-art centralized stream processing engine that provides a declarative language to express queries. We have encoded the different interval specifications as EPL queries. These EPL queries are listed in the Appendix.

Related work

In this section, we present the work related to D2IA considering the state of the art in complex event processing. Table 1 summarizes the comparison of D2IA with related work. It shows that D2IA’s implementations on top of Flink support all operators from Fig. 3, while in the state of the art, only EPL supports all the operators, and only in a centralized scenario. Moreover, in the following we briefly describe other CEP engines that are worth mentioning due to their expressiveness.

TPStream [18]

Discussion

The implementations used for D2IA validation make use of Flink’s low-level streaming APIs, state management, and CEP. Despite being declarative, these APIs are embedded into programming languages and, thus, they are not completely portable to alternative systems, e.g., Spark or Kafka Streams.

Nevertheless, the support for SQL-based specification of data processing pipelines has been gaining attention from distributed data processing systems [21], [22], [23]. A declarative approach to specify

Conclusion

In this paper, we presented a family of operators to specify event intervals over data streams and to reason about their temporal relationships (D2IA). D2IA supports event intervals derived from a single source stream by means of aggregations over timestamped events (homogeneous), and event intervals derived from two or more sources (heterogeneous). In addition to our former work, we extensively evaluated two D2IA implementations based on alternative abstractions on top of Apache Flink. We

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The work of Ahmed Awad, Samuele Langhi, Mahmoud Kamel and Sherif Sakr is funded by the European Regional Development Funds via the Mobilitas Plus programme (grant MOBTT75). Dr. Tommasini and Samuele Langhi acknowledge support from the IT Academy European Social Fund via IT Academy programme.

References (24)

  • Cugola G. et al., Low latency complex event processing on parallel hardware, J. Parallel Distrib. Comput. (2012)
  • Hirzel M. et al., Stream processing languages in the big data era, SIGMOD Rec. (2018)
  • Dindar N., Modeling the execution semantics of stream processing engines with SECRET, VLDB J. (2013)
  • Etzion O. et al., Event Processing in Action (2010)
  • Anicic D. et al., Stream reasoning and complex event processing in ETALIS, Semant. Web (2012)
  • Awad A. et al., D2IA: stream analytics on user-defined event intervals
  • Arasu A. et al., The CQL continuous query language: semantic foundations and query execution, VLDB J. (2006)
  • Grossniklaus M. et al., Frames: Data-driven windows
  • Gedik B., Generic windowing support for extensible stream processing systems, Softw. - Pract. Exp. (2013)
  • Allen J.F., Maintaining knowledge about temporal intervals, Commun. ACM (1983)
  • M.B. Vilain, H.A. Kautz, Constraint propagation algorithms for temporal reasoning, in: Proceedings of the 5th National...
  • Nebel B. et al., Reasoning about temporal relations: A maximal tractable subclass of Allen’s interval algebra, J. ACM (1995)