Abstract:
A variety of data stream problems involving two or more data streams rely on joining them based on a common or similar timing attribute. With the advent of stream processing frameworks such as Apache Spark and Apache Flink in recent years, processing streamed data has become much easier. Repeated processing of relatively small data batches in so-called windows increases flexibility with respect to implementation and task distribution across multiple nodes. Using event times instead of ingestion times avoids, among other problems, incorrect joins. However, in this work we argue that batch processing entails a significant trade-off between increased computational complexity and the latency of the resulting join pairs. We present a concept for time-series joins of streaming data that is built upon a resilient data stream framework and minimizes both computational cost and latency. It uses the guarantees of the underlying framework to join data records deterministically according to event times instead of processing times. This is a work-in-progress paper, as detailed benchmarks are still pending.
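To make the contrasted baseline concrete, the following is a minimal sketch of an event-time, windowed stream-stream join in Spark Structured Streaming, i.e., the kind of batch/window-based join the abstract argues trades computational complexity against latency. It is not the authors' proposed method; the stream sources, column names, key choice, and the 10-second watermark and join bounds are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr

spark = SparkSession.builder.appName("event-time-join-sketch").getOrCreate()

# Two synthetic streams from the built-in "rate" source; each row carries an
# event-time timestamp and a monotonically increasing value used as join key.
left = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
        .select(col("value").alias("leftId"), col("timestamp").alias("leftTime")))
right = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
         .select(col("value").alias("rightId"), col("timestamp").alias("rightTime")))

# Watermarks bound how long join state is retained; the range condition matches
# records whose event times lie within 10 seconds of each other (assumed bound).
joined = (left.withWatermark("leftTime", "10 seconds")
          .join(right.withWatermark("rightTime", "10 seconds"),
                expr("""
                    leftId = rightId AND
                    rightTime BETWEEN leftTime - INTERVAL 10 SECONDS
                                  AND leftTime + INTERVAL 10 SECONDS
                """)))

# Emit joined pairs as they become final under the watermark.
query = joined.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```

Note that joined pairs are only emitted once the watermark has passed, which illustrates the latency cost of window-based, event-time joins that the paper's concept aims to reduce.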
Published in: 2020 25th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA)
Date of Conference: 08-11 September 2020
Date Added to IEEE Xplore: 05 October 2020