DRIVEN: A framework for efficient Data Retrieval and clustering in Vehicular Networks

doi:10.1016/j.future.2020.01.050

Future Generation Computer Systems

Volume 107, June 2020, Pages 1-17

https://doi.org/10.1016/j.future.2020.01.050

Highlights

• Piecewise Linear Approximation compresses data without substantially altering clustering accuracy.
• Compressing data at vehicles reduces bandwidth consumption by up to 90%.
• Streaming-based distributed data compression and clustering incurs small latency.
• Experiments on near-vehicular hardware stress the viability of use in real-world scenarios.

Abstract

The growing interest in data analysis applications for Cyber–Physical Systems stems from the large amounts of data such large distributed systems sense in a continuous fashion. A key research question in this context is how to jointly address the efficiency and effectiveness challenges of such data analysis applications.

DRIVEN proposes a way to jointly address these challenges for a data gathering and distance-based clustering tool in the context of vehicular networks. To cope with the limited communication bandwidth of vehicular networks (compared to the sensed data volume) and the monetary cost of data transmission, DRIVEN avoids gathering raw data from vehicles and instead relies on a streaming-based, error-bounded approximation, Piecewise Linear Approximation (PLA), to compress the volumes of gathered data. A streaming-based approach is also used to cluster the collected data once it is reconstructed from its PLA-approximated form. DRIVEN’s clustering algorithm leverages the inherent ordering of the spatial and temporal data being collected to perform clustering in an online fashion, while data is being retrieved. As we show, based on our prototype implementation using Apache Flink and a thorough evaluation with real-world data such as GPS, LiDAR and other vehicular signals, the accuracy loss for the clustering performed on the gathered approximated data can be small (below 10%), even when the raw data is compressed to 5-35% of its original size. Moreover, the transfer of historical data itself can be completed in as little as one-tenth of the time needed to gather the raw data.

Introduction

Large distributed Cyber–Physical Systems (CPSs) such as vehicular networks [1] are behind many current research threads in computer science. One aspect many of these threads share is rooted in the large amounts of data sensed continuously in such systems. As discussed in the literature, the benefits and possibilities that CPS data enables (e.g. online congestion monitoring, platooning and autonomous driving in the case of vehicular networks) come with many challenges, spanning efficient analysis [2], efficient communication [3], [4], security [5] and privacy [6]. A key aspect in this context is the need for solutions that can jointly address several such challenges [7], since solutions that focus on or excel in only one aspect but fall short in others might be impractical in real-world setups.

When focusing on aspects such as data communication and analysis, a well-known challenge is the imbalance between the amount of data produced by the sensors deployed in such CPSs (a modern vehicle on the road today senses more than 20 GB/h of data [8]) and the infrastructure’s capacity to gather that data at data centers within short time periods [9]. Even when data is not transmitted continuously, but only for a limited time period and for a selection of sensors, the required bandwidth may far exceed the available one (e.g. a single LiDAR sensor of an autonomous car produces around 7 MB/s, cf. Section 4.1). In this case, solutions focusing on efficient data analysis need to account for communication aspects too, so that communication does not become a major bottleneck. The inherent limitations of traditional batch and store-then-process (DB) analysis techniques, which on their own cannot sustain the data rates of relevant applications, thus need to be overcome by taking into account the end-to-end transformation of raw data into valuable insights, specifically by considering which data, and how much of it, is moved through a given analysis pipeline. A complementary challenge is how to take advantage of the high cumulative computational power of CPSs’ edge sensors and devices, since porting a given sequential analysis tool (e.g. clustering) to an efficient parallel and distributed implementation, and deploying it, is not trivial.
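To put the two data rates quoted above on a common scale, the following back-of-the-envelope conversion assumes decimal units (1 GB = 1000 MB); the uplink rate $B$ is a hypothetical symbol introduced here for illustration and does not come from the paper.

$$
20\ \mathrm{GB/h} = \frac{20\,000\ \mathrm{MB}}{3600\ \mathrm{s}} \approx 5.6\ \mathrm{MB/s},
\qquad
7\ \mathrm{MB/s} \times 3600\ \mathrm{s/h} = 25\,200\ \mathrm{MB/h} \approx 25\ \mathrm{GB/h}.
$$

Under a sustained uplink of $B$ MB/s, gathering one hour of raw output from such a LiDAR sensor would take roughly $25\,200 / B$ seconds; for an illustrative $B = 1$ this is $25\,200$ s, i.e. 7 hours of transfer for 1 hour of sensing.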

We present the DRIVEN framework, which copes with the aforementioned challenges for a common problem in vehicular networks’ applications, namely that of gathering and clustering of vehicular data. In a nutshell, the DRIVEN framework jointly addresses the challenges of data gathering, online analysis and leveraging of edge devices’ computational power by:

1. leveraging a lossy compression technique, based on Piecewise Linear Approximation (PLA), that significantly reduces the amounts of data to be gathered from vehicles,
2. leveraging state-of-the-art online clustering techniques such as Lisco [10], which overcome the limitations of batch-based ones, and
3. relying on the data streaming paradigm to transparently achieve distributed and parallel deployments.
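To illustrate the first ingredient, the sketch below shows one simple way to build an error-bounded PLA in a single streaming pass. It is a generic greedy scheme (each segment is anchored at its first point and keeps a range of feasible slopes), not necessarily the PLA method used in DRIVEN; the names `pla_compress`, `pla_reconstruct` and the `Segment` layout are assumptions made for this example.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

# One segment: (t_start, y_start, slope, t_end). Hypothetical layout for this sketch.
Segment = Tuple[float, float, float, float]


def _pick_slope(lo: float, hi: float) -> float:
    """Midpoint of the feasible slope range; 0.0 for a single-point segment."""
    if lo == float("-inf") or hi == float("inf"):
        return 0.0
    return (lo + hi) / 2.0


def pla_compress(points: Iterable[Tuple[float, float]],
                 max_error: float) -> Iterator[Segment]:
    """Greedy one-pass PLA: every input point ends up within max_error
    of the line of the segment covering it.
    Assumes strictly increasing timestamps."""
    it = iter(points)
    first = next(it, None)
    if first is None:
        return
    t0, y0 = first
    lo, hi = float("-inf"), float("inf")
    t_end = t0
    for t, y in it:
        dt = t - t0
        # Slopes that keep the line through (t0, y0) within max_error of (t, y).
        new_lo = max(lo, (y - max_error - y0) / dt)
        new_hi = min(hi, (y + max_error - y0) / dt)
        if new_lo <= new_hi:
            lo, hi, t_end = new_lo, new_hi, t
        else:
            # No single slope fits any more: close the segment, start a new one.
            yield (t0, y0, _pick_slope(lo, hi), t_end)
            t0, y0, t_end = t, y, t
            lo, hi = float("-inf"), float("inf")
    yield (t0, y0, _pick_slope(lo, hi), t_end)


def pla_reconstruct(segments: Iterable[Segment],
                    timestamps: Sequence[float]) -> List[Tuple[float, float]]:
    """Evaluate the segments at the (ordered) original timestamps."""
    out: List[Tuple[float, float]] = []
    seg_iter = iter(segments)
    seg = next(seg_iter, None)
    for t in timestamps:
        while seg is not None and t > seg[3]:
            seg = next(seg_iter, None)
        if seg is None:
            break
        t0, y0, slope, _ = seg
        out.append((t, y0 + slope * (t - t0)))
    return out
```

In this sketch only the segments would need to leave the vehicle, so the transmitted volume depends on how many segments the chosen error bound produces rather than on the number of raw readings.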

As we further elaborate in the remainder, a data analyst interested in gathering and clustering data sensed by a set of vehicles over a given period of time can do so by specifying parameters about (i) the type of data to be gathered, (ii) the maximum error that can be introduced while compressing the data to be retrieved (because of the PLA-based compression) and (iii) the specifications for the clustering of data. The DRIVEN framework then compiles this information into a streaming application that is deployed both at the vehicles providing the data as well as at the analyst’s data center. To support modularity, the framework also allows the analyst to define additional components for the resulting application that can be used to process the data before the latter is clustered.
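The three kinds of parameters listed above could be captured in a small query description such as the one below. This is only an illustrative sketch: the class `RetrievalQuery` and all of its field names are hypothetical and are not part of DRIVEN's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Reading = Tuple[float, ...]  # hypothetical tuple type: (time, measurement, ...)


@dataclass(frozen=True)
class RetrievalQuery:
    """Hypothetical description of one gather-and-cluster request."""
    signals: Tuple[str, ...]        # (i) which data to gather
    max_error: float                # (ii) bound on the PLA approximation error
    cluster_distance: float         # (iii) distance threshold for clustering
    min_cluster_size: int = 1       # (iii) smallest group reported as a cluster
    # Optional analyst-defined steps applied before clustering.
    preprocessors: Tuple[Callable[[Reading], Reading], ...] = ()

    def __post_init__(self) -> None:
        if not self.signals:
            raise ValueError("at least one signal must be requested")
        if self.max_error < 0:
            raise ValueError("max_error must be non-negative")
        if self.cluster_distance <= 0:
            raise ValueError("cluster_distance must be positive")
        if self.min_cluster_size < 1:
            raise ValueError("min_cluster_size must be at least 1")


# Example: gather GPS positions with a loose error bound (values are illustrative).
query = RetrievalQuery(signals=("gps",), max_error=0.5, cluster_distance=10.0)
```

Validating the parameters up front keeps malformed requests from being compiled into a streaming application and shipped to vehicles.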

An extensive literature exists about clustering, its porting to the streaming paradigm and the leveraging of approximation techniques to improve the clustering process along certain criteria, as we discuss in Section 6. In this context, our contribution aims neither at surveying all existing solutions nor at comparing them. Rather, it provides evidence that a streaming application that can (i) jointly leverage the computational power of both edge and central components of a CPS and (ii) allow for partial data loss when gathering information can offer a healthy tradeoff between data reduction and pipeline speed on the one hand and accuracy loss on the other. This holds despite such an application requiring more data processing components (e.g. to compress and decompress the data gathered from the vehicles) than a centralized counterpart, which needs all the raw data to be gathered. As we show in our empirical evaluation, based on a prototype implementation using Apache Flink and recently proposed streaming-based PLA and clustering methods, and on four real-world use cases, DRIVEN is able to reduce the duration of data transmission by up to 90% while incurring a bounded loss in clustering quality.

The rest of the paper is organized as follows. We introduce preliminary concepts in Section 2 and the considered system model and problem statement in Section 3. We then present the DRIVEN framework in Section 4 and our evaluation in Section 5. Finally, we discuss related work in Section 6 and conclude the paper in Section 7.

Preliminaries

We begin this section by discussing preliminary concepts about data streaming, PLA, distance-based clustering and logical latency.

System model and problem statement

We consider systems consisting of a set of many vehicles and one analysis center, in which data analysts are interested in gathering data from these vehicles and, subsequently, clustering that data at the analysis center. Each vehicle $V_{i}$ is equipped with an embedded device that provides limited computational capacity; $V_{i}$ also mounts a set of sensors, each producing a stream of tuples composed of attributes $\langle y^{0}, \dots, y^{k} \rangle$, i.e., the physical or logical time of each reading and the measurements

Overview of the DRIVEN framework

In this section, we present an overview of DRIVEN. To facilitate the presentation, we first introduce a use case that serves as a running example in our discussion (we later evaluate it, together with others, in Section 5).

As discussed in Section 3, each query run by DRIVEN is a streaming continuous query deployed at both the vehicles and the analysis center, with dedicated operators for efficient data retrieval and clustering.
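To give a feel for what a distance-based clustering operator on the analysis-center side might look like, here is a minimal sketch. It is not Lisco or DRIVEN's operator: it simply links each arriving point to earlier points within a distance threshold, and it assumes that, thanks to the ordering of the data, only the last `window` points need to be checked. The names `stream_cluster`, `max_dist` and `window` are hypothetical.

```python
import math
from collections import deque
from typing import Deque, Iterable, List, Sequence, Tuple


class UnionFind:
    """Minimal disjoint-set structure used to merge clusters online."""

    def __init__(self) -> None:
        self.parent: List[int] = []

    def add(self) -> int:
        self.parent.append(len(self.parent))
        return len(self.parent) - 1

    def find(self, i: int) -> int:
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def stream_cluster(points: Iterable[Sequence[float]],
                   max_dist: float,
                   window: int) -> List[int]:
    """Assign a cluster label to each point, processing points in arrival order.
    Two points share a label if they are connected by a chain of points whose
    consecutive distances are at most max_dist (checked within the window)."""
    uf = UnionFind()
    recent: Deque[Tuple[int, Sequence[float]]] = deque(maxlen=window)
    for p in points:
        idx = uf.add()
        for j, q in recent:
            if math.dist(p, q) <= max_dist:
                uf.union(idx, j)
        recent.append((idx, p))
    return [uf.find(i) for i in range(len(uf.parent))]
```

Because each point is compared with at most `window` earlier points, the work per tuple is bounded, which is what makes this style of clustering suitable for running while data is still arriving.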

Evaluation

Here, we evaluate the tradeoffs in compression, approximation error, retrieval time and clustering quality for DRIVEN. We first present the datasets used and the software and hardware setup, and then discuss four different use cases in which historical data is gathered and clustered. Finally, we gauge PLA’s compression performance by comparing it to a lossless, general-purpose compression technique (ZIP) and discuss the concept of inherent logical latency of PLA compression to investigate the

Related work

Clustering, as a core problem in data mining, has been extensively studied in the last decades (see e.g. the survey [43] and the references therein). The two main trends in clustering algorithms differ on what should be considered as a cluster, either privileging well-balanced ball-like clusters (as in the widely-studied $k$-means approach) or rather focusing on local density leading to arbitrarily shaped clusters (e.g. DBSCAN [36]-style). Other features that can distinguish existing clustering

Conclusion and future work

We have presented the DRIVEN framework for data retrieval and clustering in vehicular networks. The framework, implemented in a state-of-the-art SPE, simultaneously provides an efficient way of gathering data and of clustering that data based on an analyst’s queries. Information retrieval is achieved using PLA for compressing the input stream in a streaming fashion. Once uncompressed, the approximated stream is fed to a distance-based streaming clustering algorithm. Both the

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

Work supported by VINNOVA, Sweden, the Swedish Government Agency for Innovation Systems, proj. “Onboard/Offboard Distributed Data Analytics (OODIDA)” in the funding program FFI: Strategic Vehicle Research and Innovation (DNR 2016-04260); the Swedish Foundation for Strategic Research, Sweden, proj. “Future factories in the cloud (FiC)” (grant GMT14-0032), and the Swedish Research Council (Vetenskapsrådet), Sweden, proj. “HARE: Self-deploying and Adaptive Data Streaming Analytics in Fog

Bastian Havers received his BSc in Physics from the RWTH Aachen University and his MSc in Physics from Bonn University in Germany. He is currently a PhD student in the Networks and Systems Division at Chalmers University of Technology and researcher at Volvo Cars in Sweden. His research focuses on data stream processing, cyber–physical systems and data analysis.

References (49)

Arasu, Arvind et al. Linear road: A stream data management benchmark
Tatbul, Nesime et al. Load shedding in a data stream manager
Jain, Anil K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. (2010)
Yousefi, Saleh et al. Vehicular ad hoc networks (VANETs): challenges and perspectives
Palyvos-Giannas, Dimitris et al. GeneaLog: Fine-grained data streaming provenance at the edge
Zhou, Jiazhen et al. Scalable distributed communication architectures to support advanced metering infrastructure in smart grid. IEEE Trans. Parallel Distrib. Syst. (2012)
Keramatian, Amir et al. MAD-C: Multi-stage approximate distributed cluster-combining for obstacle detection and localization
van Rooij, Joris et al. LoCoVolt: Distributed detection of broken meters in smart grids through stream processing
Gulisano, Vincenzo et al. Bes: Differentially private and distributed event aggregation in advanced metering infrastructures
Gulisano, Vincenzo et al. Metis: a two-tier intrusion detection system for advanced metering infrastructures
Coppola, Riccardo et al. Connected car: technologies, issues, future trends. ACM Comput. Surv. (2016)
Duvignau, Romaric et al. Querying large vehicular networks: how to balance on-board workload and queries response time?
Najdataei, Hannaneh et al. Continuous and parallel LiDAR point-cloud clustering
Stonebraker, Michael et al. The 8 requirements of real-time stream processing. ACM Sigmod Rec. (2005)
Carbone, Paris et al. Apache flink: Stream and batch processing in a single engine. Bull. IEEE Comput. Soc. Tech. Comm. Data Eng. (2015)
Gulisano, Vincenzo et al. Efficient data streaming multiway aggregation through concurrent algorithmic designs and new abstract data types. ACM Trans. Parallel Comput. (2017)
Costache, Stefania et al. Understanding the data-processing challenges in intelligent vehicular systems
Ottenwälder, Beate et al. MCEP: a mobility-aware complex event processing system. ACM Trans. Internet Technol. (2014)
Balazinska, Magdalena et al. Fault-tolerance in the Borealis distributed stream processing system
Kalyvianaki, Evangelia et al. THEMIS: Fairness in federated stream processing under overload
Ji, Yuanzhen et al. Quality-driven continuous query execution over out-of-order data streams
Zacheilas, Nikos et al. Maximizing determinism in stream processing under latency constraints
Babcock, Brian et al. Chain: Operator scheduling for memory minimization in data stream systems
Babcock, B. et al. Load shedding for aggregation queries over data streams


Romaric Duvignau received his BSc and MSc degrees in Computer Science from the University of Bordeaux in France. From the same university, he received his PhD degree in Computer Science in 2015. From 2016 to 2017 he was a research and teaching fellow at the University of Bordeaux and at Aix-Marseille University and LIF in Marseille, France. He is currently a postdoctoral researcher in the Networks and Systems Division at Chalmers University of Technology. His research focuses on data stream processing, cyber–physical systems and continuous distributed monitoring.

Hannaneh Najdataei received her BSc degree in Software Engineering and her MSc degree in Artificial Intelligence from Shiraz University in Iran. She is currently a PhD student in the Networks and Systems Division at Chalmers University of Technology. Her research focuses on parallel programming, data stream processing and cyber–physical systems.

Vincenzo Gulisano is Associate Professor in the Networks and Systems Division at Chalmers University of Technology. His research focuses on data processing and distributed / parallel / elastic and fault-tolerant data streaming. Dr. Vincenzo Gulisano holds a PhD in Computer Science from the Polytechnic University of Madrid, Spain.

Marina Papatriantafilou is Associate Professor in the Networks and Systems Division at Chalmers University of Technology. Earlier, she was with the Max-Planck Institute for Computer Science, Saarbrücken and CWI, Amsterdam. She received her PhD degree from the Computer Science and Informatics Dept., Patras University and is a member of Network of National Contacts ACM-WE NeNaC. Her research interests include: efficient and robust parallel, distributed, stream processing and applications in multiprocessor, multicore and distributed, cyberphysical systems; synchronization, consistency, fault tolerance.

Ashok Chaitanya Koppisetty is a researcher at Volvo Cars in Sweden. He received his BSc in Chemical Engineering and his MSc in Bioinformatics from Chalmers University. From the same university, he received his PhD in Computer Science. His research focuses on Machine Learning and Distributed Data Processing.

☆ Preliminary results have been presented at the 35th IEEE International Conference on Data Engineering (ICDE 2019) [21].
