DRIVEN: A framework for efficient Data Retrieval and clustering in Vehicular Networks☆
Introduction
Large distributed Cyber–Physical Systems (CPSs) such as vehicular networks [1] drive many current research threads in computer science. Many of these threads share a common root: the large amounts of data sensed continuously in such systems. As discussed in the literature, the benefits and possibilities that CPSs’ data enables (e.g. online congestion monitoring, platooning and autonomous driving in the case of vehicular networks) come with many challenges, spanning efficient analysis [2], efficient communication [3], [4], security [5] and privacy [6]. A key aspect in this context is the need for solutions that jointly address several such challenges [7], since solutions that excel in only one aspect but fall short in others might be impractical in real-world setups.
When focusing on data communication and analysis, a well-known challenge is the imbalance between the amount of data produced by the sensors deployed in such CPSs (a modern vehicle on the road today senses more than 20 GB/h of data [8]) and the infrastructure’s capacity to gather that data at data centers within short time periods [9]. Even when data is not transmitted continuously, but only for a limited time period and for a selection of sensors, the required bandwidth may far exceed the available one (e.g. a single LiDAR sensor of an autonomous car produces around 7 MB/s, cf. Section 4.1). In this case, solutions focusing on efficient data analysis need to account for communication aspects too, so that communication does not become a major bottleneck. The inherent limitations of traditional batch and store-then-process (DB) analysis techniques, which on their own cannot sustain the data rates of relevant applications, thus need to be overcome by taking into account the end-to-end transformation of raw data into valuable insights, specifically by considering which data, as well as how much data, is moved through a given analysis pipeline. A complementary challenge is therefore how to take advantage of the high cumulative computational power of CPSs’ edge sensors and devices, since porting a given sequential analysis tool (e.g. clustering) to an efficient parallel and distributed implementation, and deploying it, is not trivial.
We present the DRIVEN framework, which copes with the aforementioned challenges for a common problem in vehicular networks’ applications, namely that of gathering and clustering of vehicular data. In a nutshell, the DRIVEN framework jointly addresses the challenges of data gathering, online analysis and leveraging of edge devices’ computational power by:
1. leveraging a lossy compression technique, based on Piecewise Linear Approximation (PLA), that significantly reduces the amounts of data to be gathered from vehicles,
2. leveraging state-of-the-art online clustering techniques such as Lisco [10], which overcome the limitations of batch-based ones, and
3. relying on the data streaming paradigm to transparently achieve distributed and parallel deployments.
As we further elaborate in the remainder of the paper, a data analyst interested in gathering and clustering data sensed by a set of vehicles over a given period of time can do so by specifying (i) the type of data to be gathered, (ii) the maximum error that the PLA-based compression may introduce into the retrieved data and (iii) the specifications for the clustering of data. The DRIVEN framework then compiles this information into a streaming application that is deployed both at the vehicles providing the data and at the analyst’s data center. To support modularity, the framework also allows the analyst to define additional components for the resulting application that process the data before it is clustered.
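To make the role of the maximum-error parameter concrete, the following is a minimal sketch of an error-bounded PLA compressor. It is a simple greedy variant written for illustration and is not claimed to be the PLA method used in DRIVEN; the names `pla_compress`, `pla_decompress` and `max_error` are hypothetical, and timestamps are assumed to be strictly increasing.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (timestamp, value); hypothetical tuple layout


def _fits(points: Sequence[Point], i: int, j: int, max_error: float) -> bool:
    """True if the line through points[i] and points[j] stays within
    max_error of every point strictly between them."""
    (t0, v0), (t1, v1) = points[i], points[j]
    slope = (v1 - v0) / (t1 - t0)
    return all(abs(v0 + slope * (t - t0) - v) <= max_error
               for t, v in points[i + 1:j])


def pla_compress(points: Sequence[Point], max_error: float) -> List[Point]:
    """Greedy sketch: keep only segment endpoints, extending each segment
    for as long as all skipped points stay within max_error of it."""
    if len(points) <= 2:
        return list(points)
    kept = [points[0]]
    start, end = 0, 1
    while end < len(points) - 1:
        if _fits(points, start, end + 1, max_error):
            end += 1                  # extend the current segment
        else:
            kept.append(points[end])  # close the segment at the last valid end
            start, end = end, end + 1
    kept.append(points[-1])
    return kept


def pla_decompress(kept: Sequence[Point],
                   timestamps: Sequence[float]) -> List[Point]:
    """Rebuild approximate values at the given (ascending) timestamps by
    linear interpolation between consecutive kept points."""
    if len(kept) < 2:
        return [(t, kept[0][1]) for t in timestamps]
    out, seg = [], 0
    for t in timestamps:
        while seg < len(kept) - 2 and t > kept[seg + 1][0]:
            seg += 1
        (t0, v0), (t1, v1) = kept[seg], kept[seg + 1]
        out.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
    return out
```

In this sketch only the kept endpoints would be transmitted from the vehicle, and the receiving side would call `pla_decompress` to obtain an approximation whose per-point error does not exceed `max_error`. A larger bound yields fewer kept points at the cost of accuracy, which is the tradeoff the analyst controls through parameter (ii).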
An extensive literature exists about clustering, its porting to the streaming paradigm and the leveraging of approximation techniques to improve the clustering process according to certain criteria, as we discuss in Section 6. In this context, our contribution aims neither at surveying all existing solutions nor at comparing them. Rather, it provides evidence that a streaming application that can (i) jointly leverage the computational power of both edge and central components of a CPS and (ii) allow for partial data loss when gathering information can provide a healthy tradeoff between data reduction and pipeline speed on the one hand and accuracy loss on the other. This holds even though such an application requires more data processing components (e.g. to compress and decompress the data gathered from the vehicles) than a centralized counterpart (which needs all the raw data to be gathered). As we show in our empirical evaluation, based on a prototype implementation using Apache Flink, recently proposed streaming-based PLA and clustering methods, and four real-world use cases, DRIVEN is able to reduce the duration of data transmission by up to 90% while incurring a bounded loss in clustering quality.
The rest of the paper is organized as follows. We introduce preliminary concepts in Section 2 and the considered system model and problem statement in Section 3. We then present the DRIVEN framework in Section 4 and our evaluation in Section 5. Finally, we discuss related work in Section 6 and conclude the paper in Section 7.
Preliminaries
We begin this section by discussing preliminary concepts about data streaming, PLA, distance-based clustering and logical latency.
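As a small illustration of distance-based clustering, the sketch below uses one common reading of the term, assumed here rather than taken from the paper: two points belong to the same cluster if they are linked by a chain of points whose consecutive Euclidean distances do not exceed a threshold. It is a batch, quadratic-time sketch and not the streaming algorithm (Lisco) used by DRIVEN; the names `distance_based_clusters`, `eps` and `min_pts` are hypothetical.

```python
from math import dist
from typing import Dict, List, Sequence


def distance_based_clusters(points: Sequence[Sequence[float]],
                            eps: float,
                            min_pts: int = 1) -> List[List[int]]:
    """Return clusters as lists of point indices. Points closer than eps are
    merged with a union-find structure; groups smaller than min_pts are
    treated as noise and dropped."""
    parent = list(range(len(points)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= eps:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri  # merge the two groups

    groups: Dict[int, List[int]] = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted((g for g in groups.values() if len(g) >= min_pts),
                  key=lambda g: g[0])
```

Unlike a k-means-style method, this formulation does not fix the number of clusters in advance and can follow arbitrarily shaped groups, at the price of comparing every pair of points in this naive form.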
System model and problem statement
We consider systems consisting of a set of vehicles and one analysis center, in which data analysts are interested in gathering data from the vehicles and, subsequently, clustering that data at the analysis center. Each vehicle is equipped with an embedded device that provides limited computational capacity; it also mounts a set of sensors, each producing a stream of tuples composed of attributes, i.e., the physical or logical time of each reading and the measurements
Overview of the DRIVEN framework
In this section, we present an overview of DRIVEN. To facilitate the presentation, we first introduce a use case that serves as a running example in our discussion (we later evaluate it, together with others, in Section 5).
As discussed in Section 3, each query run by DRIVEN is a streaming continuous query deployed at both the vehicles and the analysis center, with dedicated operators for efficient data retrieval and clustering.
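The split of one query into a vehicle-side and a center-side chain of operators can be sketched with plain Python generators standing in for stream operators. Everything below is an illustrative assumption rather than DRIVEN's actual API: the `Query` fields, the tuple layout, the deadband filter (a simple stand-in for lossy compression, not PLA, which also needs no decompression step) and the one-dimensional gap-based grouping (a batch stand-in for a streaming clustering operator).

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple

Reading = Tuple[str, float, float]  # (sensor name, timestamp, value); hypothetical
Sample = Tuple[float, float]        # (timestamp, value)


@dataclass(frozen=True)
class Query:
    """Hypothetical analyst query; field names are illustrative only."""
    sensor: str         # which sensor stream to gather
    max_error: float    # error bound handed to the compression operator
    cluster_gap: float  # distance threshold handed to the clustering operator


def select(readings: Iterable[Reading], sensor: str) -> Iterator[Sample]:
    """Vehicle-side operator: keep only the requested sensor's tuples."""
    for name, ts, value in readings:
        if name == sensor:
            yield ts, value


def deadband(samples: Iterable[Sample], max_error: float) -> Iterator[Sample]:
    """Vehicle-side stand-in for lossy compression: forward a sample only if
    it deviates from the last forwarded value by more than max_error."""
    last = None
    for ts, value in samples:
        if last is None or abs(value - last) > max_error:
            last = value
            yield ts, value


def cluster_1d(samples: Iterable[Sample], gap: float) -> List[List[float]]:
    """Center-side stand-in for clustering: group sorted values whose
    neighbouring distance is at most gap."""
    clusters: List[List[float]] = []
    for value in sorted(v for _, v in samples):
        if clusters and value - clusters[-1][-1] <= gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


def vehicle_part(readings: Iterable[Reading], query: Query) -> Iterator[Sample]:
    """Operators that would run on the vehicle's embedded device."""
    return deadband(select(readings, query.sensor), query.max_error)


def center_part(received: Iterable[Sample], query: Query) -> List[List[float]]:
    """Operators that would run at the analysis center."""
    return cluster_1d(received, query.cluster_gap)
```

Only the output of `vehicle_part` would cross the network in this sketch, so the data reduction happens before transmission while the clustering itself remains at the analysis center.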
Evaluation
Here we evaluate the tradeoffs among compression, approximation error, retrieval time and clustering quality for DRIVEN. We first present the datasets and the software and hardware setup, and then discuss four different use cases in which historic data is gathered and clustered. Finally, we gauge PLA’s compression performance by comparing it to a lossless, general-purpose compression technique (ZIP) and discuss the concept of inherent logical latency of PLA compression to investigate the
Related work
Clustering, as a core problem in data mining, has been extensively studied in the last decades (see e.g. the survey [43] and the references therein). The two main trends in clustering algorithms differ on what should be considered a cluster, either privileging well-balanced ball-like clusters (as in the widely studied k-means approach) or rather focusing on local density, leading to arbitrarily shaped clusters (e.g. DBSCAN [36]-style). Other features that can distinguish existing clustering
Conclusion and future work
We have presented here the DRIVEN framework for data retrieval and clustering in vehicular networks. The framework, implemented in a state-of-the-art SPE, provides simultaneously an efficient way for gathering data and performing clustering on said data based on an analyst’s queries. Information retrieval is achieved using PLA for compressing the input stream in a streaming fashion. Once uncompressed, the approximated stream is fed to a distance-based streaming clustering algorithm. Both the
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
Work supported by VINNOVA, Sweden, the Swedish Government Agency for Innovation Systems, proj. “Onboard/Offboard Distributed Data Analytics (OODIDA)” in the funding program FFI: Strategic Vehicle Research and Innovation (DNR 2016-04260); the Swedish Foundation for Strategic Research, Sweden, proj. “Future factories in the cloud (FiC)” (grant GMT14-0032), and the Swedish Research Council (Vetenskapsrådet), Sweden, proj. “HARE: Self-deploying and Adaptive Data Streaming Analytics in Fog
Bastian Havers received his BSc in Physics from the RWTH Aachen University and his MSc in Physics from Bonn University in Germany. He is currently a PhD student in the Networks and Systems Division at Chalmers University of Technology and researcher at Volvo Cars in Sweden. His research focuses on data stream processing, cyber–physical systems and data analysis.
References (49)
- et al. Linear road: A stream data management benchmark
- et al. Load shedding in a data stream manager
- Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. (2010)
- et al. Vehicular ad hoc networks (VANETs): challenges and perspectives
- et al. GeneaLog: Fine-grained data streaming provenance at the edge
- et al. Scalable distributed communication architectures to support advanced metering infrastructure in smart grid. IEEE Trans. Parallel Distrib. Syst. (2012)
- et al. MAD-C: Multi-stage approximate distributed cluster-combining for obstacle detection and localization
- et al. LoCoVolt: Distributed detection of broken meters in smart grids through stream processing
- et al. Bes: Differentially private and distributed event aggregation in advanced metering infrastructures
- et al. Metis: a two-tier intrusion detection system for advanced metering infrastructures
- Connected car: technologies, issues, future trends. ACM Comput. Surv.
- Querying large vehicular networks: how to balance on-board workload and queries response time?
- Continuous and parallel LiDAR point-cloud clustering
- The 8 requirements of real-time stream processing. ACM Sigmod Rec.
- Apache flink: Stream and batch processing in a single engine. Bull. IEEE Comput. Soc. Tech. Comm. Data Eng.
- Efficient data streaming multiway aggregation through concurrent algorithmic designs and new abstract data types. ACM Trans. Parallel Comput.
- Understanding the data-processing challenges in intelligent vehicular systems
- MCEP: a mobility-aware complex event processing system. ACM Trans. Internet Technol.
- Fault-tolerance in the Borealis distributed stream processing system
- THEMIS: Fairness in federated stream processing under overload
- Quality-driven continuous query execution over out-of-order data streams
- Maximizing determinism in stream processing under latency constraints
- Chain: Operator scheduling for memory minimization in data stream systems
- Load shedding for aggregation queries over data streams
Romaric Duvignau received his BSc and MSc degrees in Computer Science from the University of Bordeaux in France. From the same university, he received his PhD degree in Computer Science in 2015. In 2016–2017 he was a research and teaching fellow at the University of Bordeaux and at Aix-Marseille University and LIF in Marseille, France. He is currently a postdoctoral researcher in the Networks and Systems Division at Chalmers University of Technology. His research focuses on data stream processing, cyber–physical systems and continuous distributed monitoring.
Hannaneh Najdataei received her BSc degree in Software Engineering and her MSc degree in Artificial Intelligence from Shiraz University in Iran. She is currently a PhD student in the Networks and Systems Division at Chalmers University of Technology. Her research focuses on parallel programming, data stream processing and cyber–physical systems.
Vincenzo Gulisano is Associate Professor in the Networks and Systems Division at Chalmers University of Technology. His research focuses on data processing and distributed / parallel / elastic and fault-tolerant data streaming. Dr. Vincenzo Gulisano holds a PhD in Computer Science from the Polytechnic University of Madrid, Spain.
Marina Papatriantafilou is Associate Professor in the Networks and Systems Division at Chalmers University of Technology. Earlier, she was with the Max-Planck Institute for Computer Science, Saarbrücken and CWI, Amsterdam. She received her PhD degree from the Computer Science and Informatics Dept., Patras University and is a member of Network of National Contacts ACM-WE NeNaC. Her research interests include: efficient and robust parallel, distributed, stream processing and applications in multiprocessor, multicore and distributed, cyberphysical systems; synchronization, consistency, fault tolerance.
Ashok Chaitanya Koppisetty is a researcher at Volvo Cars in Sweden. He received his BSc in Chemical Engineering and his MSc in Bioinformatics from Chalmers University. From the same university, he received his PhD in Computer Science. His research focuses on Machine Learning and Distributed Data Processing.
☆ Preliminary results have been presented at the 35th IEEE International Conference on Data Engineering (ICDE 2019) [21].