FluteDB: An efficient and scalable in-memory time series database for sensor-cloud

doi:10.1016/j.jpdc.2018.07.021

Journal of Parallel and Distributed Computing

Volume 122, December 2018, Pages 95-108


Highlights

• FluteDB is an efficient and scalable time series database for sensor-cloud.
• The index in FluteDB is equipped with flexible storage techniques for time series data.
• FluteDB improves efficiency by adjusting disk accesses according to data temperature.
• FluteDB optimizes its data encapsulation and fault tolerance strategies.

Abstract

With the widespread use of large-scale sensor networks, vast amounts of time series data are generated and must be processed. Traditional databases, however, show storage limitations when handling such large data streams in the cloud, and their reliability and availability are difficult to guarantee. To address this problem, this paper proposes FluteDB, an efficient and scalable in-memory time series database for sensor-cloud. We analyze the unique characteristics of time series data and its related operations to strike the right balance among efficiency, scalability, resource consumption, reliability and availability. Specifically, based on an aggregate analysis of the root causes of current time series problems, FluteDB optimizes the strategies for key operations in memory and physical storage in a targeted way, at the expense of an acceptable partial loss of data precision and consistency. FluteDB’s enhanced strategies primarily comprise the Triggered Time Series Merge Tree (TTSM Tree), time-series-enhanced cache management, and compression algorithms for different data types. Validation of each sub-module demonstrates that our improved strategies significantly outperform existing methods in a real time series environment. Global experimental results also show that the integrated FluteDB reduces query latency by 17x, improves write rate by 98x and saves about 47% of storage resources. The average available service time and the recovery rate and degree of FluteDB are competitive with state-of-the-art reliability and availability strategies under real and simulated faults, which demonstrates that FluteDB can provide highly stable large-scale cloud data services.

Introduction

With the emergence and spread of the Internet of Things (IoT), Wireless Sensor Networks (WSN), Smart Cities (SC) and other Internet hot spots, massive time series data is generated continuously and awaits further processing [15], [18], [21], [22]. Many mature mining algorithms have achieved satisfactory results in practical fields such as Real-Time Vehicle Traffic (RTVT) and Smart Grid (SG) by analyzing and extracting specific features from this data. As the basis of such downstream algorithms, the efficient and stable storage and querying of time series data has attracted wide attention, especially in cloud services [6], [13], [23].

Owing to the increasing demand for source data and real-time computing in cloud services, multimedia data sampling devices are now widely available and their sampling intervals have shortened considerably, which directly leads to rapid growth in the scale of time series data. In addition, providing highly reliable and available time series services requires preparation for fault tolerance and data recovery. Some existing systems specialized for time series, collectively referred to as Time Series Databases (TSDBs), have achieved satisfactory results in real cloud environments [11], [15], [18], [21], [22]. Some of them are modified versions of classic databases, while the rest are designed specifically for time series. To further meet the demands of time series workloads and improve the efficiency, reliability and availability of every sub-module and of the entire system, we next specify several detailed and strict constraints.

Write dominance. The primary requirement for a TSDB is to keep its services stable at an ultra-high write rate [15]. In real cloud applications, time series services often have to cope with millions of write requests per second. Existing databases generally optimize their indexing strategy and data persistence, and aggregate cluster performance, to satisfy this demand. Enhancing write performance further requires optimizing these strategies and exploring more efficient approaches.

Query management. Because most downstream consumers of query operations are periodic monitoring or management systems, the query rate is usually a couple of orders of magnitude lower than the write rate [5], [21]. However, efficient querying over such large-scale time series data remains very difficult. Existing methods attempt to solve this problem by using different storage media and indexing structures for cold and hot data, which brings common side effects (e.g. data redundancy in memory, excessive latency for worst-case queries).

Resource control. Although hardware prices are generally declining, the efficiency with which cloud resources (including memory and disk) are used remains an important indicator for evaluating storage services [2], [4], [16]. The scale of time series data keeps growing. Facebook, for example, produces about 10T of logs, texts and other streaming data every day. Storing such massive data directly is bound to consume a lot of storage resources. Once version control, single-writer, append-only and redundancy strategies are taken into account, the resource utilization of the entire system becomes much lower. This directly reduces both the retention time of persistent historical data (cyclic writer) and the amount of data that can be cached in memory. Therefore, a set of specific data compression algorithms is essential, even at the expense of a partial, acceptable loss of data precision.

High reliability and availability. Reliability and availability describe, respectively, the ability to provide cloud services normally and the ability to tolerate faults [15], [22]. Existing cluster management approaches can continue to provide services when some nodes fail, but recovering quickly from a disaster or fault and resuming service is the key to ensuring the reliability of time series cloud services.

This paper proposes FluteDB (as an extension of [1]), a novel time series database for sensor-cloud (as shown in Fig. 1), which aims to satisfy the constraints above and provide efficient, scalable and stable cloud services. It enhances all of the database’s sub-modules based on an aggregate analysis of time series data and its related operations. FluteDB also readjusts the communication and data exchange modes among memory, hard drives and other resources to keep the database running more efficiently. All considerations and designs in FluteDB take linear scalability fully into account so that it can scale up as needed. Specifically, it introduces the following novel methods.

Since time series services consist mainly of write operations, with only a few point queries and region queries, a key insight behind FluteDB is to bias the overall architecture toward optimizing writes, that is, to gain higher write performance at the expense of some query performance or an acceptable loss of data precision [18], [21]. To achieve this goal, FluteDB reorganizes the indexing structure and further optimizes the configuration and use of different types of resources. Specifically, drawing on the temporal characteristics of time series, FluteDB proposes a new TTSM tree algorithm based on the Log-Structured Merge Tree (LSM Tree) [7], [12], [14]. The TTSM tree divides the index into two parts based on the distribution of data values. Hot data (newly inserted) is stored in a special tree structure in memory, while cold data is stored in a specific B* tree structure on physical disks. Periodically, part of the in-memory index is persisted to disk. Since time series data is strictly ordered, FluteDB designs a more flexible triggering mechanism that decides when to perform these operations, taking the complexity of merging tree structures into account. Moreover, FluteDB enhances the storage structures for different storage media to meet different operational requirements.
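The hot/cold split and triggered persistence described above can be illustrated with a short Python sketch. It is a generic LSM-style illustration under stated assumptions, not the TTSM tree itself: the class name TwoTierIndex, the two triggers (max_hot_points and max_hot_span) and the use of plain in-memory lists in place of the on-disk B* tree are all hypothetical choices made for the example.

```python
import bisect
from typing import List, Optional, Tuple

Point = Tuple[int, float]  # (timestamp, value)


class TwoTierIndex:
    """Hypothetical two-tier index: hot points in memory, cold points in flushed segments."""

    def __init__(self, max_hot_points: int = 4, max_hot_span: int = 60) -> None:
        self.max_hot_points = max_hot_points  # assumed size trigger
        self.max_hot_span = max_hot_span      # assumed time-span trigger
        self.hot: List[Point] = []            # kept sorted by timestamp
        # Stand-in for on-disk structures; a real system would write files.
        self.cold_segments: List[List[Point]] = []

    def insert(self, ts: int, value: float) -> None:
        # Time series mostly arrive in order, so this is usually an append.
        bisect.insort(self.hot, (ts, value))
        if self._should_flush():
            self.flush()

    def _should_flush(self) -> bool:
        # Flush when the hot tier holds too many points or covers too long a span.
        if len(self.hot) >= self.max_hot_points:
            return True
        return self.hot[-1][0] - self.hot[0][0] >= self.max_hot_span

    def flush(self) -> None:
        # Move the whole hot tier into a new immutable cold segment.
        if self.hot:
            self.cold_segments.append(self.hot)
            self.hot = []

    def get(self, ts: int) -> Optional[float]:
        # Search the hot tier first, then cold segments from newest to oldest.
        for segment in [self.hot] + self.cold_segments[::-1]:
            i = bisect.bisect_left(segment, (ts, float("-inf")))
            if i < len(segment) and segment[i][0] == ts:
                return segment[i][1]
        return None
```

In this sketch a write never touches a cold segment, which is the write-biased trade-off the paragraph describes; the cost is that a point query may have to visit several segments.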

Although query operations account for a small percentage of the workload, the efficiency of data services for upstream tasks must still be guaranteed. Therefore, as long as the write rate is not affected, actual query demands can be met indirectly by optimizing the hit rate of both the cache and the in-memory index. FluteDB updates the original cache replacement algorithm and its implementation, based on time series characteristics, so that high-value data stays in memory as long as possible. The vast majority of query operations can then be handled in memory, which significantly improves overall query efficiency.
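One way to picture a cache replacement policy driven by time series characteristics is a cache that evicts the block covering the oldest time range instead of the least recently accessed one. The Python sketch below assumes exactly that policy; it is not FluteDB's cache algorithm, and the names TimeAwareCache, block_start and hit_rate are hypothetical.

```python
from typing import Dict, List, Optional


class TimeAwareCache:
    """Hypothetical cache that evicts the block whose data is oldest."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Maps a block's starting timestamp to the values in that block.
        self.blocks: Dict[int, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def put(self, block_start: int, values: List[float]) -> None:
        self.blocks[block_start] = values
        # Assumption: newer data is more valuable, so drop the oldest block first.
        while len(self.blocks) > self.capacity:
            oldest = min(self.blocks)
            del self.blocks[oldest]

    def get(self, block_start: int) -> Optional[List[float]]:
        values = self.blocks.get(block_start)
        if values is None:
            self.misses += 1
        else:
            self.hits += 1
        return values

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Under this assumed policy, a block of old data loaded into a full cache is evicted immediately, so historical scans cannot push recent data out of memory.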

FluteDB also presents specific compression algorithms for different data types (Integer, Float, Double, String, etc.) to reduce resource consumption. To maximize the compression ratio while keeping compression and decompression complexity acceptable, we adopt a more flexible compression concept and encoding format, and sacrifice only part of the data encoding precision. In addition, FluteDB integrates a complete set of reliability strategies for all sub-modules and the entire system, enabling it to recover and continue to provide services in the face of power outages, network outages or other faults and disasters.
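To make the idea of type-specific compression with bounded precision loss concrete, the following Python sketch shows two common techniques: lossless delta encoding for integer timestamps and lossy quantization for floating-point values. These are stand-in techniques chosen for illustration, not the algorithms used in FluteDB, and the function names and the step parameter are assumptions.

```python
from typing import List


def delta_encode(timestamps: List[int]) -> List[int]:
    """Store the first timestamp, then only the differences between neighbours."""
    if not timestamps:
        return []
    deltas = [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the original timestamps by accumulating the differences."""
    out: List[int] = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def quantize(values: List[float], step: float) -> List[int]:
    """Lossy: map each value to the nearest multiple of `step` (assumed precision)."""
    return [round(v / step) for v in values]


def dequantize(codes: List[int], step: float) -> List[float]:
    """Recover approximate values; the error is at most about step / 2."""
    return [c * step for c in codes]
```

With regularly sampled data the deltas are small and repetitive, so they fit in fewer bits than raw timestamps, while the quantization step sets how much precision is traded for space.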

As a time series database used in real cloud environments, FluteDB encapsulates data connection management and data manipulation as a functional layer on top of the functions above. This layer is designed to conform to the classic structure of traditional distributed databases, which provides a guarantee of scalability for FluteDB.

In an evaluation of FluteDB in a large-scale cloud storage environment, its write efficiency, query latency and storage resource consumption significantly outperform existing methods. The system’s stable running time and disaster recovery effect can also be guaranteed.

The rest of the paper is structured as follows. Section 2 introduces existing databases and optimization methods for time series data. Section 3 presents FluteDB’s architecture and analyzes its design principles and implementation in detail. In Section 4, extensive experiments show that FluteDB achieves better performance than existing systems. Finally, Section 5 concludes the paper.

Section snippets

Related work

Because a large body of well-known research focuses on analyzing the characteristics of time series, the management of time series data has attracted continuous interest in the database field for decades [11], [15], [18], [21], [22]. At present, increasing attention is paid to how to store and query time series data efficiently and stably, because of its immense growth and the valuable information implicit in it.

In general, the previous TSDBs can be divided into

FluteDB architecture

In this section, we define time series data and analyze its numerical and operational characteristics. On this basis, we present FluteDB’s basic architecture and the design concepts and implementation of its enhanced sub-modules.

Evaluation

In this section, we analyze the performance of FluteDB via benchmarks and present measurements of our production deployment.

Conclusion

FluteDB is a novel in-memory TSDB for sensor-cloud which efficiently manages time series data by processing in-memory data rationally and exchanging data with disk in batches. To fully adapt to real time series application environments, FluteDB optimizes its indexing structure, query components, data compression and data encapsulation by considering the characteristics of time series. Furthermore, FluteDB is equipped with a complete fault tolerance and recovery strategy in order

Acknowledgments

This work is supported by China 973 Fundamental R&D Program (No. 2014CB340300), NSFC program (Nos. 61472022, 61421003, 61772151, 61502017), SKLSDE-2016ZX-11, and partly by the Beijing Advanced Innovation Center for Big Data and Brain Computing. We thank the anonymous reviewers for their careful reading of our manuscript and their many insightful comments and suggestions.

Chen Li is currently a Ph.D. student at the School of Computer Science and Engineering, Beihang University. His research interests include knowledge graphs, information retrieval, and data analysis and processing.

References (23)

L. Chen, L. Jianxin, S. Jinghui, Z. Yangyang, FluteDB: An Efficient and Dependable Time-Series Database Storage Engine,...
Christopher M.J. et al., The partitioned exponential file for database storage management, VLDB J. (2007)
Gregory L. et al., Reliability of series-parallel systems with random failure propagation time, IEEE Trans. Reliab. (2013)
Jacob Z. et al., A universal algorithm for sequential data compression, IEEE Trans. Inform. Theory (1977)
Kaushik C. et al., Locally adaptive dimensionality reduction for indexing large time series databases, ACM Trans. Database Syst. (2002)
Mario G. et al., Time series forecasting with genetic programming, Nat. Comput. (2017)
Mendel R. et al., The design and implementation of a log-structured file system, ACM Trans. Comput. Syst. (1992)
Michael M., Compressed bloom filters, IEEE/ACM Trans. Netw. (2002)
Mostafa A.B., Data compression in scientific and statistical databases, IEEE Trans. Softw. Eng. (1985)
New York City Taxi Trip Duration, 2017, https://www.kaggle.com/c/nyc-taxi-trip-duration/data. (Accessed 17 December...
OpenTSDB - A Distributed, Scalable Monitoring System, 2017, http://opentsdb.net/. (Accessed 17 December...

Bo Li is an assistant professor at the School of Computer Science and Engineering, Beihang University. He received the Ph.D. degree in Jan. 2012. He was a visiting scholar in the computer science department of the University of Edinburgh in 2014. His current research interests include virtualization, system reliability and data mining.

Md Zakirul Alam Bhuiyan is currently an assistant professor in the Department of Computer and Information Sciences at Fordham University. Previously, he worked as an assistant professor at Temple University. His research focuses on dependable cyber physical systems, WSN applications, big data, and cyber security.

Lihong Wang is a professor at the National Computer Network Emergency Response Technical Team/Coordination Center of China. Her current research interests include information security, cloud computing, big data mining and analysis, and information retrieval.

Jinghui Si is currently an undergraduate student at the School of Computer Science and Engineering, Beihang University. His research interests include data mining and natural language processing.

Guanyu Wei is currently an undergraduate student at the School of Computer Science and Engineering, Beihang University. His research interests include system reliability and distributed systems.

Jianxin Li is a professor at the School of Computer Science and Engineering, Beihang University, and a member of IEEE and ACM. He received the Ph.D. degree in Jan. 2008. He was a visiting scholar at the machine learning department of CMU in 2015, and a visiting researcher at MSRA in 2011. His current research interests include virtualization and cloud computing, and data analysis and processing.
