FluteDB: An efficient and scalable in-memory time series database for sensor-cloud
Introduction
With the emergence and spread of the Internet of Things (IoT), Wireless Sensor Networks (WSN), Smart Cities (SC) and other Internet hot spots, massive time series data is generated continuously and awaits further processing [15], [18], [21], [22]. Many mature mining algorithms have achieved satisfactory results in practical fields, e.g. Real-Time Vehicle Traffic (RTVT) and Smart Grid (SG), by analyzing and extracting specific features from such data. Because these downstream algorithms depend on it, how to store and query time series data efficiently and stably has attracted wide attention, especially in cloud services [6], [13], [23].
Due to the increasing demand for source data and real-time computation in cloud services, multimedia data sampling devices are now widely available and their sampling intervals have shortened considerably, which directly leads to vast growth in the scale of time series data. In addition, providing highly reliable and available time series services requires preparation for fault tolerance and data recovery. Some existing systems specialized for time series, collectively referred to as Time Series Databases (TSDBs), have achieved satisfactory results in real cloud environments [11], [15], [18], [21], [22]. Some of them are modified versions of classic databases, and the rest are designed specifically for time series. To further meet the demands of time series workloads and improve the efficiency, reliability and availability of each sub-module and of the entire system, we next specify several detailed and strict constraints.
Write dominance. The primary requirement for a TSDB is to keep its services stable at an ultra-high write rate [15]. In real cloud applications, time series services routinely handle millions of write requests per second. Existing databases generally optimize their indexing strategy and data persistence, and aggregate the performance of a cluster, to satisfy this demand. To further enhance write performance, it is necessary to optimize these strategies and explore more efficient ones.
Query management. Because most downstream consumers of query operations are periodic monitoring or management systems, the query rate is usually a couple of orders of magnitude lower than the write rate [5], [21]. However, it is still very difficult to query such large-scale time series data efficiently. Existing methods attempt to solve this problem by using different storage media and indexing structures for cold and hot data, which brings common side effects (e.g. data redundancy in memory and excessive latency for worst-case queries).
Resource control. Although hardware prices are generally declining, the efficiency with which a cloud environment uses resources (including memory and disk) is still an important indicator for evaluating storage services [2], [4], [16]. The scale of time series data keeps growing. Facebook, for example, produces about 10T of logs, texts and other streaming data daily. Storing such massive data directly is bound to consume a lot of storage resources. Once version control, single-writer, append-only and redundancy strategies are taken into account, the resource utilization of the entire system becomes much lower. This directly reduces the retention time of persistent historical data (cyclic writer) as well as the amount of data that can be cached in memory. Therefore, a set of specialized data compression algorithms is essential, even at the expense of an acceptable partial loss of data precision.
High reliability and availability. Reliability and availability describe, respectively, the ability to provide cloud services normally and the ability to tolerate faults [15], [22]. Existing cluster management approaches can continue to provide services in the face of partial node failure, but recovering quickly from a disaster or fault and resuming service is the key to ensuring the reliability of time series cloud services.
This paper proposes FluteDB (as an extension of [1]), a novel time series database for sensor-cloud (as shown in Fig. 1), which aims to satisfy the constraints above and provide efficient, scalable and stable cloud services. It enhances all of the database's sub-modules based on an aggregate analysis of time series data and its relevant operations. FluteDB also readjusts the communication and data exchange modes among memory, hard drives and other resources to keep the database running more efficiently. All of FluteDB's design decisions take linear scalability fully into account to ensure that it can scale up as needed. Specifically, it introduces the following novel methods.
Since time series services consist mainly of a vast majority of write operations plus a few point queries and region queries, a key insight behind FluteDB is to bias the overall architecture toward optimizing write operations, that is, to gain higher write performance at the expense of some query performance or an acceptable loss of data precision [18], [21]. To achieve this goal, FluteDB reorganizes the indexing structure and further optimizes the configuration and use of different types of resources. Specifically, building on the temporal characteristics of time series, FluteDB proposes a new TTSM tree algorithm based on the Log-Structured Merge Tree (LSM Tree) [7], [12], [14]. The TTSM tree divides the index into two parts based on the distribution of data values. The hot data (newly inserted) is stored in a special tree structure in memory, while the cold data is stored in a specific B* tree structure on physical disks. Periodically, part of the in-memory index is persisted to disk. Since time series data is strictly ordered, FluteDB designs a more flexible triggering mechanism that takes the complexity of tree structure merging into account when deciding when to perform these operations. Moreover, FluteDB also enhances the corresponding storage structures for different storage media to meet different operational requirements.
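The hot/cold split described above can be illustrated with a minimal sketch. It is not the TTSM tree itself: the class name `TwoTierIndex`, the size-based `flush_threshold` trigger, and the use of sorted Python lists in place of the in-memory tree and the on-disk B* tree are all assumptions made for illustration.

```python
import bisect


class TwoTierIndex:
    """Toy LSM-style index: one in-memory buffer plus immutable sorted segments.

    Hypothetical stand-in for a hot/cold index; sorted lists replace real trees.
    """

    def __init__(self, flush_threshold=4):
        self.flush_threshold = flush_threshold  # assumed trigger: buffer size
        self.hot = []        # (timestamp, value) pairs, sorted by timestamp
        self.segments = []   # flushed runs, each standing in for on-disk data

    def write(self, timestamp, value):
        # Points mostly arrive in time order, so the common case is an append.
        if not self.hot or timestamp >= self.hot[-1][0]:
            self.hot.append((timestamp, value))
        else:
            keys = [t for t, _ in self.hot]
            self.hot.insert(bisect.bisect_right(keys, timestamp), (timestamp, value))
        if len(self.hot) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the hot buffer as one immutable segment, then start afresh.
        if self.hot:
            self.segments.append(self.hot)
            self.hot = []

    def range_query(self, start, end):
        # Return all points with start <= timestamp <= end, oldest first.
        result = []
        for run in self.segments + [self.hot]:
            keys = [t for t, _ in run]
            lo = bisect.bisect_left(keys, start)
            hi = bisect.bisect_right(keys, end)
            result.extend(run[lo:hi])
        return sorted(result, key=lambda point: point[0])


index = TwoTierIndex(flush_threshold=3)
for t in range(1, 8):
    index.write(t, t * 10)
recent = index.range_query(3, 7)
```

In this sketch a write never touches the flushed segments, which is the property that keeps ingestion cheap; the cost is that a query may have to visit several runs.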
Although query operations account for a small percentage of the workload, the efficiency of the data services that upstream tasks rely on also needs to be guaranteed. Therefore, as long as the write rate is not affected, actual query demands can be met indirectly by optimizing the hit rate of both the cache and the in-memory index. FluteDB updates the original cache replacement algorithm and its implementation, based on time series characteristics, so that high-value data stays in memory as much as possible. The vast majority of query operations can then be handled in memory, and overall query efficiency improves significantly.
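To make the notion of cache hit rate concrete, the following sketch implements a plain least-recently-used (LRU) cache with hit and miss counters. LRU is used here only as a familiar baseline; it is not FluteDB's replacement policy, which this section does not specify, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Baseline LRU cache that also tracks its hit rate (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # least recently used entry comes first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = LRUCache(capacity=2)
cache.put("sensor-a", 1.0)
cache.put("sensor-b", 2.0)
cache.get("sensor-a")        # hit, so "sensor-a" becomes most recent
cache.put("sensor-c", 3.0)   # evicts "sensor-b"
```

A policy tuned for time series would replace the eviction rule in `put` with one that scores entries by their expected value to future queries.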
FluteDB also presents specific compression algorithms for different data types (Integer, Float, Double, String, etc.) to reduce resource consumption. To maximize the compression ratio while keeping compression and decompression complexity acceptable, we adopt a more flexible compression concept and coding format, and sacrifice only part of the data coding precision. In addition, FluteDB integrates a complete set of reliability strategies for all sub-modules and the entire system, enabling it to recover and continue to provide services in the face of power outages, network outages or other faults and disasters.
As a time series database used in real cloud environments, FluteDB encapsulates data connection management and data manipulation as a functional layer on top of the functions above. This layer is designed to conform to the classic structure of a traditional distributed database, which underpins FluteDB’s scalability.
Evaluated in a large-scale cloud storage environment, FluteDB significantly outperforms existing methods in write efficiency, query latency and storage resource consumption. The system’s stable running time and disaster recovery can also be guaranteed.
The rest of the paper is structured as follows. Section 2 introduces existing databases and optimization methods for time series data. Section 3 presents FluteDB’s architecture and analyzes its design principles and implementation in detail. In Section 4, extensive experiments show that FluteDB obtains better performance than existing systems. Finally, Section 5 concludes the paper.
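Two generic techniques illustrate the kind of type-specific, partly lossy compression described above: delta encoding for integer sequences such as timestamps, and fixed-step quantization for floating-point readings. Both are standard methods chosen here as assumptions; they are not FluteDB's actual algorithms, and the function names and the `step` parameter are hypothetical.

```python
def delta_encode(values):
    """Keep the first integer, then store each difference to its predecessor."""
    if not values:
        return []
    return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out = []
    total = 0
    for i, d in enumerate(deltas):
        total = d if i == 0 else total + d
        out.append(total)
    return out


def quantize(values, step):
    """Lossy: map each float to the nearest integer multiple of `step`."""
    return [round(v / step) for v in values]


def dequantize(codes, step):
    """Recover approximate floats; the error is at most about step / 2."""
    return [c * step for c in codes]


timestamps = [1000, 1010, 1020, 1030, 1045]
deltas = delta_encode(timestamps)       # small numbers need fewer bits
readings = [20.04, 20.11, 19.96]
codes = quantize(readings, step=0.1)    # precision traded for compactness
restored = dequantize(codes, step=0.1)
```

Regularly sampled timestamps yield mostly identical deltas, which a subsequent entropy or run-length coder can shrink further; the quantization step bounds how much precision is given up.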
Related work
Because a large body of well-known research analyzes the characteristics of time series, the database field has maintained a continuous interest in managing time series data for decades [11], [15], [18], [21], [22]. At present, increasing attention is paid to storing and querying time series data efficiently and stably, because of its immense growth and the valuable information implicit in it.
In general, the previous TSDBs can be divided into
FluteDB architecture
In this section, we define time series data and analyze its numerical and operating characteristics. On this basis, we present FluteDB’s basic architecture and the design concepts and realization of its enhanced sub-modules.
Evaluation
In this section, we analyze the performance of FluteDB via benchmarks and present measurements of our production deployment.
Conclusion
FluteDB is a novel in-memory TSDB for sensor-cloud which manages time series data efficiently by processing in-memory data rationally and exchanging data with disk in batches. To fully adapt to real time series application environments, FluteDB has optimized its indexing structure, query components, data compression and data encapsulation by considering the characteristics of time series. Furthermore, FluteDB is equipped with a complete fault tolerance and recovery strategy in order
Acknowledgments
This work is supported by China 973 Fundamental R&D Program (No. 2014CB340300), NSFC program (Nos. 61472022, 61421003, 61772151, 61502017), SKLSDE-2016ZX-11, and partly by the Beijing Advanced Innovation Center for Big Data and Brain Computing. We thank the anonymous reviewers for their careful reading of our manuscript and their many insightful comments and suggestions.
References (23)
- L. Chen, L. Jianxin, S. Jinghui, Z. Yangyang, FluteDB: An Efficient and Dependable Time-Series Database Storage Engine,...
- et al., The partitioned exponential file for database storage management, VLDB J. (2007)
- et al., Reliability of series-parallel systems with random failure propagation time, IEEE Trans. Reliab. (2013)
- et al., A universal algorithm for sequential data compression, IEEE Trans. Inform. Theory (1977)
- et al., Locally adaptive dimensionality reduction for indexing large time series databases, ACM Trans. Database Syst. (2002)
- et al., Time series forecasting with genetic programming, Nat. Comput. (2017)
- et al., The design and implementation of a log-structured file system, ACM Trans. Comput. Syst. (1992)
- Compressed bloom filters, IEEE/ACM Trans. Netw. (2002)
- Data compression in scientific and statistical databases, IEEE Trans. Softw. Eng. (1985)
- New York City Taxi Trip Duration, 2017, https://www.kaggle.com/c/nyc-taxi-trip-duration/data. (Accessed 17 December...
Chen Li is currently a Ph.D. student at the School of Computer Science and Engineering, Beihang University. His research interests include knowledge graphs, information retrieval, and data analysis and processing.
Bo Li is an assistant professor at the School of Computer Science and Engineering, Beihang University. He received the Ph.D. degree in Jan. 2012. He was a visiting scholar in the computer science department of the University of Edinburgh in 2014. His current research interests include virtualization, system reliability and data mining.
Md Zakirul Alam Bhuiyan is currently an assistant professor in the Department of Computer and Information Sciences at Fordham University. Previously, he worked as an assistant professor at Temple University. His research focuses on dependable cyber physical systems, WSN applications, big data, and cyber security.
Lihong Wang is a professor at the National Computer Network Emergency Response Technical Team/Coordination Center of China. Her current research interests include information security, cloud computing, big data mining and analysis, information retrieval and data mining.
Jinghui Si is currently an undergraduate student at the School of Computer Science and Engineering, Beihang University. His research interests include data mining and natural language processing.
Guanyu Wei is currently an undergraduate student at the School of Computer Science and Engineering, Beihang University. His research interests include system reliability and distributed systems.
Jianxin Li is a professor at the School of Computer Science and Engineering, Beihang University, and a member of IEEE and ACM. He received the Ph.D. degree in Jan. 2008. He was a visiting scholar at the machine learning department of CMU in 2015, and a visiting researcher at MSRA in 2011. His current research interests include virtualization and cloud computing, and data analysis and processing.