Time series data encoding in Apache IoTDB: comparative analysis and recommendation

Xia, Tianrui; Xiao, Jinzhao; Huang, Yuxiang; Hu, Changyu; Song, Shaoxu; Huang, Xiangdong; Wang, Jianmin

doi:10.1007/s00778-024-00840-5

Regular Paper
Published: 12 February 2024

The VLDB Journal, Volume 33, pages 727–752 (2024)

Shaoxu Song ORCID: orcid.org/0000-0002-9503-2755


Abstract

Both the wide range of applications and the distinct features of time series data have driven the rapid growth of time series database management systems such as Apache IoTDB, InfluxDB, and OpenTSDB. Almost all of these systems employ columnar storage with effective encoding of time series data. Because time series differ in their features, different encoding strategies may perform differently on them. In this study, we first summarize the features of time series data that may affect encoding performance, and we introduce the latest feature extraction results on these features. We then introduce the storage scheme of a typical time series database, Apache IoTDB, which sets the limits on implementing encoding algorithms in the system. A qualitative analysis of encoding effectiveness is presented for the studied algorithms. We also develop a benchmark for evaluating encoding algorithms, including a data generator and several real-world datasets, and present an extensive experimental evaluation. Notably, a quantitative analysis of encoding effectiveness with respect to data features is conducted in Apache IoTDB. Finally, we recommend the best encoding algorithm for different time series according to their data features. Machine learning models are trained for the recommendation and evaluated over real-world datasets.
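
The claim that data features affect which encoding works best can be illustrated with a small sketch. The code below is not the paper's benchmark or IoTDB's implementation; it is a minimal illustration, assuming two textbook integer encoders (run-length and delta with fixed-width bit-packing), a hypothetical `repeat_ratio` feature, and rough size estimates based on an assumed 32-bit value width.

```python
from typing import List, Tuple


def rle_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Run-length encode integers as (value, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Invert rle_encode."""
    out: List[int] = []
    for v, n in runs:
        out.extend([v] * n)
    return out


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: List[int]) -> List[int]:
    """Invert delta_encode by accumulating the differences."""
    out: List[int] = []
    total = 0
    for i, d in enumerate(deltas):
        total = d if i == 0 else total + d
        out.append(total)
    return out


def repeat_ratio(values: List[int]) -> float:
    """Hypothetical feature: fraction of points equal to their predecessor."""
    if len(values) < 2:
        return 0.0
    same = sum(1 for a, b in zip(values, values[1:]) if a == b)
    return same / (len(values) - 1)


def estimated_bits_rle(values: List[int], width: int = 32) -> int:
    """Rough size: one value and one run length per run (assumed widths)."""
    return len(rle_encode(values)) * 2 * width


def estimated_bits_delta(values: List[int], width: int = 32) -> int:
    """Rough size: first value, minimum delta, then fixed-width packed deltas."""
    tail = delta_encode(values)[1:]
    if not values:
        return 0
    if not tail:
        return width
    span = max(tail) - min(tail)
    bits_per_delta = max(1, span.bit_length())
    return 2 * width + bits_per_delta * len(tail)


if __name__ == "__main__":
    constant = [5] * 100        # highly repetitive series
    ramp = list(range(100))     # steadily increasing series
    for name, series in (("constant", constant), ("ramp", ramp)):
        print(name, repeat_ratio(series),
              estimated_bits_rle(series), estimated_bits_delta(series))
```

Under these assumptions, the repetitive series is estimated to be smaller with run-length encoding and the steadily increasing series with delta encoding, which is the kind of feature-dependent behavior a recommendation model would need to capture.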


Acknowledgements

This work is supported in part by the National Natural Science Foundation of China (62072265, 62021002, 62232005), the National Key Research and Development Plan (2021YFB3300500), and the Beijing National Research Center for Information Science and Technology (BNR2022RC01011). Shaoxu Song (https://sxsong.github.io/) is the corresponding author.

Author information

Authors and Affiliations

Tsinghua University, Beijing, China
Tianrui Xia, Jinzhao Xiao, Yuxiang Huang, Changyu Hu, Shaoxu Song, Xiangdong Huang & Jianmin Wang

Corresponding author

Correspondence to Shaoxu Song.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xia, T., Xiao, J., Huang, Y. et al. Time series data encoding in Apache IoTDB: comparative analysis and recommendation. The VLDB Journal 33, 727–752 (2024). https://doi.org/10.1007/s00778-024-00840-5

Received: 13 January 2023
Revised: 14 December 2023
Accepted: 07 January 2024
Published: 12 February 2024
Issue Date: May 2024
DOI: https://doi.org/10.1007/s00778-024-00840-5
