Time series data encoding in Apache IoTDB: comparative analysis and recommendation

  • Regular Paper
The VLDB Journal

Abstract

Both the wide range of applications and the distinct features of time series data have driven the rapid growth of time series database management systems, such as Apache IoTDB, InfluxDB, and OpenTSDB. Almost all of these systems employ columnar storage with effective encoding of time series data. Because time series data vary in their features, different encoding strategies may perform differently. In this study, we first summarize the features of time series data that may affect encoding performance, and introduce the latest feature extraction results for these features. Then, we introduce the storage scheme of a typical time series database, Apache IoTDB, which constrains how encoding algorithms can be implemented in the system. A qualitative analysis of encoding effectiveness is then presented for the studied algorithms. To evaluate encoding algorithms, we develop a benchmark that includes a data generator and several real-world datasets, and we present an extensive experimental evaluation. Notably, a quantitative analysis of encoding effectiveness with regard to data features is conducted in Apache IoTDB. Finally, we recommend the best encoding algorithm for different time series according to their data features. Machine learning models are trained for the recommendation and evaluated over real-world datasets.

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China (62072265, 62021002, 62232005), the National Key Research and Development Plan (2021YFB3300500), and the Beijing National Research Center for Information Science and Technology (BNR2022RC01011). Shaoxu Song (https://sxsong.github.io/) is the corresponding author.

Author information

Corresponding author

Correspondence to Shaoxu Song.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xia, T., Xiao, J., Huang, Y. et al. Time series data encoding in Apache IoTDB: comparative analysis and recommendation. The VLDB Journal 33, 727–752 (2024). https://doi.org/10.1007/s00778-024-00840-5

  • DOI: https://doi.org/10.1007/s00778-024-00840-5
