Abstract
Both the wide range of applications and the distinct features of time series data have driven the rapid growth of time series database management systems such as Apache IoTDB, InfluxDB, and OpenTSDB. Almost all of these systems employ columnar storage with effective encoding of time series data. Because time series differ in their features, different encoding strategies may perform differently on them. In this study, we first summarize the features of time series data that may affect encoding performance, and introduce the latest feature extraction results on these features. We then describe the storage scheme of a typical time series database, Apache IoTDB, which constrains how encoding algorithms can be implemented in the system. A qualitative analysis of encoding effectiveness is presented for the studied algorithms. We further develop a benchmark for evaluating encoding algorithms, consisting of a data generator and several real-world datasets, and present an extensive experimental evaluation. Notably, a quantitative analysis of encoding effectiveness with respect to data features is conducted in Apache IoTDB. Finally, we recommend the best encoding algorithm for each time series according to its data features. Machine learning models are trained for the recommendation and evaluated on real-world datasets.
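The claim that data features affect encoding performance can be illustrated with a minimal Python sketch. It is not the paper's benchmark and not Apache IoTDB code: the two features (`repeat_ratio`, `max_bit_width`), the toy series, and the choice of run-length and delta encoding are assumptions made only for illustration.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]


def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def repeat_ratio(values):
    """Hypothetical feature: fraction of points equal to their predecessor."""
    if len(values) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(values, values[1:]) if a == b)
    return repeats / (len(values) - 1)


def max_bit_width(values):
    """Hypothetical feature: bits needed for the largest magnitude (sign ignored)."""
    return max((abs(v).bit_length() for v in values), default=0)


if __name__ == "__main__":
    # Toy series (assumed): a status flag with long runs and a steady counter.
    status = [0] * 50 + [1] * 50
    counter = list(range(1000, 1100))

    for name, series in (("status", status), ("counter", counter)):
        runs = run_length_encode(series)
        deltas = delta_encode(series)
        print(name)
        print("  repeat ratio:          ", round(repeat_ratio(series), 3))
        print("  run-length pairs:      ", len(runs), "for", len(series), "points")
        print("  raw max bit width:     ", max_bit_width(series))
        print("  delta max bit width:   ", max_bit_width(deltas[1:]))
```

In this sketch, a series with a high repeat ratio collapses into few run-length pairs, while a steadily increasing series gains nothing from run-length encoding but yields small deltas that need fewer bits. This is the kind of feature-dependent behaviour a recommendation model could exploit.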
Acknowledgements
This work is supported in part by the National Natural Science Foundation of China (62072265, 62021002, 62232005), the National Key Research and Development Plan (2021YFB3300500), and the Beijing National Research Center for Information Science and Technology (BNR2022RC01011). Shaoxu Song (https://sxsong.github.io/) is the corresponding author.
About this article
Cite this article
Xia, T., Xiao, J., Huang, Y. et al. Time series data encoding in Apache IoTDB: comparative analysis and recommendation. The VLDB Journal 33, 727–752 (2024). https://doi.org/10.1007/s00778-024-00840-5
DOI: https://doi.org/10.1007/s00778-024-00840-5