DOI: 10.1145/3329785.3329924
research-article

Fast & Strong: The Case of Compressed String Dictionaries on Modern CPUs

Published: 01 July 2019

Abstract

String dictionaries constitute a large portion of the memory footprint of database applications. While strong string dictionary compression algorithms exist, they come with impractical access and compression times. Therefore, lightweight algorithms such as front coding are favored in practice. This paper endeavors to make strong string dictionary compression practical. We focus on Re-Pair Front Coding (RPFC), a grammar-based compression algorithm, since it consistently offers better compression ratios than other algorithms in the literature. To accelerate compression times, we propose block-based RPFC, which compresses small blocks of the dictionary independently. Moreover, to accelerate access times, we devise a vectorized access method, using Intel® Advanced Vector Extensions 512 (Intel® AVX-512), that is enabled by two specific changes we propose to RPFC. Our experimental evaluation shows that our proposed techniques accelerate compression and access times by up to 24x and 2.9x, respectively. These results move our modified RPFC into a practical range for use in database systems.
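
For readers unfamiliar with front coding, the lightweight baseline the abstract mentions, the following is a minimal sketch of the general technique and not the paper's implementation. It covers plain front coding only, without the Re-Pair step that RPFC adds. The function names and the `bucket_size` parameter are hypothetical and chosen for illustration. Each bucket stores its first string in full, and every later string as the length of the prefix it shares with its predecessor plus the remaining suffix.

```python
from typing import List, Tuple

# One bucket: the first string stored in full, then (shared_prefix_len, suffix) pairs.
Bucket = Tuple[str, List[Tuple[int, str]]]


def common_prefix_len(a: str, b: str) -> int:
    """Return the number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(sorted_strings: List[str], bucket_size: int = 4) -> List[Bucket]:
    """Front-code a lexicographically sorted list, one bucket at a time."""
    buckets: List[Bucket] = []
    for start in range(0, len(sorted_strings), bucket_size):
        block = sorted_strings[start:start + bucket_size]
        header = block[0]
        entries: List[Tuple[int, str]] = []
        prev = header
        for s in block[1:]:
            k = common_prefix_len(prev, s)
            entries.append((k, s[k:]))  # keep only what differs from the predecessor
            prev = s
        buckets.append((header, entries))
    return buckets


def front_access(buckets: List[Bucket], i: int, bucket_size: int = 4) -> str:
    """Rebuild the i-th string by replaying its bucket from the header."""
    header, entries = buckets[i // bucket_size]
    s = header
    for k, suffix in entries[: i % bucket_size]:
        s = s[:k] + suffix
    return s


words = sorted(["database", "databases", "dataset", "date",
                "dictionaries", "dictionary"])
buckets = front_encode(words)
decoded = [front_access(buckets, i) for i in range(len(words))]
```

Access only needs to decode within one bucket, so a smaller `bucket_size` trades compression ratio for faster lookups.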

Published In

DaMoN'19: Proceedings of the 15th International Workshop on Data Management on New Hardware
July 2019
150 pages
ISBN:9781450368018
DOI:10.1145/3329785

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGMOD/PODS '19

Acceptance Rates

Overall Acceptance Rate 94 of 127 submissions, 74%

Cited By

  • (2023) Engineering a Textbook Approach to Index Massive String Dictionaries. String Processing and Information Retrieval, pp. 203-217. DOI: 10.1007/978-3-031-43980-3_16. Online publication date: 20-Sep-2023
  • (2021) Adaptive Compression for Fast Scans on String Columns. Proceedings of the 2021 International Conference on Management of Data, pp. 554-562. DOI: 10.1145/3448016.3452798. Online publication date: 9-Jun-2021
  • (2020) FSST. Proceedings of the VLDB Endowment 13:12, pp. 2649-2661. DOI: 10.14778/3407790.3407851. Online publication date: 14-Sep-2020
  • (2020) PIDS. Proceedings of the VLDB Endowment 13:6, pp. 925-938. DOI: 10.14778/3380750.3380761. Online publication date: 11-Mar-2020
  • (2020) Accelerating re-pair compression using FPGAs. Proceedings of the 16th International Workshop on Data Management on New Hardware, pp. 1-8. DOI: 10.1145/3399666.3399931. Online publication date: 15-Jun-2020
  • (2020) Order-Preserving Key Compression for In-Memory Search Trees. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 1601-1615. DOI: 10.1145/3318464.3380583. Online publication date: 11-Jun-2020
  • (2020) Faster & strong: string dictionary compression using sampling and fast vectorized decompression. The VLDB Journal. DOI: 10.1007/s00778-020-00620-x. Online publication date: 20-Jul-2020
  • (2019) Rpair: Rescaling RePair with Rsync. String Processing and Information Retrieval, pp. 35-44. DOI: 10.1007/978-3-030-32686-9_3. Online publication date: 3-Oct-2019
  • (2019) Base64 encoding and decoding at almost the speed of a memory copy. Software: Practice and Experience 50:2, pp. 89-97. DOI: 10.1002/spe.2777. Online publication date: 26-Nov-2019
