DOI: 10.1145/2682862.2682866
research-article

Blended Dictionaries for Reduced-Memory Lempel-Ziv Corpus Compression

Published: 26 November 2014

Abstract

Relative Lempel-Ziv (RLZ) compression has been shown to be effective for compression of large text repositories. It provides high compression ratios with extremely fast atomic decompression of individual documents. However, it depends on a large in-memory dictionary, which is implemented as a contiguous string that must be accessed randomly during the decompression process. In this paper we explore how compressed suffix arrays might reduce the size of the dictionary. Because compressed suffix arrays drastically increase the cost of accessing individual characters, we propose splitting the dictionary: an uncompressed structure for frequently accessed dictionary elements, with compression for the remainder. Our results show that splitting provides a smoothly tuneable trade-off between access time and memory requirements, but does not overcome the inherent limitations of compressed suffix arrays for this application, with decompression time growing by a factor of 10 for even the best combination of parameters. Compressed suffix arrays are an attractive option where memory is limited, high compression is paramount, and decompression speed is unimportant.
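
The decoding path described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made for illustration: the class name `BlendedDictionary`, the `hot_size` and `block_size` parameters, the layout in which the uncompressed part is simply a prefix of the dictionary, and the convention that a factor of length 0 carries a literal byte in its position field. Most importantly, zlib-compressed blocks stand in for the compressed suffix array, which is a different and much slower-to-access structure; the sketch only shows how a decoder might route each factor to the fast or the slow part of a split dictionary.

```python
import zlib
from typing import List, Tuple

# (dictionary position, length); length 0 means "literal byte" (assumed convention)
Factor = Tuple[int, int]


class BlendedDictionary:
    """Hot prefix kept as raw bytes; cold remainder kept in compressed blocks.

    zlib is only a stand-in for a compressed suffix array, and the
    prefix/remainder layout is an assumption of this sketch.
    """

    def __init__(self, data: bytes, hot_size: int, block_size: int = 64) -> None:
        self.length = len(data)
        self.hot = data[:hot_size]
        self.block_size = block_size
        cold = data[hot_size:]
        self.cold_blocks = [
            zlib.compress(cold[i:i + block_size])
            for i in range(0, len(cold), block_size)
        ]

    def extract(self, pos: int, length: int) -> bytes:
        """Return data[pos:pos + length], decompressing cold blocks only if needed."""
        if pos < 0 or length < 0 or pos + length > self.length:
            raise IndexError("factor outside dictionary")
        end = pos + length
        hot_len = len(self.hot)
        out = bytearray(self.hot[pos:end])  # fast path: plain slice of the hot part
        cur = max(pos, hot_len)
        while cur < end:  # slow path: walk the compressed blocks
            block_idx, in_block = divmod(cur - hot_len, self.block_size)
            block = zlib.decompress(self.cold_blocks[block_idx])
            take = min(end - cur, len(block) - in_block)
            out += block[in_block:in_block + take]
            cur += take
        return bytes(out)


def rlz_decode(factors: List[Factor], dictionary: BlendedDictionary) -> bytes:
    """Rebuild one document from its (position, length) factors."""
    out = bytearray()
    for pos, length in factors:
        if length == 0:
            out.append(pos)  # literal: the position field holds the byte value
        else:
            out += dictionary.extract(pos, length)
    return bytes(out)


dictionary_text = b"the quick brown fox jumps over the lazy dog"
blended = BlendedDictionary(dictionary_text, hot_size=16, block_size=8)
factors = [(35, 5), (16, 3), (ord("!"), 0)]
print(rlz_decode(factors, blended))  # b'lazy fox!'
```

Raising `hot_size` moves more of the dictionary onto the fast path at the cost of memory, which mirrors the tuneable trade-off between access time and memory that the abstract reports.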

Published In

ADCS '14: Proceedings of the 19th Australasian Document Computing Symposium
November 2014
132 pages
ISBN:9781450330008
DOI:10.1145/2682862
In-Cooperation

  • RMIT University

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Repository compression
  2. dictionary compression
  3. document retrieval
  4. encoding

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ADCS '14: Australasian Document Computing Symposium
November 27-28, 2014
Melbourne, VIC, Australia

Acceptance Rates

Overall Acceptance Rate 30 of 57 submissions, 53%
