Reference Sequence Construction for Relative Compression of Genomes

Kuruppu, Shanika; Puglisi, Simon J.; Zobel, Justin

doi:10.1007/978-3-642-24583-1_41


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7024)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

Relative compression, where a set of similar strings is compressed with respect to a reference string, is an effective method of compressing DNA datasets containing multiple similar sequences. Moreover, it supports rapid random access to the underlying data. The main difficulty of relative compression is in selecting an appropriate reference sequence. In this paper, we explore using the dictionaries of repeats generated by the COMRAD, RE-PAIR and DNA-X algorithms as reference sequences for relative compression. We show that this technique yields better compression and allows more general repetitive datasets to be compressed using relative compression.
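
To make the idea of compressing a string with respect to a reference concrete, the following is a minimal sketch of a greedy relative Lempel-Ziv style parse. It is an illustration under stated assumptions, not the authors' implementation: the function names `rlz_factorize` and `rlz_decode`, the literal encoding `(char, 0)`, and the brute-force quadratic search are all hypothetical choices made for brevity (a practical system would presumably search the reference with an index such as a suffix array).

```python
def rlz_factorize(reference, target):
    """Greedily parse `target` into factors copied from `reference`.

    Each factor is (position, length), meaning "copy `length` characters
    starting at `position` in the reference". A character that does not
    occur in the reference is emitted as a literal (char, 0).
    Assumption: brute-force search, chosen for clarity rather than speed.
    """
    factors = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every start position in the reference and keep the longest match.
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(target)
                   and reference[start + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = start, length
        if best_len == 0:
            factors.append((target[i], 0))  # literal: no match in reference
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def rlz_decode(reference, factors):
    """Rebuild the target string from the reference and its factors."""
    parts = []
    for first, length in factors:
        if length == 0:
            parts.append(first)  # literal character
        else:
            parts.append(reference[first:first + length])
    return "".join(parts)


reference = "ACGTACGTTAGC"
target = "ACGTTAGCACGN"
factors = rlz_factorize(reference, target)
restored = rlz_decode(reference, factors)
```

In this toy example the target is described by three factors instead of twelve characters. The sketch also suggests why the choice of reference matters: the more of the target's repeated material the reference contains, the fewer and longer the factors become.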

This work was supported by the Royal Society and the NICTA Victorian Research Laboratory. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

Author information

Authors and Affiliations

National ICT Australia, Department of Computer Science & Software Engineering, University of Melbourne, Australia
Shanika Kuruppu & Justin Zobel
Department of Informatics, King’s College London, United Kingdom
Simon J. Puglisi

Editor information

Editors and Affiliations

Università di Pisa, Italy
Roberto Grossi
Consiglio Nazionale delle Ricerche, Area della Ricerca di Pisa, Istituto di Scienza e Tecnologia dell’Informazione “Alessandro Faedo”, Via Giuseppe Moruzzi 1, 56124, Pisa, Italy
Fabrizio Sebastiani & Fabrizio Silvestri

About this paper

Cite this paper

Kuruppu, S., Puglisi, S.J., Zobel, J. (2011). Reference Sequence Construction for Relative Compression of Genomes. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds) String Processing and Information Retrieval. SPIRE 2011. Lecture Notes in Computer Science, vol 7024. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24583-1_41

DOI: https://doi.org/10.1007/978-3-642-24583-1_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24582-4
Online ISBN: 978-3-642-24583-1
eBook Packages: Computer Science, Computer Science (R0)
