NRRC: A Non-referential Reads Compression Algorithm

Saha, Subrata; Rajasekaran, Sanguthevar

doi:10.1007/978-3-319-19048-8_25

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9096)

Included in the following conference series:

International Symposium on Bioinformatics Research and Applications

Abstract

In the era of modern sequencing technology, we are collecting vast amounts of biological sequence data. Storing, processing, and analyzing these data is not as cheap as generating them. As a result, the need for efficient data compression and data reduction techniques is growing by the day. Although a number of sophisticated general-purpose compression algorithms exist, they do not compress biological data efficiently, so specialized compression algorithms targeting biological data are needed. Five different NGS data compression problems have been identified and studied. In this article we propose a novel algorithm for one of these problems. We have conducted extensive experiments using real sequencing reads of various lengths. The results show that the proposed algorithm is indeed competitive and performs better than the best-known algorithms in the current literature.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Connecticut, Storrs, USA
Subrata Saha & Sanguthevar Rajasekaran

Corresponding author

Correspondence to Subrata Saha.

Editor information

Editors and Affiliations

Georgia State University, Atlanta, USA
Robert Harrison
Old Dominion University, Norfolk, USA
Yaohang Li
University of Connecticut, Storrs, Connecticut, USA
Ion Măndoiu

About this paper

Cite this paper

Saha, S., Rajasekaran, S. (2015). NRRC: A Non-referential Reads Compression Algorithm. In: Harrison, R., Li, Y., Măndoiu, I. (eds) Bioinformatics Research and Applications. ISBRA 2015. Lecture Notes in Computer Science, vol 9096. Springer, Cham. https://doi.org/10.1007/978-3-319-19048-8_25

DOI: https://doi.org/10.1007/978-3-319-19048-8_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19047-1
Online ISBN: 978-3-319-19048-8
eBook Packages: Computer Science; Computer Science (R0)
