Abstract
Modern sequencing technology produces vast amounts of biological sequence data. Storing, processing, and analyzing these data is not as cheap as generating them, so the need for efficient data compression and data reduction techniques grows by the day. Although a number of sophisticated general-purpose compression algorithms exist, they do not compress biological data efficiently, and specialized compression algorithms targeting biological data are needed. Five different NGS data compression problems have been identified and studied. In this article we propose a novel algorithm for one of these problems. We have run extensive experiments using real sequencing reads of various lengths. The results reveal that our proposed algorithm is indeed competitive and performs better than the best known algorithms in the current literature.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Saha, S., Rajasekaran, S. (2015). NRRC: A Non-referential Reads Compression Algorithm. In: Harrison, R., Li, Y., Măndoiu, I. (eds) Bioinformatics Research and Applications. ISBRA 2015. Lecture Notes in Computer Science, vol 9096. Springer, Cham. https://doi.org/10.1007/978-3-319-19048-8_25
DOI: https://doi.org/10.1007/978-3-319-19048-8_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19047-1
Online ISBN: 978-3-319-19048-8
eBook Packages: Computer Science (R0)