Abstract
A data boom may arrive in the near future, as cheaper sequencing methods make it possible for everyone to keep their own DNA on their own device or in a central medical cloud. With the development of sequencing methods, we are able to obtain the sequences of more and more species. However, the human genome takes about 3 GB for each person, and for other species it can be more.
The need for efficient compression of these data is growing, and general-purpose compressors cannot reach a satisfying result because they are not aware of the special structure of these data. Several existing algorithms have already tried to reach smaller and smaller compression rates. In this paper, we present our new method to accomplish this task.
Dr. Kiss was also with J. Selye University, Komárno, Slovakia.
Acknowledgment
The project was supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.3-VEKOP-16-2017-00002).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Lehotay-Kéry, P., Kiss, A. (2019). GenPress: A Novel Dictionary Based Method to Compress DNA Data of Various Species. In: Nguyen, N., Gaol, F., Hong, TP., Trawiński, B. (eds) Intelligent Information and Database Systems. ACIIDS 2019. Lecture Notes in Computer Science, vol 11432. Springer, Cham. https://doi.org/10.1007/978-3-030-14802-7_33
DOI: https://doi.org/10.1007/978-3-030-14802-7_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-14801-0
Online ISBN: 978-3-030-14802-7
eBook Packages: Computer Science, Computer Science (R0)