Abstract:
Next-generation sequencing (NGS) technology can sequence large amounts of DNA in a single run, producing a large number of short DNA reads. This makes transferring and processing the data difficult. There are two kinds of compression methods for high-throughput DNA data: reference-based and reference-free. Reference-free methods can compress DNA data from different species without storing a large reference genome. In this paper, we propose a reference-free algorithm, named HDC, that achieves high-throughput DNA compression based on Huffman coding and a dictionary method. The algorithm builds multiple dictionaries through Huffman coding and uses them for both compression and decompression. In tests on the genomes of human, green monkey, and horse, HDC's lowest compression rate is 0.192, obtained when compressing the human genome with the chromosome as the compression unit. We also compare HDC with a conventional compression algorithm, gzip, and two reference-free DNA compression algorithms, Leon and ORCOM. The results show that HDC performs significantly better than the other three algorithms.
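To make the general idea of a Huffman-coded dictionary concrete, the following is a minimal Python sketch. It is not the HDC implementation: the abstract does not specify how HDC builds its dictionaries, so the choice of fixed-length, non-overlapping k-mers as dictionary entries, the single dictionary, and the names `build_huffman_dictionary`, `encode`, and `decode` are all assumptions made for illustration.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_dictionary(reads, k=4):
    """Map each non-overlapping k-mer in the reads to a prefix-free bit string.

    Hypothetical helper: using k-mers as dictionary entries is an assumption,
    not something stated in the abstract.
    """
    freq = Counter(
        read[i:i + k]
        for read in reads
        for i in range(0, len(read) - k + 1, k)
    )
    if not freq:
        return {}
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(n, next(tie), {kmer: ""}) for kmer, n in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single symbol still needs a one-bit code.
        return {kmer: "0" for kmer in heap[0][2]}
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        # Merge the two least frequent subtrees, extending their codes.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]


def encode(read, codes, k=4):
    """Encode a read whose length is a multiple of k as a bit string."""
    if len(read) % k != 0:
        raise ValueError("read length must be a multiple of k in this sketch")
    return "".join(codes[read[i:i + k]] for i in range(0, len(read), k))


def decode(bits, codes):
    """Invert encode() by walking the bit string and matching code words."""
    inverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


if __name__ == "__main__":
    reads = ["ACGTACGTACGTTTTT", "ACGTGGGGACGTACGT"]
    codes = build_huffman_dictionary(reads, k=4)
    for read in reads:
        assert decode(encode(read, codes), codes) == read
```

Because frequent k-mers receive shorter code words, repetitive reads encode to fewer bits than a fixed-width encoding would need. A practical compressor would also have to store the dictionary alongside the data, pack the bits into bytes, and handle reads whose length is not a multiple of k, none of which this sketch attempts.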
Published in: 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI)
Date of Conference: 13-15 October 2018
Date Added to IEEE Xplore: 03 February 2019