Abstract
The advancement of next generation sequencing (NGS) and shotgun sequencing technologies has produced massive amounts of genomic data. Metagenomics, a powerful technique for studying the genetic material of uncultivable microorganisms recovered directly from their natural environment, must deal with high-throughput sequencing read data sets. Assembling, binning and aligning short reads in order to identify the microorganisms in a metagenomic sample is expensive and time-consuming, regardless of other restrictions. A DNA signature is a short nucleotide sequence fragment that distinguishes a species from all other species. It can serve as a basis for identifying microorganisms in both environmental and clinical samples directly from the short reads, without assembly or alignment. In this paper, we propose a scalable method that uses an optimization technique borrowed from database technology, namely bitmap indexes. They are used to speed up searching for and matching billions of DNA signatures in the short reads of thousands of different microorganisms, using commodity high-performance computing platforms such as Hadoop MapReduce, Hive and HBase.
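The idea of matching signatures against reads through a bitmap index can be illustrated with a minimal in-memory sketch. It is not the authors' implementation, which the abstract places on Hadoop MapReduce, Hive and HBase. The sketch assumes fixed-length signatures of length k, uses a Python integer as the bitmap for each k-mer (bit i is set when read i contains that k-mer), and works on made-up toy data. The function names, species labels and sequences are all hypothetical.

```python
from collections import defaultdict


def build_bitmap_index(reads, k):
    """Map each k-mer to an integer bitmap; bit i is set if reads[i] contains it."""
    index = defaultdict(int)
    for i, read in enumerate(reads):
        for j in range(len(read) - k + 1):
            index[read[j:j + k]] |= 1 << i
    return index


def species_hits(index, signatures):
    """OR the bitmaps of each species' signatures and count the matching reads."""
    hits = {}
    for species, sigs in signatures.items():
        bitmap = 0
        for sig in sigs:
            # .get avoids creating empty entries for signatures absent from the reads
            bitmap |= index.get(sig, 0)
        hits[species] = bin(bitmap).count("1")
    return hits


# Toy data, invented for illustration only.
K = 4
signatures = {"species_A": {"ACGT", "TTGA"}, "species_B": {"GGCC"}}
reads = ["AACGTA", "CCTTGA", "GGCCAT", "ATATAT"]

index = build_bitmap_index(reads, K)
print(species_hits(index, signatures))
```

Once the index is built, each signature lookup is a single dictionary access, and combining signatures is a bitwise OR. This is the property that makes bitmap indexes attractive when the number of signatures is very large.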
This work was performed while Ramin Karimi was visiting the LIAS/ISAE-ENSMA Lab. The visit was funded by the ERASMUS mobility program. The work was also supported in part by the projects TMOP-4.2.2.C-11/1/KONV-2012-0001 and TMOP 4.2.4. A/2-11-1-2012-0001, supported by the European Union and co-financed by the European Social Fund, and by the OTKA grant NK101680.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Karimi, R., Bellatreche, L., Girard, P., Boukorca, A., Hajdu, A. (2014). BINOS4DNA: Bitmap Indexes and NoSQL for Identifying Species with DNA Signatures through Metagenomics Samples. In: Bursa, M., Khuri, S., Renda, M.E. (eds) Information Technology in Bio- and Medical Informatics. ITBAM 2014. Lecture Notes in Computer Science, vol 8649. Springer, Cham. https://doi.org/10.1007/978-3-319-10265-8_1
DOI: https://doi.org/10.1007/978-3-319-10265-8_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10264-1
Online ISBN: 978-3-319-10265-8
eBook Packages: Computer Science (R0)