Application of Parallel Vector Space Model for Large-Scale DNA Sequence Analysis

Journal of Grid Computing

Abstract

With rapid advances in bioinformatics and computational biology, the volume of collected DNA data is growing exponentially, doubling every 18 months. Because DNA datasets are large and structurally complex, sequence analysis has become a computationally challenging problem in both fields, and fast algorithms capable of analyzing large-scale DNA sequences are now required. This paper presents a novel Parallel Vector Space Model (PVSM) approach that supports the analysis of large-scale DNA sequences by taking advantage of multi-core systems. The proposed approach is built on top of a modified Vector Space Model (VSM). PVSM is evaluated extensively on DNA sequences of varied sizes in terms of computational efficiency and accuracy, and its performance is compared with that of the sequential modified VSM. The sequential VSM is implemented on a single processor, whereas the proposed method is parallelized first on 4 processors and subsequently on 12 processors. The experimental results show that PVSM outperforms the sequential VSM, achieving approximately 2× speedup over the sequential approach without affecting accuracy. Moreover, the proposed PVSM is highly scalable as the number of processing cores increases and supports the analysis of large-scale DNA sequences.
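The abstract does not spell out how the modified VSM represents a sequence or how the work is divided among cores, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that terms are overlapping k-mers, that weights are raw k-mer counts, that similarity is the cosine of the two count vectors, and that parallelism comes from a Python process pool; the function names `kmer_vector`, `cosine`, `rank`, and `rank_parallel` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from math import sqrt


def kmer_vector(seq, k=3):
    """Term-frequency vector: counts of every overlapping k-mer in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(u, v):
    """Cosine similarity between two k-mer count vectors (0.0 if either is empty)."""
    # A Counter returns 0 for missing keys, so unmatched k-mers add nothing.
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def score(task):
    """Score one database sequence against the query; task = (query, seq, k)."""
    query, seq, k = task
    return cosine(kmer_vector(query, k), kmer_vector(seq, k))


def rank(query, database, k=3, mapper=map):
    """Return (index, score) pairs sorted by descending similarity.

    mapper is any map-like callable; the built-in map gives the sequential
    baseline, and a pool's map method distributes the scoring.
    """
    scores = list(mapper(score, [(query, seq, k) for seq in database]))
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


def rank_parallel(query, database, k=3, workers=4):
    """Same ranking as rank(), with sequences scored across worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return rank(query, database, k, mapper=pool.map)
```

Because each database sequence is scored independently, the sequential and parallel variants return the same ranking, which mirrors the abstract's claim that speedup is obtained without affecting accuracy. On platforms that start worker processes by spawning, `rank_parallel` should be called from under an `if __name__ == "__main__":` guard. The `workers` argument can be set to 4 or 12 to match the processor counts the abstract mentions, though the speedup actually observed will depend on sequence sizes and process start-up overhead.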

Author information

Corresponding author

Correspondence to Mukhtaj Khan.

About this article

Cite this article

Majid, A., Khan, M., Iqbal, N. et al. Application of Parallel Vector Space Model for Large-Scale DNA Sequence Analysis. J Grid Computing 17, 313–324 (2019). https://doi.org/10.1007/s10723-018-9451-5

  • DOI: https://doi.org/10.1007/s10723-018-9451-5
