Application of Parallel Vector Space Model for Large-Scale DNA Sequence Analysis

Majid, Abdul; Khan, Mukhtaj; Iqbal, Nadeem; Jan, Mian Ahmad; Khan, Mushtaq; Salman

doi:10.1007/s10723-018-9451-5

Published: 10 August 2018

Journal of Grid Computing, volume 17, pages 313–324 (2019)

Abdul Majid¹,
Mukhtaj Khan ORCID: orcid.org/0000-0002-4933-6192¹,
Nadeem Iqbal¹,
Mian Ahmad Jan¹,
Mushtaq Khan² &
Salman¹

Abstract

With rapid advances in bioinformatics and computational biology, the volume of collected DNA data is growing exponentially, doubling every 18 months. Because DNA datasets are large and structurally complex, analyzing DNA sequences has become a computationally challenging problem in both fields, and fast algorithms capable of analyzing large-scale DNA sequences are now required. This paper presents a novel Parallel Vector Space Model (PVSM) approach that supports the analysis of large-scale DNA sequences by taking advantage of multi-core systems. The proposed approach is built on top of a modified Vector Space Model (VSM). PVSM is evaluated extensively on DNA sequences of varied sizes in terms of computational efficiency and accuracy, and its performance is compared with that of the sequential modified VSM. The sequential VSM is implemented on a single processor, whereas the proposed method is parallelized first on 4 processors and subsequently on 12 processors. The experimental results show that PVSM outperforms the sequential VSM, achieving approximately 2× speedup over the sequential approach without affecting accuracy. Moreover, PVSM is highly scalable as the number of processing cores increases and supports the analysis of large-scale DNA sequences.
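To make the general idea concrete, the following is a minimal sketch of a vector space model applied to DNA sequences, not the authors' modified VSM or their PVSM implementation, whose details are not given in this abstract. It assumes that each sequence is represented as a vector of overlapping k-mer counts and that a query is ranked against a database by cosine similarity. The function names (`kmer_vector`, `cosine_similarity`, `score`, `rank_sequences`) and the choice of k = 3 are hypothetical, chosen only for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from math import sqrt


def kmer_vector(sequence, k=3):
    """Count overlapping k-mers; the counts play the role of term frequencies."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a sequence shorter than k has no k-mers
    return dot / (norm_a * norm_b)


def score(query_vector, k, sequence):
    """Similarity of one database sequence to the (pre-vectorized) query."""
    return cosine_similarity(query_vector, kmer_vector(sequence, k))


def rank_sequences(query, database, k=3, executor=None):
    """Return (index, score) pairs sorted by descending similarity.

    Each database sequence is scored independently, so the scoring step
    can be handed to any concurrent.futures executor (assumption: this
    mirrors the kind of data parallelism a multi-core VSM would exploit).
    """
    query_vector = kmer_vector(query, k)
    scorer = partial(score, query_vector, k)
    mapper = executor.map if executor is not None else map
    scores = list(mapper(scorer, database))
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
```

Passing no executor gives a sequential baseline; passing a `concurrent.futures.ProcessPoolExecutor` would spread the scoring over several cores, which is one plausible way to compare a sequential and a parallel run on the same data. Since both paths compute identical scores, the ranking (and hence accuracy) would be unchanged by parallelization in this sketch.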

Author information

Authors and Affiliations

Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
Abdul Majid, Mukhtaj Khan, Nadeem Iqbal, Mian Ahmad Jan & Salman
COMSATS Institute of Information Technology, Wah Cantt, Pakistan
Mushtaq Khan

Corresponding author

Correspondence to Mukhtaj Khan.

About this article

Cite this article

Majid, A., Khan, M., Iqbal, N. et al. Application of Parallel Vector Space Model for Large-Scale DNA Sequence Analysis. J Grid Computing 17, 313–324 (2019). https://doi.org/10.1007/s10723-018-9451-5

Received: 14 February 2018
Accepted: 16 July 2018
Published: 10 August 2018
Issue Date: 15 June 2019
DOI: https://doi.org/10.1007/s10723-018-9451-5
