Elsevier

Expert Systems with Applications

Volume 80, 1 September 2017, Pages 162-170

EPMA: Efficient pattern matching algorithm for DNA sequences

https://doi.org/10.1016/j.eswa.2017.03.026

Highlights

  • We present a brief introduction to the applications of pattern matching.

  • We present a novel pattern matching algorithm for DNA sequences.

  • We present multithreading in pattern matching.

  • We use a Turing machine for pattern matching.

  • We present comparative results with significant improvements.

Abstract

Bioinformatics is the use of computer technology to solve, manage and analyze biological problems. With rapid advances in computing, the volume of biological data has increased significantly, and with it the need to analyze these data in reasonable space and time. DNA sequences contain the basic information of species, and pattern matching between the sequences of different species is an important and challenging problem. The literature offers generalized string matching algorithms and some specialized DNA pattern matching algorithms. There is still a need for fast and space-efficient pattern matching algorithms that take new hardware developments into account. In this paper, we present a novel pattern matching algorithm for DNA sequences called EPMA. The proposed algorithm uses fixed-length 2-bit binary encoding, segmentation and multi-threading. The idea is to search for the pattern with multiple searcher agents concurrently. The proposed algorithm is validated with comparative experimental results, which show that it is a good candidate for DNA sequence pattern matching applications. The algorithm makes effective use of modern hardware and will help researchers in sequence alignment, short read error correction, phylogenetic inference, etc. Furthermore, the proposed method can be extended to generalized string matching and its applications.

Introduction

Pattern matching is the task of finding all occurrences of one or more patterns in a source text (Faro & Lecroq, 2013). It is one of the most challenging problems in computer science, with applications including intrusion detection systems (Hassan, 2005), operating systems, information retrieval, search engines (Somayajulu & DVLN, 2011), artificial intelligence (Almazroi, 2011), and image and signal processing (Klaib and Osborne, 2009, Li et al., 2008). Applications of string matching include library management systems, error detection and correction systems, text processing systems, speech and pattern recognition systems (Michailidis & Margaritis, 2002), bibliographic search systems, question-answer systems (Zubair, Wahab, Hussain, & Zaffar, 2010), and dictionaries and memorized data systems (Hassan, 2005). String matching is also used to analyze protein sequences and to match patterns in DNA sequences (Cao, 2004, Bhukya and Somayajulu, 2011), as well as in genome sequence compression and short read error correction (Sardaraz et al., 2016, Sardaraz et al., 2014, Tahir et al., 2015). It therefore plays a vital role in solving various problems in computer science (Faro and Lecroq, 2010, Hassan, 2005).

String matching is usually used to solve matching problems, i.e. to compare a pattern ‘p’ with a target text ‘t’. The first and simplest string matching algorithm is brute force, which preprocesses neither p nor t. Its time complexity is O(mn), where m and n are the lengths of p and t respectively. Later, many computationally improved variants of the brute force algorithm were developed, e.g. the Karp–Rabin algorithm (Karp & Rabin, 1987) and the Knuth–Morris–Pratt (KMP) algorithm (Knuth, Morris, & Pratt, 1977). String matching is divided into two main categories, exact string matching and approximate string matching. These are subdivided into five groups based on the approach used: classical, suffix automata, bit-parallel, hashing and hybrid algorithms (Faro & Lecroq, 2013). String matching plays an important role in computer science, bioinformatics and computational biology, in data analysis tasks such as feature extraction and disease and structural analysis. Biological scientists and practitioners are mostly interested in searching to identify proteins or genes that contain a given sequence pattern. Numerous algorithms have been developed so far to deal with specific challenges, but the volume of biological databases is also increasing at a rapid rate. Fast and efficient pattern matching algorithms are therefore required to cope with current and future challenges.
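The brute-force approach described above can be sketched in a few lines. This is an illustrative sketch only, not code from the paper; the function name `brute_force_search` is hypothetical. It tries every alignment of p against t without any preprocessing, which is where the O(mn) worst case comes from.

```python
from typing import List


def brute_force_search(pattern: str, text: str) -> List[int]:
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    matches = []
    # Try each of the n - m + 1 possible alignments.
    for i in range(n - m + 1):
        j = 0
        # Compare character by character; up to m comparisons per alignment.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(i)
    return matches


print(brute_force_search("ACA", "GACACAT"))  # [1, 3]
```

Overlapping occurrences are reported, since the scan advances by one position after every alignment.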

Recent developments in computing technology and the growing volume of biological data pose many challenges to researchers. Detailed studies of these can be found in Bucak and Uslan (2011) and Pehlivan and Orhan (2011). Approximate pattern matching involves crucial and complex issues and requires high-performance processing, whereas exact pattern matching algorithms improve search speed with minimal use of hardware and power (Özcan & Ünsal, 2015). In this paper, we focus on the computational complexity and memory efficiency of DNA sequence matching. We present a computationally intelligent and memory-efficient algorithm that uses binary encoding, multi-threading and searching techniques.

Section snippets

Motivations

Single-threaded applications face the issue that a lengthy process must complete before another process can begin. Designing and building such applications is simple because all operations are serialized, meaning that only one thread is in execution at a time. However, it is very useful to have multiple threads that run simultaneously based on timesharing (Kofahi & Abusalama, 2012), as a processor executes only one instruction at a time. Therefore, if a multithreaded application runs

Related work

DNA sequence matching is considered a special case of the general string matching problem. Currently, pattern matching in DNA sequences is a major and challenging research area in computational and molecular biology. Many algorithms have been developed to cope with pattern matching problems in DNA sequences. These algorithms are divided into three categories (Kim, Kim, & Park, 2007), i.e. prefix-, suffix- and factor-based approaches. KMP algorithm (Knuth et al., 1977), the Shift-Or (SO) algorithm

Algorithm

Let a pattern $P = p_1 p_2 p_3 \ldots p_m$ and a source $S = s_1 s_2 s_3 \ldots s_n$ be two DNA sequences over the alphabet $\Sigma = \{A, C, G, T\}$. The task is to find all occurrences of P in S. The proposed algorithm uses binary encoding, which reduces memory requirements. It splits the input string into multiple segments so that multi-threading can be used effectively. For efficient searching and matching, the proposed algorithm uses the alpha skip searching technique. This technique is very efficient in
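The combination of 2-bit encoding and segmented, multi-threaded search can be illustrated with a minimal sketch. It is not the authors' implementation. The bit assignment (A=00, C=01, G=10, T=11), the helper names `pack`, `base_at`, `search_segment` and `parallel_search`, and the use of Python's `ThreadPoolExecutor` are all assumptions made for illustration. A naive position-by-position comparison stands in for the alpha skip search, which is not reproduced here. In CPython, threads do not execute bytecode in parallel, so the sketch shows the structure of the approach rather than its speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Assumed 2-bit codes; the paper's actual assignment is not given in this excerpt.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte (2 bits each)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def base_at(packed: bytes, i: int) -> int:
    """Return the 2-bit code of the base at position i."""
    return (packed[i // 4] >> (2 * (i % 4))) & 0b11


def search_segment(packed: bytes, n: int, pat: List[int],
                   start: int, stop: int) -> List[int]:
    """Report matches whose start index lies in [start, stop).

    A worker may read up to m - 1 bases past `stop`, so occurrences
    that straddle a segment boundary are still found exactly once.
    """
    m = len(pat)
    hits = []
    for i in range(start, min(stop, n - m + 1)):
        if all(base_at(packed, i + j) == pat[j] for j in range(m)):
            hits.append(i)
    return hits


def parallel_search(pattern: str, source: str, workers: int = 4) -> List[int]:
    """Find all occurrences of pattern in source using several searcher threads."""
    n, m = len(source), len(pattern)
    if m == 0 or m > n:
        return []
    packed = pack(source)
    pat = [CODE[b] for b in pattern]
    last = n - m + 1              # number of candidate start positions
    step = -(-last // workers)    # ceiling division
    bounds = [(s, min(s + step, last)) for s in range(0, last, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: search_segment(packed, n, pat, *b), bounds)
        # map preserves segment order, so the result is already sorted.
        return [i for part in parts for i in part]


print(parallel_search("ACA", "GACACAT", workers=3))  # [1, 3]
```

Packing four bases per byte cuts the storage of the source to a quarter of one byte per base. The sketch raises `KeyError` on any symbol outside A, C, G, T (for example `N` or lowercase input), which a real tool would need to handle.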

Datasets

Our algorithm works with DNA sequences. In the experiments we use ten real DNA sequences obtained from the NCBI database, ranging in size from 585.2 kb to 222.6 Mb and differing in length in base pairs (bp), as summarized in Table 2. All datasets are in FASTA format; each sequence file describes the genome (accession number, NCBI reference number and species name) in a single line that starts with the ‘>’ symbol. For each data set, we ignore that line and the patterns were
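Skipping the FASTA header line, as described above, might look like the following sketch. The function name `read_fasta_sequence` and the example record (including its accession string) are hypothetical and only serve to show the idea; upper-casing the bases is an added assumption.

```python
from typing import Iterable


def read_fasta_sequence(lines: Iterable[str]) -> str:
    """Concatenate the sequence lines of a FASTA record, ignoring '>' headers."""
    parts = []
    for line in lines:
        line = line.strip()
        # Header lines start with '>' and carry metadata, not bases.
        if not line or line.startswith(">"):
            continue
        parts.append(line.upper())
    return "".join(parts)


# Hypothetical record, not one of the paper's datasets.
example = [">XX_000000.0 hypothetical example record", "ACGTAC", "GGTA"]
print(read_fasta_sequence(example))  # ACGTACGGTA
```

For a file on disk, the same function can be called with an open file object, since iterating over a text file yields its lines.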

Results and discussion

All experiments were run on an Intel(R) Core i3-3220 with two [email protected] GHz and 4 GB of memory, running a 32-bit Linux kernel (Ubuntu 11.10). A comprehensive experimental evaluation of exact string matching algorithms presented in Faro and Lecroq (2010) shows that EBOM, HASH3, SO, SBNDMq4 and SSEF perform well. Based on these observations, these algorithms were selected for comparative analysis. It should be noted that for fair comparison all the

Conclusion

Pattern matching is a common task not only in string processing but also in other fields where patterns need to be found, e.g. image processing, artificial intelligence and DNA processing. Research and development in computational biology aims to reduce the complexity of biological sequence data, increase sequencing capacity, raise read lengths and enlarge the field of applications. In spite of the efforts to design efficient and scalable string matching algorithms, there still

Authors’ contributions

Mr. Tahir has devised algorithmic solutions and all the co-authors have materially participated in implementation and preparation of the manuscript.

References (48)

  • R. Bhukya et al.

    Index based multiple pattern matching algorithm using DNA sequence and pattern count

    International Journal of Information Technology

    (2011)
  • R.S. Boyer et al.

    A fast string searching algorithm

    Communications of the ACM

    (1977)
  • I.Ö. Bucak et al.

    Sequence alignment from the perspective of stochastic optimization: A survey

    Turkish Journal of Electrical Engineering & Computer Sciences

    (2011)
  • F. Cao

    Fast string matching algorithm and its application in DNA sequence search

    (2004)
  • C. Charras et al.

    A very fast string matching algorithm for small alphabets and long patterns

    Combinatorial pattern matching

    (1998)
  • L. Chen et al.

    Compressed pattern matching in DNA sequences

  • H. Fan et al.

    Fast variants of the backward-oracle-marching algorithm

  • Faro, S., & Lecroq, T. (2010). The exact string matching problem: A comprehensive experimental evaluation. arXiv...
  • S. Faro et al.

    The exact online string matching problem: A review of the most recent results

    ACM Computing Surveys (CSUR)

    (2013)
  • F. Franek et al.

    A simple fast hybrid pattern-matching algorithm

    Combinatorial pattern matching

    (2005)
  • K. Fredriksson et al.

    Practical and optimal string matching

    String processing and information retrieval

    (2005)
  • A.A. Hassan

    Mixed heuristic algorithm for intelligent string matching for information retrieval

  • R.N. Horspool

    Practical fast searching in strings

    Software: Practice and Experience

    (1980)
  • R.M. Karp et al.

    Efficient randomized pattern-matching algorithms

    IBM Journal of Research and Development

    (1987)

Cited by (18)

    • A new approach in DNA sequence compression: Fast DNA sequence compression using parallel chaos game representation

      2019, Expert Systems with Applications
      Citation Excerpt :

      This method is lossless (Jung, Jang, & Cho, 2016). In an article by Tahir et al., the EPMA method was introduced to look for patterns in DNA sequences using multiple searcher agents, which can be an effective method in compression algorithms (Tahir, Sardaraz, & Ikram, 2017). The proposed algorithm is a hybrid algorithm which uses chaos game representation (CGR) and Huffman coding. The proposed algorithm has no need to search for patterns like existing ones because the chaos game representation provides all the needed patterns.

    • Analyzing DNA Pattern Matching through String Similarity Measurements in Cancer Sequence Data

      2023, International Conference on Sustainable Communication Networks and Application, ICSCNA 2023 - Proceedings
    • A survey on improving pattern matching algorithms for biological sequences

      2022, Concurrency and Computation: Practice and Experience