EPMA: Efficient pattern matching algorithm for DNA sequences
Introduction
Pattern matching aims to find all occurrences of one or more patterns in a source text (Faro & Lecroq, 2013). It is one of the most challenging problems in computer science applications, including intrusion detection systems (Hassan, 2005), operating systems, information retrieval, search engines (Somayajulu & DVLN, 2011), artificial intelligence (Almazroi, 2011), and image and signal processing (Klaib and Osborne, 2009; Li et al., 2008). Applications of string matching include library management systems, error detection and correction systems, text processing systems, speech and pattern recognition systems (Michailidis & Margaritis, 2002), bibliographic search systems, question-answer systems (Zubair, Wahab, Hussain, & Zaffar, 2010), and dictionaries and memorized data systems (Hassan, 2005). String matching is also used to analyze protein sequences and to match patterns in DNA sequences (Cao, 2004; Bhukya and Somayajulu, 2011), as well as in genome sequence compression and short read error correction (Sardaraz et al., 2016; Sardaraz et al., 2014; Tahir et al., 2015). It therefore plays a vital role in solving various problems in computer science (Faro and Lecroq, 2010; Hassan, 2005).
String matching is used to solve matching problems, i.e. to compare a pattern ‘p’ with a target text ‘t’. The first and simplest string matching algorithm is brute force, which preprocesses neither p nor t. Its time complexity is O(mn), where m and n are the lengths of p and t, respectively. Many computationally improved alternatives to the brute force algorithm have since been developed, e.g. the Karp–Rabin algorithm (Karp & Rabin, 1987) and the Knuth–Morris–Pratt (KMP) algorithm (Knuth, Morris, James, & Pratt, 1977). String matching is divided into two main categories, i.e. exact string matching and approximate string matching. These are subdivided into five groups based on the approach used: classical, suffix automata, bit-parallel, hashing and hybrid algorithms (Faro & Lecroq, 2013). String matching plays an important role in computer science, bioinformatics and computational biology, in data analysis tasks such as feature extraction and disease and structural analysis. Biological scientists and practitioners are mostly interested in searching to identify proteins or genes that contain a given sequence pattern. Although numerous algorithms have been developed to deal with specific challenges, biological database volumes are increasing at a rapid rate. Fast and efficient pattern matching algorithms are therefore required to cope with current and future challenges.
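The brute force baseline mentioned above can be sketched as follows. This is a minimal illustration and not code from the paper; the function name brute_force_search is hypothetical. It tries every alignment of the pattern against the text and compares character by character, which gives the O(mn) worst case.

```python
def brute_force_search(text: str, pattern: str) -> list[int]:
    """Return every start index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    hits = []
    # Try each possible alignment of the pattern against the text.
    for i in range(n - m + 1):
        j = 0
        # Compare left to right until a mismatch or the pattern is exhausted.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            hits.append(i)
    return hits


print(brute_force_search("GATTACAGATTACA", "TTA"))
```

No preprocessing of either string is performed, so every alignment may cost up to m comparisons.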
Recent developments in computational technology and the increase in the volume of biological data pose many challenges to researchers. A detailed study of these can be found in Bucak and Uslan (2011) and Pehlivan and Orhan (2011). Approximate pattern matching addresses crucial and complex problems and therefore requires high-performance computation, whereas exact pattern matching algorithms improve search speed with minimal use of hardware and power (Özcan & Ünsal, 2015). In this paper, we focus on the computational complexity and memory efficiency of DNA sequence matching. We present a computationally and memory-efficient algorithm that uses binary encoding, multi-threading and searching techniques.
Motivations
Single-threaded applications face the issue that a lengthy process must complete before any other process can begin. Such applications are simple to design and build because all operations are serialized, that is, only one thread executes at a time. However, it is very useful to have multiple threads that run concurrently through time-sharing (Kofahi & Abusalama, 2012), because a processor executes only one instruction at a time. Therefore, if a multithreaded application runs
Related work
DNA sequence matching is considered a special case of the general string matching problem. Currently, pattern matching in DNA sequences is a major and challenging research area in computational and molecular biology. Many algorithms have been developed to cope with pattern matching problems in DNA sequences. These algorithms are divided into three categories (Kim, Kim, & Park, 2007): prefix-, suffix- and factor-based approaches. The KMP algorithm (Knuth et al., 1977), the Shift-Or (SO) algorithm
Algorithm
Let pattern P = p1p2p3…pm and source S = s1s2s3…sn be two DNA sequences over the alphabet ∑ = {A, C, G, T}. The task is to find all occurrences of P in S. The proposed algorithm uses binary encoding, which reduces memory requirements. It splits the input string into multiple segments to make effective use of multi-threading. For efficient searching and matching, the proposed algorithm uses the alpha skip search technique. This technique is very efficient in
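The two ideas named in this section, binary encoding and splitting the source into segments for multi-threading, can be illustrated with a small sketch. It is not the authors' implementation. The 2-bit code table, the segment overlap of m - 1 characters, and all function names are assumptions made for illustration, and a simple rolling comparison of encoded windows stands in for the alpha skip search, which is not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed 2-bit code per base; the source only states that binary encoding is used.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(seq: str) -> int:
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def split_with_overlap(n: int, m: int, parts: int) -> list[tuple[int, int]]:
    """Split [0, n) into segments that overlap by m - 1 characters,
    so a match crossing a boundary is seen by exactly one segment."""
    size = max(1, -(-n // parts))  # ceiling division
    return [(s, min(n, s + size + m - 1)) for s in range(0, n, size)]


def search_segment(source: str, pattern: str, start: int, end: int) -> list[int]:
    """Slide a 2m-bit window over source[start:end] and compare it
    with the encoded pattern (a stand-in for alpha skip search)."""
    m = len(pattern)
    target = encode(pattern)
    mask = (1 << (2 * m)) - 1
    window, hits = 0, []
    for i in range(start, end):
        window = ((window << 2) | CODE[source[i]]) & mask
        if i - start + 1 >= m and window == target:
            hits.append(i - m + 1)
    return hits


def find_all(source: str, pattern: str, threads: int = 4) -> list[int]:
    """Search each segment in its own thread and merge the positions."""
    if not pattern:
        return []
    segments = split_with_overlap(len(source), len(pattern), threads)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(lambda seg: search_segment(source, pattern, *seg), segments)
        return sorted(p for hits in results for p in hits)


print(find_all("ACGTACGTAC", "ACG", threads=3))
```

Because each segment is extended by m - 1 characters, every match start falls in exactly one segment and no position is reported twice. In CPython the global interpreter lock prevents CPU-bound threads from running in parallel, so this sketch shows the structure of the approach rather than its speed-up.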
Datasets
Our algorithm works with DNA sequences. In the experiments we use ten real DNA sequences obtained from the NCBI database, ranging from 585.2 kb to 222.6 Mb in size and differing in length in base pairs (bp), as summarized in Table 2. All datasets are in FASTA format. Each sequence file describes the genome, i.e. its accession number, its reference number in the NCBI database and the name of the species, in a single line that starts with the ‘>’ symbol. For each data set, we ignore that line and the patterns were
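Skipping the FASTA header line described above can be sketched as follows. The function name read_fasta_sequence and the sample header are hypothetical, and the sketch assumes a file holding a single sequence.

```python
from typing import Iterable


def read_fasta_sequence(lines: Iterable[str]) -> str:
    """Join the sequence lines of a FASTA file, ignoring '>' header lines."""
    parts = []
    for line in lines:
        line = line.strip()
        # Header lines start with '>' and carry metadata, not bases.
        if not line or line.startswith(">"):
            continue
        parts.append(line.upper())
    return "".join(parts)


# Hypothetical record; real files would be opened with open(path).
sample = [">example_accession example species\n", "ACGT\n", "ttga\n"]
print(read_fasta_sequence(sample))
```

Passing an open file object works the same way, since iterating over a file yields its lines.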
Results and discussion
All experiments were run on an Intel(R) Core i3-3220 with two cores at 3.3 GHz and 4 GB of memory, running Linux (Ubuntu 11.10) with a 32-bit kernel. A comprehensive experimental evaluation of exact string matching algorithms presented in Faro and Lecroq (2010) shows that EBOM, HASH3, SO, SBNDMq4 and SSEF perform well. Based on these observations, these algorithms were selected for comparative analysis. It should be noted that for fair comparison all the
Conclusion
Pattern matching is a common task not only in string processing but also in other fields where patterns need to be found, e.g. image processing, artificial intelligence and DNA processing. Computational biology research and development aim to reduce the complexity of biological sequence data, increase sequencing capacity, raise read lengths and enlarge the field of applications. In spite of the efforts to design efficient and scalable string matching algorithms, there still
Authors’ contributions
Mr. Tahir has devised algorithmic solutions and all the co-authors have materially participated in implementation and preparation of the manuscript.
References (48)
- et al. Improving practical exact string matching. Information Processing Letters (2010)
- Shift-or string matching with super-alphabets. Information Processing Letters (2003)
- Fast exact string matching algorithms. Information Processing Letters (2007)
- et al. Practical and flexible pattern matching over Ziv–Lempel compressed text. Journal of Discrete Algorithms (2004)
- et al. SeqCompress: An algorithm for biological sequence compression. Genomics (2014)
- et al. Efficient experimental string matching by weak factor recognition. Combinatorial pattern matching (2001)
- A fast hybrid algorithm approach for the exact string matching problem via Berry Ravindran and alpha skip search algorithms. Journal of Computer Science (2011)
- et al. Efficient two-dimensional compressed matching
- et al. Let sleeping files lie: Pattern matching in Z-compressed files
- et al. A new approach to text searching. Communications of the ACM (1992)
- Index based multiple pattern matching algorithm using DNA sequence and pattern count. International Journal of Information Technology
- A fast string searching algorithm. Communications of the ACM
- Sequence alignment from the perspective of stochastic optimization: A survey. Turkish Journal of Electrical Engineering & Computer Sciences
- Fast string matching algorithm and its application in DNA sequence search
- A very fast string matching algorithm for small alphabets and long patterns. Combinatorial pattern matching
- Compressed pattern matching in DNA sequences
- Fast variants of the backward-oracle-marching algorithm
- The exact online string matching problem: A review of the most recent results. ACM Computing Surveys (CSUR)
- A simple fast hybrid pattern-matching algorithm. Combinatorial pattern matching
- Practical and optimal string matching. String processing and information retrieval
- Mixed heuristic algorithm for intelligent string matching for information retrieval
- Practical fast searching in strings. Software: Practice and Experience
- Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development